[Quantization] Add compressed-tensors NVFP4 support #18312
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR or enable auto-merge. 🚀
@robertgshaw2-redhat @mgoin can I get a ready label?
Review thread on vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py:
```python
is_float_type = (weight_quant.type == QuantizationType.FLOAT
                 and input_quant.type == QuantizationType.FLOAT.value)
is_4_bits = weight_quant.num_bits == 4 and input_quant.num_bits == 4
```
We should check for dynamic input as well
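For illustration, a minimal sketch of the detection with the dynamic-input check folded in. It assumes compressed-tensors' `QuantizationArgs` exposes a `dynamic` field (weights static, activations dynamic); the helper name is hypothetical, not the PR's final code:

```python
from compressed_tensors.quantization import QuantizationArgs, QuantizationType


def _is_fp4a4_nvfp4(weight_quant: QuantizationArgs,
                    input_quant: QuantizationArgs) -> bool:
    """Sketch: detect an NVFP4 w4a4 scheme, including dynamic activations."""
    is_float_type = (weight_quant.type == QuantizationType.FLOAT
                     and input_quant.type == QuantizationType.FLOAT)
    is_4_bits = weight_quant.num_bits == 4 and input_quant.num_bits == 4
    # Per the review comment: weights are quantized statically, while NVFP4
    # activations are quantized dynamically at runtime (`dynamic` may be a
    # bool or a string mode, so coerce to bool).
    is_dynamic = not weight_quant.dynamic and bool(input_quant.dynamic)
    return is_float_type and is_4_bits and is_dynamic
```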
I still want the default behavior to run w4a4 as w4a16 on hardware where we can support the Marlin kernel, so users have the same experience as FP8. If you want to keep the emulation pathway for evals, could we hide it behind an env var?
Yeah, I was thinking of combining the two schemes in a follow-up, but I can do it in this PR.
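For reference, such a gate could look roughly like the sketch below; the env var name, scheme class names, and import path are assumptions for illustration, not the final implementation:

```python
import os

# Sketch only: scheme class names and import path are assumptions.
from vllm.model_executor.layers.quantization.compressed_tensors.schemes import (
    CompressedTensorsW4A4Fp4, CompressedTensorsW4A16Fp4)

# The env var name is an assumption, not a settled interface.
USE_NVFP4_EMULATION = os.environ.get("VLLM_USE_NVFP4_CT_EMULATIONS",
                                     "0") == "1"


def choose_nvfp4_scheme(marlin_supported: bool):
    if marlin_supported and not USE_NVFP4_EMULATION:
        # Default: run w4a4 checkpoints as w4a16 via the Marlin kernel,
        # matching the FP8 user experience.
        return CompressedTensorsW4A16Fp4()
    # Emulation pathway, kept for evals behind the env var.
    return CompressedTensorsW4A4Fp4()
```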
Okay, LGTM to land emulation and coalesce with Marlin later.
Please merge from main to fix the CI failures.
Progress has gone fast enough that we can wait for the next compressed-tensors release next week. Converting to draft until then.
Summary
Note: config validation and the test models will need to be updated after the next compressed-tensors release, since ct 0.9.4 lacks the compressor and the new quantization strategy required for NVFP4 activations.
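For context, the quantization config such a checkpoint carries is expected to look roughly like the sketch below; the `tensor_group` strategy and `local` dynamic mode for activations are the pieces assumed to arrive in the next compressed-tensors release, so treat the exact field values as illustrative:

```python
# Rough shape of an NVFP4 quantization config as this PR expects to validate it:
# 4-bit float weights and activations, group size 16, with activations
# quantized dynamically at runtime.
nvfp4_config = {
    "weights": {
        "num_bits": 4,
        "type": "float",
        "strategy": "tensor_group",  # assumed new strategy name
        "group_size": 16,
        "symmetric": True,
        "dynamic": False,
    },
    "input_activations": {
        "num_bits": 4,
        "type": "float",
        "strategy": "tensor_group",  # assumed new strategy name
        "group_size": 16,
        "dynamic": "local",  # assumed dynamic mode for NVFP4 activations
    },
}
```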
Generations:
Output:
Testing