
[Performance]: Quantized Model Inference #17487

Open

Description

@sneha5gsm

Report of performance regression

Hello,
I recently quantized Llama 3-based 70B models using llm-compressor, with the following quantization recipes (a sketch of how each recipe is applied follows the list):

  1. INT8 GPTQ:
     recipe = [
         SmoothQuantModifier(smoothing_strength=0.8),
         GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], dampening_frac=0.1),
     ]
  2. INT8 PTQ:
     recipe = [
         SmoothQuantModifier(smoothing_strength=0.8),
         QuantizationModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
     ]
  3. FP8 GPTQ:
     recipe = [
         SmoothQuantModifier(smoothing_strength=0.8),
         GPTQModifier(targets="Linear", scheme="FP8", ignore=["lm_head"], dampening_frac=0.1),
     ]
  4. FP8 PTQ:
     recipe = [
         SmoothQuantModifier(smoothing_strength=0.8),
         QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"]),
     ]
  5. FP8 dynamic:
     recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
  6. INT4 PTQ:
     recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
  7. INT4 GPTQ:
     recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
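
Each recipe is applied with llm-compressor's oneshot entry point, roughly as in the sketch below (shown for recipe 1; the others only swap the recipe object). The base model name, calibration dataset, sample count, sequence length, and output path are placeholders rather than my exact settings, and the import paths may differ slightly between llm-compressor versions:

```python
# Sketch: apply recipe 1 (INT8 GPTQ) with llm-compressor's one-shot flow.
# Dataset, sequence length, sample count, and paths are placeholders.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot  # import path may vary by llm-compressor version

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], dampening_frac=0.1),
]

oneshot(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # base model (placeholder)
    dataset="open_platypus",                       # calibration dataset (placeholder)
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="/models/llama3-70b-int8-gptq",     # compressed-tensors checkpoint later served by vLLM
)
```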

I have run some performance tests, specifically with the benchmark_throughput.py and benchmark_serving.py scripts, and I see only a 5-22% throughput gain (on an A100 GPU), and only for the INT8 PTQ quantized models. For all the other models I see either a degradation or the same performance.
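
For reference, the offline throughput numbers correspond to runs equivalent to the stripped-down sketch below (the model path, prompt count, and sequence lengths are placeholders, not the exact values from my runs):

```python
# Minimal offline throughput check, analogous to what benchmark_throughput.py measures.
# Model path, prompt count, and lengths are placeholders.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama3-70b-int8-ptq",  # placeholder: compressed-tensors checkpoint
    tensor_parallel_size=8,
    max_model_len=4096,
)
prompts = ["Summarize the benefits of quantization."] * 256
params = SamplingParams(temperature=0.0, max_tokens=128, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{gen_tokens / elapsed:.1f} generated tokens/s over {elapsed:.1f}s")
```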

Questions:

  1. Why is the performance gain so small?
  2. Am I missing some configuration when serving the models (my current engine settings are sketched after this list)? For example, why does reducing max_model_len increase throughput?
  3. Does max_num_batched_tokens affect performance, and if so, how?
  4. Are certain performance-optimized kernels only triggered on certain GPU architectures?
  5. Why don't I see any performance improvement for the FP8-quantized models on an L40S GPU, even though this GPU supports FP8 computation?
  6. Why does decreasing tensor_parallel_size from 8 to 4 or 2 reduce throughput?
  7. As precision halves, the number of operations the corresponding CUDA tensor cores can perform per second doubles, yet these gains are not reflected for the quantized models. Why?
  8. If there are no significant performance gains, why would someone choose to quantize models?
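
For questions 2, 3, and 6, these are the engine settings I am varying; the values shown are illustrative rather than the exact ones from each run:

```python
# Engine knobs referenced in questions 2, 3, and 6 (illustrative values only).
from vllm import LLM

llm = LLM(
    model="/models/llama3-70b-int8-gptq",  # placeholder: compressed-tensors checkpoint
    tensor_parallel_size=8,       # question 6: dropping this to 4 or 2 reduces throughput
    max_model_len=8192,           # question 2: lowering this increases throughput
    max_num_batched_tokens=8192,  # question 3: unclear how this interacts with throughput
    gpu_memory_utilization=0.90,
)
```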

Note: vLLM version used - v0.7.3

Metadata

Assignees: no one assigned
Labels: performance (Performance-related issues)
