Report of performance regression
Hello
I recently quantized Llama-3-based 70B models with llm-compressor, using the following quantization recipes (each applied via the oneshot flow sketched after the list):
- INT8 GPTQ: `recipe = [SmoothQuantModifier(smoothing_strength=0.8), GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], dampening_frac=0.1)]`
- INT8 PTQ: `recipe = [SmoothQuantModifier(smoothing_strength=0.8), QuantizationModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])]`
- FP8 GPTQ: `recipe = [SmoothQuantModifier(smoothing_strength=0.8), GPTQModifier(targets="Linear", scheme="FP8", ignore=["lm_head"], dampening_frac=0.1)]`
- FP8 PTQ: `recipe = [SmoothQuantModifier(smoothing_strength=0.8), QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])]`
- FP8 dynamic: `recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])`
- INT4 PTQ: `recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])`
- INT4 GPTQ: `recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])`
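For completeness, each recipe was applied with the standard llm-compressor oneshot flow, roughly as in the sketch below. The model id, output directory, calibration dataset, sequence length, and sample count are placeholders rather than the exact values I used, and import paths can differ slightly between llm-compressor versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder checkpoint
SAVE_DIR = "Meta-Llama-3-70B-Instruct-W8A8"        # placeholder output directory

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Example: the INT8 GPTQ recipe from the list above
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], dampening_frac=0.1),
]

# One-shot calibration + quantization; dataset and sample count are illustrative
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save in compressed-tensors format so vLLM can load the checkpoint directly
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```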
I have run performance tests with the benchmark_throughput.py and benchmark_serving.py scripts, and I see only a 5-22% throughput gain (on an A100 GPU), and only for the INT8 PTQ quantized models. For all the other models I see the same or degraded performance.
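For reference, the benchmarks were invoked roughly as follows; the checkpoint path, dataset, and prompt counts are placeholders rather than the exact values from my runs:

```bash
# Offline throughput with synthetic prompts
python benchmarks/benchmark_throughput.py \
    --model ./Meta-Llama-3-70B-Instruct-W8A8 \
    --input-len 1024 --output-len 128 \
    --num-prompts 500 \
    --tensor-parallel-size 8

# Online serving: start a server, then drive it with benchmark_serving.py
vllm serve ./Meta-Llama-3-70B-Instruct-W8A8 --tensor-parallel-size 8

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model ./Meta-Llama-3-70B-Instruct-W8A8 \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 500
```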
Questions:
- Why is the performance gain so small?
- Am I missing some configuration when serving the models? For example, reducing max_model_len increases throughput; why is that? (The serving setup I am referring to is sketched after this list.)
- Does max_num_batched_tokens affect performance, and if so, how?
- Are certain performance-optimized kernels only triggered on certain GPU architectures?
- I don't see any performance improvement for the FP8-quantized models on an L40S GPU (which supports FP8 computation); why?
- Decreasing tensor_parallel_size from 8 to 4 or 2 reduces throughput; why?
- As the precision halves, the peak throughput of the corresponding CUDA Tensor Core operations roughly doubles, yet these gains are not reflected in the quantized models; why?
- If there are no significant performance gains, why would someone choose to quantize models?
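To make the configuration questions concrete, this is roughly how the engine is set up for the offline runs (a minimal sketch; the same parameters are passed to `vllm serve` for the serving tests, and the values shown are illustrative, not recommendations):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="./Meta-Llama-3-70B-Instruct-W8A8",  # placeholder path to a quantized checkpoint
    tensor_parallel_size=8,         # number of GPUs the model is sharded across
    max_model_len=4096,             # max context length; bounds the KV cache needed per sequence
    max_num_batched_tokens=8192,    # token budget the scheduler may batch per engine step
    gpu_memory_utilization=0.9,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```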
Note: vLLM version used - v0.7.3