
[Performance]: Quantized Model Inference #17487

Open

Description

@sneha5gsm

Report of performance regression

Hello,
I recently quantized Llama 3-based 70B models using llm-compressor, with the following quantization recipes (a sketch of how each recipe is applied follows the list):

  1. INT8 GPTQ:
     recipe = [
         SmoothQuantModifier(smoothing_strength=0.8),
         GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], dampening_frac=0.1),
     ]
  2. INT8 PTQ:
     recipe = [
         SmoothQuantModifier(smoothing_strength=0.8),
         QuantizationModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
     ]
  3. FP8 GPTQ:
     recipe = [
         SmoothQuantModifier(smoothing_strength=0.8),
         GPTQModifier(targets="Linear", scheme="FP8", ignore=["lm_head"], dampening_frac=0.1),
     ]
  4. FP8 PTQ:
     recipe = [
         SmoothQuantModifier(smoothing_strength=0.8),
         QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"]),
     ]
  5. FP8 dynamic:
     recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
  6. INT4 PTQ:
     recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
  7. INT4 GPTQ:
     recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
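
Each recipe is applied with llm-compressor's oneshot entry point, roughly as in the sketch below (shown for recipe 1; the others only swap the recipe object). The base model name, calibration dataset, sample count, sequence length, and output path are placeholders rather than my exact settings, and the import paths may differ slightly between llm-compressor versions:

```python
# Sketch: apply recipe 1 (INT8 GPTQ) with llm-compressor's one-shot flow.
# Dataset, sequence length, sample count, and paths are placeholders.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot  # import path may vary by llm-compressor version

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], dampening_frac=0.1),
]

oneshot(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # base model (placeholder)
    dataset="open_platypus",                       # calibration dataset (placeholder)
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="/models/llama3-70b-int8-gptq",     # compressed-tensors checkpoint later served by vLLM
)
```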

I have run some performance tests, specifically with the benchmark_throughput.py and benchmark_serving.py scripts, and I see only a 5-22% throughput gain (on an A100 GPU), and only for the INT8 PTQ quantized models. For all the other models I see either a degradation or the same performance.
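
For reference, the offline throughput numbers correspond to runs equivalent to the stripped-down sketch below (the model path, prompt count, and sequence lengths are placeholders, not the exact values from my runs):

```python
# Minimal offline throughput check, analogous to what benchmark_throughput.py measures.
# Model path, prompt count, and lengths are placeholders.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama3-70b-int8-ptq",  # placeholder: compressed-tensors checkpoint
    tensor_parallel_size=8,
    max_model_len=4096,
)
prompts = ["Summarize the benefits of quantization."] * 256
params = SamplingParams(temperature=0.0, max_tokens=128, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{gen_tokens / elapsed:.1f} generated tokens/s over {elapsed:.1f}s")
```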

Questions:

  1. Why is the performance gain so small?
  2. Am I missing some configuration when serving the models (my current engine settings are sketched after this list)? For example, why does reducing max_model_len increase throughput?
  3. Does max_num_batched_tokens affect performance, and if so, how?
  4. Are certain performance-optimized kernels only triggered on certain GPU architectures?
  5. Why don't I see any performance improvement for the FP8-quantized models on an L40S GPU, even though this GPU supports FP8 computation?
  6. Why does decreasing tensor_parallel_size from 8 to 4 or 2 reduce throughput?
  7. As precision halves, the number of operations the corresponding CUDA tensor cores can perform per second doubles, yet these gains are not reflected for the quantized models. Why?
  8. If there are no significant performance gains, why would someone choose to quantize models?
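
For questions 2, 3, and 6, these are the engine settings I am varying; the values shown are illustrative rather than the exact ones from each run:

```python
# Engine knobs referenced in questions 2, 3, and 6 (illustrative values only).
from vllm import LLM

llm = LLM(
    model="/models/llama3-70b-int8-gptq",  # placeholder: compressed-tensors checkpoint
    tensor_parallel_size=8,       # question 6: dropping this to 4 or 2 reduces throughput
    max_model_len=8192,           # question 2: lowering this increases throughput
    max_num_batched_tokens=8192,  # question 3: unclear how this interacts with throughput
    gpu_memory_utilization=0.90,
)
```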

Note: vLLM version used - v0.7.3

Metadata

Assignees: no one assigned
Labels: performance (Performance-related issues)
