Description
Misc discussion on performance
vLLM command
VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-32B-AWQ --port 8000 --gpu-memory-utilization 0.90 --tensor-parallel-size 4 --disable-log-requests --quantization awq_marlin -O3
SGLang command
python -m sglang.launch_server --model-path Qwen/Qwen3-32B-AWQ --port 8000 --tensor-parallel-size 4 --quantization awq_marlin --dtype auto --enable-torch-compile --attention-backend flashinfer --show-time-cost --enable-metrics
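Both servers expose the same OpenAI-compatible API on port 8000, so a single request can confirm either one is healthy before benchmarking. A minimal sanity-check sketch, assuming the server is reachable on localhost and the openai Python client package is installed; the "EMPTY" api_key is a placeholder since neither server requires a real key here:

from openai import OpenAI

# Point the client at whichever server is currently running on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="Qwen/Qwen3-32B-AWQ",
    prompt="Hello",
    max_tokens=16,
)
print(resp.choices[0].text)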
Benchmarking command
vllm bench serve \
--model Qwen/Qwen3-32B-AWQ \
--num-prompts 50 \
--random-input-len 25000 \
--random-output-len 1024 \
--ignore-eos \
--request-rate inf
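With --request-rate inf, all 50 requests arrive as a single burst, so the reported TTFT is dominated by queueing behind roughly 1.25M tokens of prefill. The totals implied by the flags are plain arithmetic and match both result tables below:

# Load produced by the command above: 50 prompts, 25k input / 1k output each.
num_prompts, input_len, output_len = 50, 25_000, 1_024
print(num_prompts * input_len)   # 1250000 input tokens
print(num_prompts * output_len)  # 51200 generated tokens (--ignore-eos forces full length)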
vLLM results
============ Serving Benchmark Result ============
Successful requests: 50
Benchmark duration (s): 7303.93
Total input tokens: 1250000
Total generated tokens: 51200
Request throughput (req/s): 0.01
Output token throughput (tok/s): 7.01
Total Token throughput (tok/s): 178.15
---------------Time to First Token----------------
Mean TTFT (ms): 2945424.93
Median TTFT (ms): 3210790.97
P99 TTFT (ms): 6428720.18
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1013.04
Median TPOT (ms): 1013.40
P99 TPOT (ms): 1300.17
---------------Inter-token Latency----------------
Mean ITL (ms): 1013.04
Median ITL (ms): 901.30
P99 ITL (ms): 2331.21
==================================================
SGLang results
============ Serving Benchmark Result ============
Successful requests: 50
Benchmark duration (s): 1737.44
Total input tokens: 1250000
Total generated tokens: 51200
Request throughput (req/s): 0.03
Output token throughput (tok/s): 29.47
Total Token throughput (tok/s): 748.92
---------------Time to First Token----------------
Mean TTFT (ms): 815107.71
Median TTFT (ms): 827756.68
P99 TTFT (ms): 1662652.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 153.73
Median TPOT (ms): 154.43
P99 TPOT (ms): 374.27
---------------Inter-token Latency----------------
Mean ITL (ms): 153.75
Median ITL (ms): 46.95
P99 ITL (ms): 62.72
==================================================
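To put the gap in one place, here are the headline ratios recomputed from the two tables (plain arithmetic on the reported numbers, nothing assumed):

# SGLang vs. vLLM on the numbers reported above.
vllm_dur, sglang_dur = 7303.93, 1737.44
print(f"benchmark duration: {vllm_dur / sglang_dur:.1f}x faster on SGLang")  # ~4.2x
print(f"mean TPOT: {1013.04 / 153.73:.1f}x lower on SGLang")                 # ~6.6x
print(f"mean TTFT: {2945424.93 / 815107.71:.1f}x lower on SGLang")           # ~3.6x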
Am I missing an important argument in my vLLM command?
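For reference, the prefill-batching knobs I am aware of in vLLM that might matter at this input length (untested here, listed only as a guess at the missing piece) are --enable-chunked-prefill and --max-num-batched-tokens. A hypothetical variant of the command above would be:

VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-32B-AWQ --port 8000 --gpu-memory-utilization 0.90 --tensor-parallel-size 4 --disable-log-requests --quantization awq_marlin -O3 --enable-chunked-prefill --max-num-batched-tokens 8192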
Your current environment (if you think it is necessary)
Collecting environment information...
PyTorch version: 2.7.0+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A
OS: Amazon Linux 2023.6.20250303 (x86_64)
GCC version: (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.34
Python version: 3.11.11 (main, Mar 17 2025, 21:02:09) [Clang 20.1.0 ] (64-bit runtime)
Python platform: Linux-6.1.129-138.220.amzn2023.x86_64-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G
Nvidia driver version: 560.35.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R32
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 0
BogoMIPS: 5600.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 768 KiB (24 instances)
L1i cache: 768 KiB (24 instances)
L2 cache: 12 MiB (24 instances)
L3 cache: 96 MiB (6 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-47
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flake8==7.2.0
[pip3] numpy==2.2.5
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pynvml==12.0.0
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0
[pip3] torchao==0.10.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.51.1
[pip3] triton==3.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5.dev659+g12e6c0b41 (git sha: 12e6c0b41)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB PHB PHB 0-47 0 N/A
GPU1 PHB X PHB PHB 0-47 0 N/A
GPU2 PHB PHB X PHB 0-47 0 N/A
GPU3 PHB PHB PHB X 0-47 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
LD_LIBRARY_PATH=/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/opt/amazon/ofi-nccl/lib64:/usr/local/cuda-12.4/lib:/usr/local/cuda-12.4/lib64:/usr/local/cuda-12.4:/usr/local/cuda-12.4/targets/x86_64-linux/lib/:/usr/local/lib:/usr/lib:/lib:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/opt/amazon/ofi-nccl/lib64:/usr/local/cuda-12.4/lib:/usr/local/cuda-12.4/lib64:/usr/local/cuda-12.4:/usr/local/cuda-12.4/targets/x86_64-linux/lib/:/usr/local/lib:/usr/lib:/lib:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/opt/amazon/ofi-nccl/lib64:/usr/local/cuda-12.4/lib:/usr/local/cuda-12.4/lib64:/usr/local/cuda-12.4:/usr/local/cuda-12.4/targets/x86_64-linux/lib/:/usr/local/lib:/usr/lib:/lib
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
sglang version: 0.4.6.post2