We have been running performance benchmarks on MT-Bench so that end-to-end speedup and acceptance length (AL) are comparable with other setups and with academic papers. Thanks to @luyuzhe111 and others for the discussion and for helping measure the gaps!
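For context, a minimal sketch of how the two metrics above are typically computed when benchmarking speculative decoding. This is illustrative only, not vLLM code; all function and variable names are assumptions.

```python
# Hypothetical sketch (not vLLM internals): computing acceptance length (AL)
# and end-to-end speedup for a speculative-decoding benchmark run.

def acceptance_length(accepted_per_step):
    """Mean tokens emitted per verification step: each step yields the
    accepted draft tokens plus one bonus token from the target model."""
    total_tokens = sum(n + 1 for n in accepted_per_step)
    return total_tokens / len(accepted_per_step)

def e2e_speedup(baseline_latency_s, spec_latency_s):
    """End-to-end speedup relative to vanilla autoregressive decoding."""
    return baseline_latency_s / spec_latency_s

if __name__ == "__main__":
    # e.g. 4 verification steps accepting 2, 3, 1, 3 draft tokens
    print(acceptance_length([2, 3, 1, 3]))  # 3.25
    print(e2e_speedup(10.0, 4.0))           # 2.5
```

Comparing AL alone can be misleading across setups, since wall-clock speedup also depends on the draft model's overhead; that is why both numbers are reported.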
Llama 3 8B
During model weight loading
- [V1][Spec Decode] Eagle Model loading #16035 (comment)
- [V1][Spec Decode] Eagle Model loading #16035 (comment)
During KV cache slot allocation
Llama 3.1 8B
- [V1][Spec Decode] KV cache slots for eagle heads #16370 (comment)
- EAGLE-1/3
- offline serving: [V1][Spec Decode] EAGLE-3 Support #16937 (comment)
- online serving: [Benchmark] Add single turn MTBench to Serving Bench #17202 (comment)
torch.compile & CUDA graph: