Description
I tested llama.cpp on two systems, one with 4x A100 (PCIe) GPUs and the other with 8x H100 GPUs connected via NVLink. The results show that inference throughput on the 8x H100 + NVLink system (21 tokens per second) is worse than on the 4x A100 PCIe system (31 tokens per second), which is very strange! Can anyone help explain this behavior? How can I improve the H100 numbers? Thanks
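
In case it helps with suggestions: I can rerun the comparison with llama.cpp's bundled `llama-bench` tool. Something along these lines is what I have in mind (the model path below is a placeholder, and these are just the flags I'd guess matter for multi-GPU, not a confirmed recipe):

```bash
# Placeholder model path; substitute the actual GGUF file used in the test.
MODEL=./models/model-q4_k_m.gguf

# Default layer split: whole layers are assigned to different GPUs.
./llama-bench -m "$MODEL" -ngl 99 -p 512 -n 128 -sm layer

# Row split: tensors are split across GPUs, which puts more traffic on the
# interconnect (NVLink on the H100 box, PCIe on the A100 box).
./llama-bench -m "$MODEL" -ngl 99 -p 512 -n 128 -sm row

# Limiting the visible GPUs can show whether the slowdown scales with GPU count.
CUDA_VISIBLE_DEVICES=0,1,2,3 ./llama-bench -m "$MODEL" -ngl 99 -p 512 -n 128 -sm layer
```

If there is a better way to isolate whether this is an interconnect issue or something else, I'm happy to run it and post the numbers.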

