Performance degradation with P40 on larger models #6814

I have a machine with a lot of old parts in it, including 8 P40s and 2 Xeon E5-2667v2 CPUs.

I build llama.cpp using:
cmake -DLLAMA_AVX2=off -DLLAMA_F16C=off -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on

Using a llama2-70b-Q8_0 model, I see good results with release b1842 and earlier. Starting with b1843 (January 12, the release with #4766), I see a ~62% drop in t/s:

bin/main -m ../text-generation-webui/models/Synthia-70b-v1.2.Q8_0.gguf -ngl 99 -p "Why is the sky blue?" -n 128

b1691: 10.76 t/s
b1767: 9.75 t/s
b1808: 9.76 t/s
b1832: 9.77 t/s
b1842: 9.76 t/s
b1843: 3.73 t/s
b2400: 3.83 t/s
b2709: 3.84 t/s
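
As a quick check of the ~62% figure, the change of each build relative to b1842 can be computed from the numbers above. This is only a small Python sketch; the `results` dictionary and the `pct_change` helper are names made up for illustration and are not part of llama.cpp.

```python
# Tokens per second for the 70b Q8_0 run, copied from the list above.
results = {
    "b1691": 10.76,
    "b1767": 9.75,
    "b1808": 9.76,
    "b1832": 9.77,
    "b1842": 9.76,
    "b1843": 3.73,
    "b2400": 3.83,
    "b2709": 3.84,
}


def pct_change(value, baseline):
    """Percent change of value relative to baseline (negative means slower)."""
    return (value - baseline) / baseline * 100.0


baseline = results["b1842"]  # last build before the drop
changes = {build: pct_change(tps, baseline) for build, tps in results.items()}

for build, change in changes.items():
    print(f"{build}: {results[build]:5.2f} t/s ({change:+.1f}% vs b1842)")
```

With these inputs, b1843 comes out at roughly -61.8% relative to b1842, and the later builds stay close to that level.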

Running the same test with other models, I see a much smaller discrepancy with smaller models, to the point that the 8B model is considerably faster with the latest release:

| Model | b1842 | b1843 | b2709 |
| --- | --- | --- | --- |
| Synthia-70b-v1.2.Q8_0 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
| phind-codellama-34b-v2.Q8_0 | 16.99 t/s | 7.54 t/s | 7.78 t/s |
| llama-2-13b-Q8_0 | 21.10 t/s | 17.67 t/s | 18.63 t/s |
| Meta-Llama-3-8B-Instruct.Q8_0 | 25.66 t/s | 33.27 t/s | 31.83 t/s |
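
To make the trend across model sizes easier to see, the table can be reduced to one ratio per model (b2709 throughput divided by b1842 throughput). This is a minimal Python sketch; the `table` dictionary and `ratio_vs_old` helper are illustrative names, and the numbers are copied from the table above.

```python
# (b1842, b1843, b2709) tokens per second, copied from the table above.
table = {
    "Synthia-70b-v1.2.Q8_0": (9.76, 3.73, 3.84),
    "phind-codellama-34b-v2.Q8_0": (16.99, 7.54, 7.78),
    "llama-2-13b-Q8_0": (21.10, 17.67, 18.63),
    "Meta-Llama-3-8B-Instruct.Q8_0": (25.66, 33.27, 31.83),
}


def ratio_vs_old(row):
    """Ratio of b2709 to b1842 throughput; below 1.0 means a regression."""
    old, _, latest = row
    return latest / old


ratios = {model: ratio_vs_old(row) for model, row in table.items()}

for model, r in ratios.items():
    status = "slower" if r < 1.0 else "faster"
    print(f"{model}: {r:.2f}x ({status})")
```

On these numbers the ratio rises from about 0.39x for the 70b model to about 1.24x for the 8B model.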

Using fewer GPUs for this test (with the 70b model) makes b1842 a bit slower, but otherwise doesn't seem to change the result much:

| GPUs | b1842 | b1843 | b2709 |
| --- | --- | --- | --- |
| 8 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
| 4 | 9.61 t/s | 3.77 t/s | 3.89 t/s |
| 3 | 8.32 t/s | 3.77 t/s | 3.91 t/s |

Changing the CPU thread count (with the 70b model) gives a slight improvement for each build, but does not close the gap between them:

| Threads | b1842 | b2709 |
| --- | --- | --- |
| -t 1 | 10.05 t/s | 3.90 t/s |
| -t 4 | 10.06 t/s | 3.90 t/s |
| -t 8 | 10.09 t/s | 3.90 t/s |

The system is similar in topology to a Supermicro SYS-4028GR-TR2. The GPUs are all attached via PCIe 3.0 x16 to PLX switches and have relatively good CPU and P2P bandwidth over PCIe: 11-13 GB/s between any pair.

Any ideas?
