Description
I have a machine with a lot of old parts in it, including 8 P40s and 2 Xeon E5-2667v2 CPUs.
I build llama.cpp using:
cmake -DLLAMA_AVX2=off -DLLAMA_F16C=off -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on
Using a llama2-70b-Q8_0 model, I see good results with release b1842 and earlier. Starting with b1843 (January 12), which includes #4766, generation speed drops by ~62%:
bin/main -m ../text-generation-webui/models/Synthia-70b-v1.2.Q8_0.gguf -ngl 99 -p "Why is the sky blue?" -n 128
b1691: 10.76 t/s
b1767: 9.75 t/s
b1808: 9.76 t/s
b1832: 9.77 t/s
b1842: 9.76 t/s
b1843: 3.73 t/s
b2400: 3.83 t/s
b2709: 3.84 t/s
Repeating the test with other models, I find the discrepancy is much smaller for smaller models, to the point that the 8B model is considerably faster with the latest release:
Model | b1842 | b1843 | b2709 |
---|---|---|---|
Synthia-70b-v1.2.Q8_0 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
phind-codellama-34b-v2.Q8_0 | 16.99 t/s | 7.54 t/s | 7.78 t/s |
llama-2-13b-Q8_0 | 21.10 t/s | 17.67 t/s | 18.63 t/s |
Meta-Llama-3-8B-Instruct.Q8_0 | 25.66 t/s | 33.27 t/s | 31.83 t/s |
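
To put the "~62%" figure and the per-model differences on the same footing, the percentage change relative to b1842 can be computed directly from the table. This is only a small arithmetic sketch; the `results` dictionary and the `pct_change` helper are names made up for illustration, and the numbers are copied from the table above.

```python
# Tokens per second copied from the table above: (b1842, b1843, b2709).
results = {
    "Synthia-70b-v1.2.Q8_0": (9.76, 3.73, 3.84),
    "phind-codellama-34b-v2.Q8_0": (16.99, 7.54, 7.78),
    "llama-2-13b-Q8_0": (21.10, 17.67, 18.63),
    "Meta-Llama-3-8B-Instruct.Q8_0": (25.66, 33.27, 31.83),
}


def pct_change(baseline, value):
    """Signed percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# For each model: (change b1842 -> b1843, change b1842 -> b2709).
changes = {
    model: (pct_change(b1842, b1843), pct_change(b1842, b2709))
    for model, (b1842, b1843, b2709) in results.items()
}

for model, (to_b1843, to_b2709) in changes.items():
    print(f"{model:32s} b1843: {to_b1843:+6.1f}%   b2709: {to_b2709:+6.1f}%")
```

Going from b1842 to b1843, this works out to roughly -62% for the 70B model, -56% for the 34B model, -16% for the 13B model, and +30% for the 8B model.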
Using fewer GPUs for this test (with the 70b model) makes b1842 a bit slower, but otherwise doesn't seem to change the result much:
GPUs | b1842 | b1843 | b2709 |
---|---|---|---|
8 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
4 | 9.61 t/s | 3.77 t/s | 3.89 t/s |
3 | 8.32 t/s | 3.77 t/s | 3.91 t/s |
Changing the CPU thread count (with the 70b model) gives a slight improvement for each build, but does not close the gap between them:
Threads | b1842 | b2709 |
---|---|---|
-t 1 | 10.05 t/s | 3.90 t/s |
-t 4 | 10.06 t/s | 3.90 t/s |
-t 8 | 10.09 t/s | 3.90 t/s |
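
For anyone who wants to reproduce the GPU-count and thread-count sweeps, here is a rough sketch of how the runs could be scripted. It assumes the GPU count is limited through the standard `CUDA_VISIBLE_DEVICES` environment variable, which is only one way to do it and not necessarily how the numbers above were produced. The `build_run` helper and the full 3x3 grid are hypothetical; the command-line flags are the ones from the command earlier in this report plus `-t`.

```python
import os
import shlex


def build_run(main_bin, model, n_gpus, threads,
              prompt="Why is the sky blue?", n_predict=128):
    """Return (argv, env) for one benchmark run; nothing is executed here."""
    env = dict(os.environ)
    # Assumption: expose only the first n_gpus CUDA devices to the process.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(n_gpus))
    argv = [
        main_bin, "-m", model, "-ngl", "99",
        "-t", str(threads), "-p", prompt, "-n", str(n_predict),
    ]
    return argv, env


MODEL = "../text-generation-webui/models/Synthia-70b-v1.2.Q8_0.gguf"

# Hypothetical grid: every GPU count crossed with every thread count.
runs = [
    build_run("bin/main", MODEL, n_gpus, threads)
    for n_gpus in (8, 4, 3)
    for threads in (1, 4, 8)
]

for argv, env in runs:
    # Print the equivalent shell command instead of running it.
    print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']}", shlex.join(argv))
    # To actually run it: subprocess.run(argv, env=env, check=True)
```

Running the same grid from a b1842 build directory and a b2709 build directory would give directly comparable numbers for each build.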
The system is similar in topology to a Supermicro SYS-4028GR-TR2. The GPUs are all attached via PCIe 3.0 x16 to PLX switches and have relatively good CPU and P2P bandwidth over PCIe: 11-13 GB/s between any pair.
Any ideas?