CPUSet support for Windows and Linux #6832

Open. Wants to merge 12 commits into base: master.

Conversation

@mann1x commented Apr 22, 2024

This patch is a WIP and very likely has bugs here and there, but it's already functional and seems to do what it's supposed to do.

This patch only supports Windows and is limited to processors with 4 to 64 logical cores.

Problems addressed:

  • Only uses physical cores
  • Filters out the E-Cores on Intel platforms
  • Sticks to the same Last Level cache (e.g. L3 for AMD dual-CCD processors)
  • Cores are selected based on their scheduler priority (default: worst to best cores)
  • Compute threads are only allocated on the selected cores
  • Disables Windows power management throttling (Power, Timer, Memory); see the sketch after this list
  • Always excludes Core 0
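
For reference, opting a process out of Windows power management throttling goes through SetProcessInformation with the ProcessPowerThrottling information class (memory priority uses the separate ProcessMemoryPriority class). Here is a minimal, hypothetical sketch of the power/timer part; it is illustrative, not necessarily the exact code in the patch:

```cpp
// Hypothetical sketch: opt the current process out of Windows power
// management throttling (EcoQoS execution speed and timer-resolution
// coalescing). Requires a recent Windows 10/11 SDK.
#include <windows.h>

static bool disable_power_throttling() {
    PROCESS_POWER_THROTTLING_STATE state = {};
    state.Version     = PROCESS_POWER_THROTTLING_CURRENT_VERSION;
    // Name the controls we want to manage explicitly...
    state.ControlMask = PROCESS_POWER_THROTTLING_EXECUTION_SPEED |
                        PROCESS_POWER_THROTTLING_IGNORE_TIMER_RESOLUTION;
    // ...and leave their bits cleared in StateMask to disable throttling.
    state.StateMask   = 0;
    return SetProcessInformation(GetCurrentProcess(), ProcessPowerThrottling,
                                 &state, sizeof(state)) != 0;
}
```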

The main goal is to limit unnecessary system load (e.g. above 6 cores there's no scaling on my 5950X: same performance as with 16 cores, and 8 cores on the 2nd CCD are 10% faster with 5 W less power consumption and half the system load).
At the same time, excluding Core 0 keeps the system responsive and the throughput from llama.cpp constant.
The speed increase with GPU offloading is minimal, about 1-2 t/s, but the system will be more responsive, especially with partial offloading.

Two command-line options, with matching context parameters, have been added (a sketch of the underlying Windows calls follows the list):

  • -bco: Best Core Order; set to 1 to invert the default order so cores are selected from best to worst
  • -llct: Last Level Cache Traversal; set to 1 to allow the core selection to traverse Last Level cache indexes
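
For context, the Windows side can lean on the CPU Sets API (GetSystemCpuSetInformation / SetProcessDefaultCpuSets, Windows 10+), which exposes exactly the fields the lists above need: CoreIndex for SMT siblings, EfficiencyClass for E-Cores, and LastLevelCacheIndex for the LLC. Below is a hypothetical sketch of that filtering, not the exact code in this PR; it omits the LLC check and the scheduler-priority ordering:

```cpp
// Hypothetical sketch: enumerate CPU sets, keep one logical CPU per
// physical core, drop E-Cores and (optionally) Core 0, then restrict the
// process to the survivors. Assumes entries arrive sorted by core.
#include <windows.h>
#include <vector>

static std::vector<ULONG> select_cpu_sets(bool allow_core0, bool allow_smt) {
    ULONG len = 0;
    GetSystemCpuSetInformation(nullptr, 0, &len, GetCurrentProcess(), 0);
    if (len == 0) {
        return {};
    }
    std::vector<char> buf(len);
    auto * info = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buf.data());
    if (!GetSystemCpuSetInformation(info, len, &len, GetCurrentProcess(), 0)) {
        return {};
    }

    // On hybrid Intel parts the P-cores report a higher EfficiencyClass
    // than the E-cores, so find the highest class present first.
    BYTE max_eff = 0;
    for (ULONG off = 0; off < len;) {
        auto * e = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buf.data() + off);
        if (e->CpuSet.EfficiencyClass > max_eff) {
            max_eff = e->CpuSet.EfficiencyClass;
        }
        off += e->Size;
    }

    std::vector<ULONG> ids;
    int last_core = -1;
    for (ULONG off = 0; off < len;) {
        auto * e = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buf.data() + off);
        off += e->Size;
        const auto & cs = e->CpuSet;
        if (!allow_core0 && cs.CoreIndex == 0)             continue; // skip Core 0
        if (cs.EfficiencyClass != max_eff)                 continue; // skip E-Cores
        if (!allow_smt && (int) cs.CoreIndex == last_core) continue; // 1 CPU per core
        last_core = cs.CoreIndex;
        ids.push_back(cs.Id);
    }
    if (!ids.empty()) {
        SetProcessDefaultCpuSets(GetCurrentProcess(), ids.data(), (ULONG) ids.size());
    }
    return ids;
}
```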

@mann1x (Author) commented Apr 24, 2024

Some fixes, plus new options:

  • -acz: Allow Core Zero; set to 1 to allow selection of Core 0
  • -atc: Allow Threaded Cores; set to 1 to allow selection of threaded, non-physical cores
  • -ccm: Custom Cpu Mask; sets a custom CPU affinity bitmask as an integer (see the sketch after this list)
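
For -ccm, the classic affinity API is the natural fit; a single pointer-sized mask is also what caps this path at 64 logical CPUs. A minimal, hypothetical sketch (not the exact patch code):

```cpp
// Hypothetical sketch: apply a user-supplied affinity bitmask to the
// current process. Each set bit selects one logical processor in the
// current processor group, which is what limits the mask to 64 CPUs.
#include <windows.h>
#include <cstdint>

static bool apply_custom_cpu_mask(uint64_t mask) {
    return SetProcessAffinityMask(GetCurrentProcess(), (DWORD_PTR) mask) != 0;
}
```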

@ggerganov
Is the hack of adjusting the n_threads argument during parsing acceptable?
Do you have any comments?

Comment on lines 440 to 462
extern "C"
NTSTATUS
NTAPI
NtQuerySystemInformationEx(
_In_ SYSTEM_INFORMATION_CLASS SystemInformationClass,
_In_reads_bytes_(InputBufferLength) PVOID InputBuffer,
_In_ ULONG InputBufferLength,
_Out_writes_bytes_opt_(SystemInformationLength) PVOID SystemInformation,
_In_ ULONG SystemInformationLength,
_Out_opt_ PULONG ReturnLength
);


extern "C"
NTSTATUS
NTAPI
NtQueryInformationProcess(
_In_ HANDLE ProcessHandle,
_In_ PROCESSINFOCLASS ProcessInformationClass,
_Out_writes_bytes_opt_(ProcessInformationLength) PVOID ProcessInformation,
_In_ ULONG ProcessInformationLength,
_Out_opt_ PULONG ReturnLength
);
Collaborator:

These forward declarations don't seem to be used, so you should remove them.

Author:

> These forward declarations don't seem to be used, so you should remove them.

Thanks for noticing; I was using them for something else that I removed later.

Right now I'm adding support for Linux.
I found out the existing implementation was bugged: there is a typo in the sysfs path, and the affinity for the process is never set.
I won't be able to support everything Windows does, but apart from last level cache traversal everything else should be fine.
Got a nice 10% t/s speed up with a 5600G on Debian.
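
As a rough, hedged sketch of the Linux mechanics (the sysfs paths are the standard topology entries; the helper names here are illustrative): physical cores can be identified from thread_siblings_list, and the affinity that was previously never set is applied with sched_setaffinity:

```cpp
// Hypothetical sketch of the Linux side: find the first SMT sibling of a
// CPU from sysfs (a CPU is "physical" if it is its own first sibling),
// then pin the process to the chosen CPUs.
#include <sched.h>
#include <cstdio>

// Returns the first CPU listed in topology/thread_siblings_list for `cpu`,
// or -1 on error.
static int first_sibling(int cpu) {
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu);
    FILE * f = fopen(path, "r");
    if (!f) {
        return -1;
    }
    int first = -1;
    if (fscanf(f, "%d", &first) != 1) {
        first = -1;
    }
    fclose(f);
    return first;
}

static int pin_to_cpus(const int * cpus, int n) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < n; i++) {
        CPU_SET(cpus[i], &set);
    }
    return sched_setaffinity(0, sizeof(set), &set); // pid 0 = current process
}
```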

@ggerganov (Member) commented:

> Do you have any comments?

This is quite a lot of code that I'm not familiar with - try to put it in separate common/cpuset.h+.cpp files with a very thin API in order to minimize the changes in common.cpp.

This seems targeted for Windows - would be interested in more feedback from Windows users

Do you expect any gains on Linux?

@mann1x (Author) commented Apr 25, 2024

> This is quite a lot of code that I'm not familiar with - try to put it in separate common/cpuset.h+.cpp files with a very thin API in order to minimize the changes in common.cpp.

I will try but I'm not really sure if I can do a good job. My knowledge is limited :)

> This seems targeted for Windows - would be interested in more feedback from Windows users

Yes, I started it on Windows because there was no automatic selection of the physical cores.
But the detection on Linux is bugged, so together with the fix I'm also porting the same CPUSet implementation.

It will be similar but not identical; there are some limitations I'm not yet sure I can overcome:

  • I don't know if I can get the same last level cache information as in Windows
  • Not sure yet if I can get the scheduler priority order like in Windows; I can get the CPPC tag for AMD processors, but on Intel it's often unused or not set properly

Otherwise, all the other features will be available: the custom core mask, skipping Core 0, and including the threaded cores.
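
For what it's worth, both pieces of information have standard sysfs locations on Linux, though coverage varies by CPU and kernel: the CPUs sharing the L3 (the usual LLC) are listed in cache/index3/shared_cpu_list, and on AMD the per-core CPPC tag is exposed in acpi_cppc/highest_perf. A hypothetical helper to read them:

```cpp
// Hypothetical sketch: read a per-CPU sysfs attribute into `out`.
// Both leaf paths in the usage note exist on typical kernels, but they
// are not guaranteed to be present on every system.
#include <cstdio>

static bool read_cpu_sysfs(int cpu, const char * leaf, char * out, int out_len) {
    char path[160];
    snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/%s", cpu, leaf);
    FILE * f = fopen(path, "r");
    if (!f) {
        return false;
    }
    const bool ok = fgets(out, out_len, f) != nullptr;
    fclose(f);
    return ok;
}

// Usage (illustrative):
//   char buf[128];
//   read_cpu_sysfs(0, "cache/index3/shared_cpu_list", buf, sizeof(buf)); // e.g. "0-7"
//   read_cpu_sysfs(0, "acpi_cppc/highest_perf",       buf, sizeof(buf)); // CPPC tag
```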

Feedback on Windows, and on Linux too once the next commit lands, would be really appreciated.

The patch also fixes an issue with the n_threads argument, which specifies the number of logical cores used to spawn the threads.
Specifying more than the actual number of logical cores makes llama.cpp spawn threads on non-existent cores and hang in an endless loop.
With this patch, n_threads is trimmed to the actual number of available logical threads.
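
A minimal sketch of that clamp, assuming the parsed n_threads value is passed in (the function name is illustrative):

```cpp
// Hypothetical sketch: clamp the requested thread count to the number of
// logical CPUs actually available, so threads are never pinned to
// non-existent cores.
#include <algorithm>
#include <thread>

static int clamp_n_threads(int requested) {
    const int avail = (int) std::thread::hardware_concurrency(); // may be 0 if unknown
    return avail > 0 ? std::min(requested, avail) : requested;
}
```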

> Do you expect any gains on Linux?

Same as for Windows, around 10%.

@mann1x (Author) commented Apr 25, 2024

@ggerganov
Added initial support for Linux: almost a 20% increase on my 5600G, from 22 t/s to 26 t/s.
I will clean up the redundancies and think about how to separate the changes.

github-actions bot commented Apr 25, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 437 iterations 🚀

Details (performance-related PR only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10785.47ms p(95)=28179.74ms fails=, finish reason: stop=386 truncated=51
  • Prompt processing (pp): avg=110.51tk/s p(95)=483.24tk/s
  • Token generation (tg): avg=26.35tk/s p(95)=37.12tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=mannix-win32-cpuset commit=063e201b020b8903f9467c00018b86e5a174b2cc

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 437 iterations)]

@mann1x mann1x changed the title CPUSet support for Windows CPUSet support for Windows and Linux Apr 26, 2024
@mann1x (Author) commented Apr 26, 2024

Looking for testers; I will open an issue to ask for help.

Found a limitation with Windows WSL: it reports the topology almost perfectly, except for the last level cache, but it doesn't honor the affinity at all. Even though the process and the threads accept the affinity and report it as set, they just run on random cores.
Does anyone have an idea?

@cpumaxx (Contributor) commented Apr 26, 2024

I am testing this branch. What flags would you say provide the best speedup on Linux?

@mann1x (Author) commented Apr 27, 2024

> I am testing this branch. What flags would you say provide the best speedup on Linux?

Ideally, there's no need to use any flag, including -t.
In most cases that will be the best, or almost the best, configuration.
There's no real all-around setting; it depends on your configuration and on what you are doing (especially whether the model is offloaded or not).

The default settings skip the first logical core, order the cores from worst to best, skip the non-physical (SMT) cores and the E-Cores on Intel, and avoid crossing to the second CCD on AMD processors.

The first thing you should notice when monitoring with htop is that the load is no longer spread randomly across all cores: it should sit only on the first half of the cores (Linux enumerates the SMT siblings in the second half, instead of pairing them as 0/1, 2/3 like Windows), excluding Core 0, and not on the E-Cores if they are present.

Last level cache awareness doesn't work on Linux yet, so on AMD all the cores will be used if a 2nd CCD is present.

You can test whether -t correctly uses only the requested number of CPUs, and whether the other options behave as expected, by monitoring with htop.

Compare with and without the patch and post the results if possible.


 /**
  * Returns number of CPUs on system that are useful for math.
  */
 int get_math_cpu_count() {
-#if defined(__x86_64__) && defined(__linux__)
+#if defined(__x86_164__) && defined(__linux__)
Collaborator:

Is this a typo?

Author:

It's definitely a typo.

Collaborator:

Correcting it will skip the #elif below that #if for x86-64 Linux (may or may not be intended).

@cpumaxx (Contributor) commented Apr 28, 2024

For my specific case, a very simple command line (model, seed, prompt, and token count) resulted in a tiny performance reduction of about 2%. Caches were dropped before each run.
I'm running dual Epyc Genoa with 64 cores/128 threads. It pinned all threads to the "first" 64 cores, but given the NUMA layout that probably wasn't ideal.
[screenshot: cpupinning-1]

node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 96461 MB
node 0 free: 38947 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 96729 MB
node 1 free: 82217 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 96763 MB
node 2 free: 79540 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 96763 MB
node 3 free: 81738 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 96763 MB
node 4 free: 82506 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 96763 MB
node 5 free: 82413 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 96763 MB
node 6 free: 82277 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 96717 MB
node 7 free: 82411 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  12  12  12  32  32  32  32 
  1:  12  10  12  12  32  32  32  32 
  2:  12  12  10  12  32  32  32  32 
  3:  12  12  12  10  32  32  32  32 
  4:  32  32  32  32  10  12  12  12 
  5:  32  32  32  32  12  10  12  12 
  6:  32  32  32  32  12  12  10  12 
  7:  32  32  32  32  12  12  12  10 

@mann1x (Author) commented Apr 28, 2024

> For my specific case, a very simple command line (model, seed, prompt, and token count) resulted in a tiny performance reduction of about 2%. Caches were dropped before each run.
> I'm running dual Epyc Genoa with 64 cores/128 threads. It pinned all threads to the "first" 64 cores, but given the NUMA layout that probably wasn't ideal.

That's really a lot of CPUs :)
Thanks for testing. Does the numa switch actually work?

Adding support for more than 64 CPUs is doable but a bit more complex; maybe I can add the NUMA selection if it works.
Do you know whether the allocated cores were all physical cores, or also the 2nd SMT threads?
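
On Linux the >64-CPU part is mostly free: a fixed cpu_set_t already covers up to 1024 CPUs, and CPU_ALLOC can go beyond that, so only the single-integer custom mask is really capped at 64. On Windows it would need the processor-group APIs (e.g. SetThreadGroupAffinity). A hypothetical Linux sketch with a dynamically sized set:

```cpp
// Hypothetical sketch: pin the current process to an arbitrary number of
// CPUs using a dynamically sized CPU set, with no fixed 64-CPU ceiling.
#include <sched.h>

static int pin_many(const int * cpus, int n, int max_cpu) {
    cpu_set_t * set = CPU_ALLOC(max_cpu + 1);
    if (set == nullptr) {
        return -1;
    }
    const size_t sz = CPU_ALLOC_SIZE(max_cpu + 1);
    CPU_ZERO_S(sz, set);
    for (int i = 0; i < n; i++) {
        CPU_SET_S(cpus[i], sz, set);
    }
    const int rc = sched_setaffinity(0, sz, set); // pid 0 = current process
    CPU_FREE(set);
    return rc;
}
```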

@cpumaxx (Contributor) commented Apr 28, 2024

> That's really a lot of CPUs :) Thanks for testing. Does the numa switch actually work?

Yes, the numa control flags work quite well, but they're mostly for isolating processes to a subset of cores. I've documented a few use cases in https://rentry.org/miqumaxx

> Adding support for more than 64 CPUs is doable but a bit more complex; maybe I can add the NUMA selection if it works. Do you know whether the allocated cores were all physical cores, or also the 2nd SMT threads?

According to the resource locality map in hwloc's lstopo utility, I believe it was successfully targeting only the first hardware thread of each physical core:

Machine (756GB total)
  Package L#0
    L3 L#0 (32MB)
      NUMANode L#0 (P#0 94GB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#64)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#65)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#66)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#67)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#68)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#69)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#70)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#71)
    L3 L#1 (32MB)
      NUMANode L#1 (P#1 94GB)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#72)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#73)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#74)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#75)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#76)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#77)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#78)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#79)
    L3 L#2 (32MB)
      NUMANode L#2 (P#2 94GB)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#80)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#81)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#82)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#83)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
        PU L#40 (P#20)
        PU L#41 (P#84)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
        PU L#42 (P#21)
        PU L#43 (P#85)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
        PU L#44 (P#22)
        PU L#45 (P#86)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#23)
        PU L#47 (P#87)
    L3 L#3 (32MB)
      NUMANode L#3 (P#3 94GB)
      L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
        PU L#48 (P#24)
        PU L#49 (P#88)
      L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
        PU L#50 (P#25)
        PU L#51 (P#89)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
        PU L#52 (P#26)
        PU L#53 (P#90)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
        PU L#54 (P#27)
        PU L#55 (P#91)
      L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
        PU L#56 (P#28)
        PU L#57 (P#92)
      L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
        PU L#58 (P#29)
        PU L#59 (P#93)
      L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
        PU L#60 (P#30)
        PU L#61 (P#94)
      L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
        PU L#62 (P#31)
        PU L#63 (P#95)
  Package L#1
    L3 L#4 (32MB)
      NUMANode L#4 (P#4 94GB)
      L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
        PU L#64 (P#32)
        PU L#65 (P#96)
      L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33
        PU L#66 (P#33)
        PU L#67 (P#97)
      L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34
        PU L#68 (P#34)
        PU L#69 (P#98)
      L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35
        PU L#70 (P#35)
        PU L#71 (P#99)
      L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36
        PU L#72 (P#36)
        PU L#73 (P#100)
      L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37
        PU L#74 (P#37)
        PU L#75 (P#101)
      L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
        PU L#76 (P#38)
        PU L#77 (P#102)
      L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
        PU L#78 (P#39)
        PU L#79 (P#103)
    L3 L#5 (32MB)
      NUMANode L#5 (P#5 94GB)
      L2 L#40 (1024KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40
        PU L#80 (P#40)
        PU L#81 (P#104)
      L2 L#41 (1024KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41
        PU L#82 (P#41)
        PU L#83 (P#105)
      L2 L#42 (1024KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42
        PU L#84 (P#42)
        PU L#85 (P#106)
      L2 L#43 (1024KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43
        PU L#86 (P#43)
        PU L#87 (P#107)
      L2 L#44 (1024KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44
        PU L#88 (P#44)
        PU L#89 (P#108)
      L2 L#45 (1024KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45
        PU L#90 (P#45)
        PU L#91 (P#109)
      L2 L#46 (1024KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46
        PU L#92 (P#46)
        PU L#93 (P#110)
      L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47
        PU L#94 (P#47)
        PU L#95 (P#111)
    L3 L#6 (32MB)
      NUMANode L#6 (P#6 94GB)
      L2 L#48 (1024KB) + L1d L#48 (32KB) + L1i L#48 (32KB) + Core L#48
        PU L#96 (P#48)
        PU L#97 (P#112)
      L2 L#49 (1024KB) + L1d L#49 (32KB) + L1i L#49 (32KB) + Core L#49
        PU L#98 (P#49)
        PU L#99 (P#113)
      L2 L#50 (1024KB) + L1d L#50 (32KB) + L1i L#50 (32KB) + Core L#50
        PU L#100 (P#50)
        PU L#101 (P#114)
      L2 L#51 (1024KB) + L1d L#51 (32KB) + L1i L#51 (32KB) + Core L#51
        PU L#102 (P#51)
        PU L#103 (P#115)
      L2 L#52 (1024KB) + L1d L#52 (32KB) + L1i L#52 (32KB) + Core L#52
        PU L#104 (P#52)
        PU L#105 (P#116)
      L2 L#53 (1024KB) + L1d L#53 (32KB) + L1i L#53 (32KB) + Core L#53
        PU L#106 (P#53)
        PU L#107 (P#117)
      L2 L#54 (1024KB) + L1d L#54 (32KB) + L1i L#54 (32KB) + Core L#54
        PU L#108 (P#54)
        PU L#109 (P#118)
      L2 L#55 (1024KB) + L1d L#55 (32KB) + L1i L#55 (32KB) + Core L#55
        PU L#110 (P#55)
        PU L#111 (P#119)
    L3 L#7 (32MB)
      NUMANode L#7 (P#7 94GB)
      L2 L#56 (1024KB) + L1d L#56 (32KB) + L1i L#56 (32KB) + Core L#56
        PU L#112 (P#56)
        PU L#113 (P#120)
      L2 L#57 (1024KB) + L1d L#57 (32KB) + L1i L#57 (32KB) + Core L#57
        PU L#114 (P#57)
        PU L#115 (P#121)
      L2 L#58 (1024KB) + L1d L#58 (32KB) + L1i L#58 (32KB) + Core L#58
        PU L#116 (P#58)
        PU L#117 (P#122)
      L2 L#59 (1024KB) + L1d L#59 (32KB) + L1i L#59 (32KB) + Core L#59
        PU L#118 (P#59)
        PU L#119 (P#123)
      L2 L#60 (1024KB) + L1d L#60 (32KB) + L1i L#60 (32KB) + Core L#60
        PU L#120 (P#60)
        PU L#121 (P#124)
      L2 L#61 (1024KB) + L1d L#61 (32KB) + L1i L#61 (32KB) + Core L#61
        PU L#122 (P#61)
        PU L#123 (P#125)
      L2 L#62 (1024KB) + L1d L#62 (32KB) + L1i L#62 (32KB) + Core L#62
        PU L#124 (P#62)
        PU L#125 (P#126)
      L2 L#63 (1024KB) + L1d L#63 (32KB) + L1i L#63 (32KB) + Core L#63
        PU L#126 (P#63)
        PU L#127 (P#127)

I'm happy to keep testing this branch. Any speedup on numa systems is very interesting for me!

@mann1x (Author) commented Apr 28, 2024

@cpumaxx
Can you try the latest version?
It should be able to allocate all the threads now.
The custom core bitmask is still limited to the first 64 CPUs.

@cpumaxx (Contributor) commented Apr 28, 2024

I tried a git pull and found I already have the latest version of mannix-win32-cpuset (e5672d3), so my tests from yesterday were with what appears to be the latest code.
Or should I be on another branch, or maybe there are uncommitted changes?

@mann1x (Author) commented Apr 28, 2024

> I tried a git pull and found I already have the latest version of mannix-win32-cpuset (e5672d3), so my tests from yesterday were with what appears to be the latest code. Or should I be on another branch, or maybe there are uncommitted changes?

No, I just pushed the wrong branch from the wrong location...
Please have a look at it now.

@cpumaxx (Contributor) commented Apr 29, 2024

> No, I just pushed the wrong branch from the wrong location... Please have a look at it now.

I'm still seeing a 3% t/s slowdown vs. a fresh pull of the main branch with identical settings (forcing the same seed and the automatically chosen number of cores). It may be that in my case using hyperthreaded cores is a net benefit.
However, the larger problem is that it runs at only 25% of the speed of a simple --numa distribute.
It makes sense that my setup is more sensitive to this kind of tuning than most consumer ones.

@mann1x (Author) commented Apr 29, 2024

> I'm still seeing a 3% t/s slowdown vs. a fresh pull of the main branch with identical settings (forcing the same seed and the automatically chosen number of cores). It may be that in my case using hyperthreaded cores is a net benefit. However, the larger problem is that it runs at only 25% of the speed of a simple --numa distribute. It makes sense that my setup is more sensitive to this kind of tuning than most consumer ones.

Thanks. So you can confirm it's now using the whole 128 threads with -atc 1?
I will try to harmonize this with the --numa setting and the patch @bmtwl is working on.
The performance difference could also come from the changes in master.
There have been a lot of them already, and I still couldn't figure out how to merge.

@cpumaxx (Contributor) commented Apr 29, 2024

> Thanks. So you can confirm it's now using the whole 128 threads with -atc 1?

Yes, it does, but with the -atc 1 flag the performance absolutely collapses (only a third of the t/s compared to not using it), despite all 128 cores being pinned at 100% for the duration of inference.

> There have been a lot of them already, and I still couldn't figure out how to merge.

I just went through this with another PR. If you can't auto-merge in the GitHub interface, you may have to peel out your diffs with git diff and re-apply them manually on a new branch.

@mofosyne added the "Review Complexity : Medium" (generally requires more time to grok, but manageable by beginner-to-medium expertise) and "bugfix" (fixes an issue or bug) labels on May 9, 2024