CPUSet support for Windows and Linux #6832

Open. Wants to merge 12 commits into base: master.

Conversation

@mann1x commented Apr 22, 2024

This patch is a WIP and very likely has bugs here and there, but it's already functional and seems to do what it's supposed to do.

This patch only supports Windows and is limited to processors with 4 to 64 logical cores.

Problems addressed:

  • Only uses physical cores
  • Filters out the E-Cores on Intel platforms
  • Sticks to the same Last Level cache (e.g. L3 for AMD dual-CCD processors)
  • Cores are selected based on their scheduler priority (default: worst to best cores)
  • Compute threads are only allocated on the selected cores
  • Disables Windows power management throttling (Power, Timer, Memory); see the sketch after this list
  • Always excludes Core 0
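
For reference, opting a process out of Windows power management throttling goes through SetProcessInformation with the ProcessPowerThrottling information class (memory priority uses the separate ProcessMemoryPriority class). Here is a minimal, hypothetical sketch of the power/timer part; it is illustrative, not necessarily the exact code in the patch:

```cpp
// Hypothetical sketch: opt the current process out of Windows power
// management throttling (EcoQoS execution speed and timer-resolution
// coalescing). Requires a recent Windows 10/11 SDK.
#include <windows.h>

static bool disable_power_throttling() {
    PROCESS_POWER_THROTTLING_STATE state = {};
    state.Version     = PROCESS_POWER_THROTTLING_CURRENT_VERSION;
    // Name the controls we want to manage explicitly...
    state.ControlMask = PROCESS_POWER_THROTTLING_EXECUTION_SPEED |
                        PROCESS_POWER_THROTTLING_IGNORE_TIMER_RESOLUTION;
    // ...and leave their bits cleared in StateMask to disable throttling.
    state.StateMask   = 0;
    return SetProcessInformation(GetCurrentProcess(), ProcessPowerThrottling,
                                 &state, sizeof(state)) != 0;
}
```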

The main goal is to limit unnecessary system load (e.g. above 6 cores there's no scaling on my 5950X: same performance as with 16 cores, and 8 cores on the 2nd CCD are 10% faster with 5 W less power consumption and half the system load).
At the same time, excluding Core 0 keeps the system responsive and the throughput from llama.cpp constant.
The speed increase with GPU offloading is minimal, about 1-2 t/s, but the system will be more responsive, especially with partial offloading.

Two command-line options, with matching context parameters, have been added (a sketch of the underlying Windows calls follows the list):

  • -bco: Best Core Order; set to 1 to invert the default order so cores are selected from best to worst
  • -llct: Last Level Cache Traversal; set to 1 to allow the core selection to traverse Last Level cache indexes
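
For context, the Windows side can lean on the CPU Sets API (GetSystemCpuSetInformation / SetProcessDefaultCpuSets, Windows 10+), which exposes exactly the fields the lists above need: CoreIndex for SMT siblings, EfficiencyClass for E-Cores, and LastLevelCacheIndex for the LLC. Below is a hypothetical sketch of that filtering, not the exact code in this PR; it omits the LLC check and the scheduler-priority ordering:

```cpp
// Hypothetical sketch: enumerate CPU sets, keep one logical CPU per
// physical core, drop E-Cores and (optionally) Core 0, then restrict the
// process to the survivors. Assumes entries arrive sorted by core.
#include <windows.h>
#include <vector>

static std::vector<ULONG> select_cpu_sets(bool allow_core0, bool allow_smt) {
    ULONG len = 0;
    GetSystemCpuSetInformation(nullptr, 0, &len, GetCurrentProcess(), 0);
    if (len == 0) {
        return {};
    }
    std::vector<char> buf(len);
    auto * info = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buf.data());
    if (!GetSystemCpuSetInformation(info, len, &len, GetCurrentProcess(), 0)) {
        return {};
    }

    // On hybrid Intel parts the P-cores report a higher EfficiencyClass
    // than the E-cores, so find the highest class present first.
    BYTE max_eff = 0;
    for (ULONG off = 0; off < len;) {
        auto * e = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buf.data() + off);
        if (e->CpuSet.EfficiencyClass > max_eff) {
            max_eff = e->CpuSet.EfficiencyClass;
        }
        off += e->Size;
    }

    std::vector<ULONG> ids;
    int last_core = -1;
    for (ULONG off = 0; off < len;) {
        auto * e = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buf.data() + off);
        off += e->Size;
        const auto & cs = e->CpuSet;
        if (!allow_core0 && cs.CoreIndex == 0)             continue; // skip Core 0
        if (cs.EfficiencyClass != max_eff)                 continue; // skip E-Cores
        if (!allow_smt && (int) cs.CoreIndex == last_core) continue; // 1 CPU per core
        last_core = cs.CoreIndex;
        ids.push_back(cs.Id);
    }
    if (!ids.empty()) {
        SetProcessDefaultCpuSets(GetCurrentProcess(), ids.data(), (ULONG) ids.size());
    }
    return ids;
}
```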

@mann1x (Author) commented Apr 24, 2024

Some fixes, plus new options:

  • -acz: Allow Core Zero; set to 1 to allow selection of Core 0
  • -atc: Allow Threaded Cores; set to 1 to allow selection of threaded, non-physical cores
  • -ccm: Custom Cpu Mask; sets a custom CPU affinity bitmask as an integer (see the sketch after this list)
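
For -ccm, the classic affinity API is the natural fit; a single pointer-sized mask is also what caps this path at 64 logical CPUs. A minimal, hypothetical sketch (not the exact patch code):

```cpp
// Hypothetical sketch: apply a user-supplied affinity bitmask to the
// current process. Each set bit selects one logical processor in the
// current processor group, which is what limits the mask to 64 CPUs.
#include <windows.h>
#include <cstdint>

static bool apply_custom_cpu_mask(uint64_t mask) {
    return SetProcessAffinityMask(GetCurrentProcess(), (DWORD_PTR) mask) != 0;
}
```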

@ggerganov
Is the hack of adjusting the n_threads argument during parsing acceptable?
Do you have any comments?

Comment on lines 440 to 462
extern "C"
NTSTATUS
NTAPI
NtQuerySystemInformationEx(
_In_ SYSTEM_INFORMATION_CLASS SystemInformationClass,
_In_reads_bytes_(InputBufferLength) PVOID InputBuffer,
_In_ ULONG InputBufferLength,
_Out_writes_bytes_opt_(SystemInformationLength) PVOID SystemInformation,
_In_ ULONG SystemInformationLength,
_Out_opt_ PULONG ReturnLength
);


extern "C"
NTSTATUS
NTAPI
NtQueryInformationProcess(
_In_ HANDLE ProcessHandle,
_In_ PROCESSINFOCLASS ProcessInformationClass,
_Out_writes_bytes_opt_(ProcessInformationLength) PVOID ProcessInformation,
_In_ ULONG ProcessInformationLength,
_Out_opt_ PULONG ReturnLength
);
Collaborator:

These forward declarations don't seem to be used, so you should remove them.

Author:

> These forward declarations don't seem to be used, so you should remove them.

Thanks for noticing; I was using them for something else that I removed later.

Right now I'm adding support for Linux.
I found out the existing implementation was bugged: there is a typo in the sysfs path, and the affinity for the process is never set.
I won't be able to support everything Windows does, but apart from last level cache traversal everything else should be fine.
Got a nice 10% t/s speed up with a 5600G on Debian.
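
As a rough, hedged sketch of the Linux mechanics (the sysfs paths are the standard topology entries; the helper names here are illustrative): physical cores can be identified from thread_siblings_list, and the affinity that was previously never set is applied with sched_setaffinity:

```cpp
// Hypothetical sketch of the Linux side: find the first SMT sibling of a
// CPU from sysfs (a CPU is "physical" if it is its own first sibling),
// then pin the process to the chosen CPUs.
#include <sched.h>
#include <cstdio>

// Returns the first CPU listed in topology/thread_siblings_list for `cpu`,
// or -1 on error.
static int first_sibling(int cpu) {
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu);
    FILE * f = fopen(path, "r");
    if (!f) {
        return -1;
    }
    int first = -1;
    if (fscanf(f, "%d", &first) != 1) {
        first = -1;
    }
    fclose(f);
    return first;
}

static int pin_to_cpus(const int * cpus, int n) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < n; i++) {
        CPU_SET(cpus[i], &set);
    }
    return sched_setaffinity(0, sizeof(set), &set); // pid 0 = current process
}
```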

@ggerganov (Member) commented:

> Do you have any comments?

This is quite a lot of code that I'm not familiar with - try to put it in separate common/cpuset.h+.cpp files with a very thin API in order to minimize the changes in common.cpp.

This seems targeted for Windows - would be interested in more feedback from Windows users

Do you expect any gains on Linux?

@mann1x (Author) commented Apr 25, 2024

> This is quite a lot of code that I'm not familiar with - try to put it in separate common/cpuset.h+.cpp files with a very thin API in order to minimize the changes in common.cpp.

I will try but I'm not really sure if I can do a good job. My knowledge is limited :)

> This seems targeted for Windows - would be interested in more feedback from Windows users

Yes, I started it on Windows because there was no automatic selection of the physical cores.
But the detection on Linux is bugged, so together with the fix I'm also porting the same CPUSet implementation.

It will be similar but not identical; there are some limitations I'm not yet sure I can overcome:

  • I don't know if I can get the same last level cache information as in Windows
  • Not sure yet if I can get the scheduler priority order like in Windows; I can get the CPPC tag for AMD processors, but on Intel it's often unused or not set properly

Otherwise, all the other features will be available: the custom core mask, skipping Core 0, and including the threaded cores.
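
For what it's worth, both pieces of information have standard sysfs locations on Linux, though coverage varies by CPU and kernel: the CPUs sharing the L3 (the usual LLC) are listed in cache/index3/shared_cpu_list, and on AMD the per-core CPPC tag is exposed in acpi_cppc/highest_perf. A hypothetical helper to read them:

```cpp
// Hypothetical sketch: read a per-CPU sysfs attribute into `out`.
// Both leaf paths in the usage note exist on typical kernels, but they
// are not guaranteed to be present on every system.
#include <cstdio>

static bool read_cpu_sysfs(int cpu, const char * leaf, char * out, int out_len) {
    char path[160];
    snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/%s", cpu, leaf);
    FILE * f = fopen(path, "r");
    if (!f) {
        return false;
    }
    const bool ok = fgets(out, out_len, f) != nullptr;
    fclose(f);
    return ok;
}

// Usage (illustrative):
//   char buf[128];
//   read_cpu_sysfs(0, "cache/index3/shared_cpu_list", buf, sizeof(buf)); // e.g. "0-7"
//   read_cpu_sysfs(0, "acpi_cppc/highest_perf",       buf, sizeof(buf)); // CPPC tag
```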

Feedback on Windows, and on Linux too once the next commit lands, would be really appreciated.

The patch also fixes an issue with the n_threads argument, which specifies the number of logical cores used to spawn the threads.
Specifying more than the actual number of logical cores makes llama.cpp spawn threads on non-existent cores and hang in an endless loop.
With this patch, n_threads is trimmed to the actual number of available logical threads.
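
A minimal sketch of that clamp, assuming the parsed n_threads value is passed in (the function name is illustrative):

```cpp
// Hypothetical sketch: clamp the requested thread count to the number of
// logical CPUs actually available, so threads are never pinned to
// non-existent cores.
#include <algorithm>
#include <thread>

static int clamp_n_threads(int requested) {
    const int avail = (int) std::thread::hardware_concurrency(); // may be 0 if unknown
    return avail > 0 ? std::min(requested, avail) : requested;
}
```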

> Do you expect any gains on Linux?

Same as for Windows, around 10%.

@mann1x (Author) commented Apr 25, 2024

@ggerganov
Added initial support for Linux: almost a 20% increase on my 5600G, from 22 t/s to 26 t/s.
I will clean up the redundancies and think about how to separate the changes.

github-actions bot commented Apr 25, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 437 iterations 🚀

Details (performance-related PR only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10785.47ms p(95)=28179.74ms fails=, finish reason: stop=386 truncated=51
  • Prompt processing (pp): avg=110.51tk/s p(95)=483.24tk/s
  • Token generation (tg): avg=26.35tk/s p(95)=37.12tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=mannix-win32-cpuset commit=063e201b020b8903f9467c00018b86e5a174b2cc

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 437 iterations)]

@mann1x mann1x changed the title CPUSet support for Windows CPUSet support for Windows and Linux Apr 26, 2024
@mann1x (Author) commented Apr 26, 2024

Looking for testers; I will open an issue to ask for help.

Found a limitation with Windows WSL: it reports the topology almost perfectly, except for the last level cache, but it doesn't honor the affinity at all. Even though the process and the threads accept the affinity and report it as set, they just run on random cores.
Does anyone have an idea?

@cpumaxx (Contributor) commented Apr 26, 2024

I am testing this branch. What flags would you say provide the best speedup on Linux?

@mann1x (Author) commented Apr 27, 2024

> I am testing this branch. What flags would you say provide the best speedup on Linux?

Ideally, there's no need to use any flag, including -t.
In most cases that will be the best, or almost the best, configuration.
There's no real all-around setting; it depends on your configuration and on what you are doing (especially whether the model is offloaded or not).

The default settings skip the first logical core, order the cores from worst to best, skip the non-physical (SMT) cores and the E-Cores on Intel, and avoid crossing to the second CCD on AMD processors.

The first thing you should notice when monitoring with htop is that the load is no longer spread randomly across all cores: it should sit only on the first half of the cores (Linux enumerates the SMT siblings in the second half, instead of pairing them as 0/1, 2/3 like Windows), excluding Core 0, and not on the E-Cores if they are present.

Last level cache awareness doesn't work on Linux yet, so on AMD all the cores will be used if a 2nd CCD is present.

You can test whether -t correctly uses only the requested number of CPUs, and whether the other options behave as expected, by monitoring with htop.

Compare with and without the patch and post the results if possible.


 /**
  * Returns number of CPUs on system that are useful for math.
  */
 int get_math_cpu_count() {
-#if defined(__x86_64__) && defined(__linux__)
+#if defined(__x86_164__) && defined(__linux__)
Collaborator:

Is this a typo?

Author:

It's definitely a typo.

Collaborator:

Correcting it will skip the #elif below that #if for x86-64 Linux (may or may not be intended).

@cpumaxx (Contributor) commented Apr 28, 2024

For my specific case, a very simple command line (model, seed, prompt, and token count) resulted in a tiny performance reduction of about 2%. Caches were dropped before each run.
I'm running dual Epyc Genoa with 64 cores/128 threads. It pinned all threads to the "first" 64 cores, but given the NUMA layout that probably wasn't ideal.
[screenshot: cpupinning-1]

node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 96461 MB
node 0 free: 38947 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 96729 MB
node 1 free: 82217 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 96763 MB
node 2 free: 79540 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 96763 MB
node 3 free: 81738 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 96763 MB
node 4 free: 82506 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 96763 MB
node 5 free: 82413 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 96763 MB
node 6 free: 82277 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 96717 MB
node 7 free: 82411 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  12  12  12  32  32  32  32 
  1:  12  10  12  12  32  32  32  32 
  2:  12  12  10  12  32  32  32  32 
  3:  12  12  12  10  32  32  32  32 
  4:  32  32  32  32  10  12  12  12 
  5:  32  32  32  32  12  10  12  12 
  6:  32  32  32  32  12  12  10  12 
  7:  32  32  32  32  12  12  12  10 

@mann1x (Author) commented Apr 28, 2024

> For my specific case, a very simple command line (model, seed, prompt, and token count) resulted in a tiny performance reduction of about 2%. Caches were dropped before each run.
> I'm running dual Epyc Genoa with 64 cores/128 threads. It pinned all threads to the "first" 64 cores, but given the NUMA layout that probably wasn't ideal.

That's really a lot of CPUs :)
Thanks for testing. Does the numa switch actually work?

Adding support for more than 64 CPUs is doable but a bit more complex; maybe I can add the NUMA selection if it works.
Do you know whether the allocated cores were all physical cores, or also the 2nd SMT threads?
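
On Linux the >64-CPU part is mostly free: a fixed cpu_set_t already covers up to 1024 CPUs, and CPU_ALLOC can go beyond that, so only the single-integer custom mask is really capped at 64. On Windows it would need the processor-group APIs (e.g. SetThreadGroupAffinity). A hypothetical Linux sketch with a dynamically sized set:

```cpp
// Hypothetical sketch: pin the current process to an arbitrary number of
// CPUs using a dynamically sized CPU set, with no fixed 64-CPU ceiling.
#include <sched.h>

static int pin_many(const int * cpus, int n, int max_cpu) {
    cpu_set_t * set = CPU_ALLOC(max_cpu + 1);
    if (set == nullptr) {
        return -1;
    }
    const size_t sz = CPU_ALLOC_SIZE(max_cpu + 1);
    CPU_ZERO_S(sz, set);
    for (int i = 0; i < n; i++) {
        CPU_SET_S(cpus[i], sz, set);
    }
    const int rc = sched_setaffinity(0, sz, set); // pid 0 = current process
    CPU_FREE(set);
    return rc;
}
```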

@cpumaxx (Contributor) commented Apr 28, 2024

> That's really a lot of CPUs :) Thanks for testing. Does the numa switch actually work?

Yes, the numa control flags work quite well, but they're mostly for isolating processes to a subset of cores. I've documented a few use cases in https://rentry.org/miqumaxx

> Adding support for more than 64 CPUs is doable but a bit more complex; maybe I can add the NUMA selection if it works. Do you know whether the allocated cores were all physical cores, or also the 2nd SMT threads?

According to the resource locality map in hwloc's lstopo utility, I believe it was successfully targeting only the first hardware thread of each physical core:

Machine (756GB total)
  Package L#0
    L3 L#0 (32MB)
      NUMANode L#0 (P#0 94GB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#64)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#65)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#66)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#67)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#68)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#69)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#70)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#71)
    L3 L#1 (32MB)
      NUMANode L#1 (P#1 94GB)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#72)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#73)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#74)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#75)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#76)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#77)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#78)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#79)
    L3 L#2 (32MB)
      NUMANode L#2 (P#2 94GB)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#80)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#81)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#82)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#83)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
        PU L#40 (P#20)
        PU L#41 (P#84)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
        PU L#42 (P#21)
        PU L#43 (P#85)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
        PU L#44 (P#22)
        PU L#45 (P#86)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#23)
        PU L#47 (P#87)
    L3 L#3 (32MB)
      NUMANode L#3 (P#3 94GB)
      L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
        PU L#48 (P#24)
        PU L#49 (P#88)
      L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
        PU L#50 (P#25)
        PU L#51 (P#89)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
        PU L#52 (P#26)
        PU L#53 (P#90)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
        PU L#54 (P#27)
        PU L#55 (P#91)
      L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
        PU L#56 (P#28)
        PU L#57 (P#92)
      L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
        PU L#58 (P#29)
        PU L#59 (P#93)
      L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
        PU L#60 (P#30)
        PU L#61 (P#94)
      L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
        PU L#62 (P#31)
        PU L#63 (P#95)
  Package L#1
    L3 L#4 (32MB)
      NUMANode L#4 (P#4 94GB)
      L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
        PU L#64 (P#32)
        PU L#65 (P#96)
      L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33
        PU L#66 (P#33)
        PU L#67 (P#97)
      L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34
        PU L#68 (P#34)
        PU L#69 (P#98)
      L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35
        PU L#70 (P#35)
        PU L#71 (P#99)
      L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36
        PU L#72 (P#36)
        PU L#73 (P#100)
      L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37
        PU L#74 (P#37)
        PU L#75 (P#101)
      L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
        PU L#76 (P#38)
        PU L#77 (P#102)
      L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
        PU L#78 (P#39)
        PU L#79 (P#103)
    L3 L#5 (32MB)
      NUMANode L#5 (P#5 94GB)
      L2 L#40 (1024KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40
        PU L#80 (P#40)
        PU L#81 (P#104)
      L2 L#41 (1024KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41
        PU L#82 (P#41)
        PU L#83 (P#105)
      L2 L#42 (1024KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42
        PU L#84 (P#42)
        PU L#85 (P#106)
      L2 L#43 (1024KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43
        PU L#86 (P#43)
        PU L#87 (P#107)
      L2 L#44 (1024KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44
        PU L#88 (P#44)
        PU L#89 (P#108)
      L2 L#45 (1024KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45
        PU L#90 (P#45)
        PU L#91 (P#109)
      L2 L#46 (1024KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46
        PU L#92 (P#46)
        PU L#93 (P#110)
      L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47
        PU L#94 (P#47)
        PU L#95 (P#111)
    L3 L#6 (32MB)
      NUMANode L#6 (P#6 94GB)
      L2 L#48 (1024KB) + L1d L#48 (32KB) + L1i L#48 (32KB) + Core L#48
        PU L#96 (P#48)
        PU L#97 (P#112)
      L2 L#49 (1024KB) + L1d L#49 (32KB) + L1i L#49 (32KB) + Core L#49
        PU L#98 (P#49)
        PU L#99 (P#113)
      L2 L#50 (1024KB) + L1d L#50 (32KB) + L1i L#50 (32KB) + Core L#50
        PU L#100 (P#50)
        PU L#101 (P#114)
      L2 L#51 (1024KB) + L1d L#51 (32KB) + L1i L#51 (32KB) + Core L#51
        PU L#102 (P#51)
        PU L#103 (P#115)
      L2 L#52 (1024KB) + L1d L#52 (32KB) + L1i L#52 (32KB) + Core L#52
        PU L#104 (P#52)
        PU L#105 (P#116)
      L2 L#53 (1024KB) + L1d L#53 (32KB) + L1i L#53 (32KB) + Core L#53
        PU L#106 (P#53)
        PU L#107 (P#117)
      L2 L#54 (1024KB) + L1d L#54 (32KB) + L1i L#54 (32KB) + Core L#54
        PU L#108 (P#54)
        PU L#109 (P#118)
      L2 L#55 (1024KB) + L1d L#55 (32KB) + L1i L#55 (32KB) + Core L#55
        PU L#110 (P#55)
        PU L#111 (P#119)
    L3 L#7 (32MB)
      NUMANode L#7 (P#7 94GB)
      L2 L#56 (1024KB) + L1d L#56 (32KB) + L1i L#56 (32KB) + Core L#56
        PU L#112 (P#56)
        PU L#113 (P#120)
      L2 L#57 (1024KB) + L1d L#57 (32KB) + L1i L#57 (32KB) + Core L#57
        PU L#114 (P#57)
        PU L#115 (P#121)
      L2 L#58 (1024KB) + L1d L#58 (32KB) + L1i L#58 (32KB) + Core L#58
        PU L#116 (P#58)
        PU L#117 (P#122)
      L2 L#59 (1024KB) + L1d L#59 (32KB) + L1i L#59 (32KB) + Core L#59
        PU L#118 (P#59)
        PU L#119 (P#123)
      L2 L#60 (1024KB) + L1d L#60 (32KB) + L1i L#60 (32KB) + Core L#60
        PU L#120 (P#60)
        PU L#121 (P#124)
      L2 L#61 (1024KB) + L1d L#61 (32KB) + L1i L#61 (32KB) + Core L#61
        PU L#122 (P#61)
        PU L#123 (P#125)
      L2 L#62 (1024KB) + L1d L#62 (32KB) + L1i L#62 (32KB) + Core L#62
        PU L#124 (P#62)
        PU L#125 (P#126)
      L2 L#63 (1024KB) + L1d L#63 (32KB) + L1i L#63 (32KB) + Core L#63
        PU L#126 (P#63)
        PU L#127 (P#127)

I'm happy to keep testing this branch. Any speedup on numa systems is very interesting for me!

@mann1x (Author) commented Apr 28, 2024

@cpumaxx
Can you try the latest version?
It should be able to allocate all the threads now.
The custom core bitmask is still limited to the first 64 CPUs.

@cpumaxx (Contributor) commented Apr 28, 2024

I tried a git pull and found I already have the latest version of mannix-win32-cpuset (e5672d3), so my tests from yesterday were with what appears to be the latest code.
Or should I be on another branch, or maybe there are uncommitted changes?

@mann1x (Author) commented Apr 28, 2024

> I tried a git pull and found I already have the latest version of mannix-win32-cpuset (e5672d3), so my tests from yesterday were with what appears to be the latest code. Or should I be on another branch, or maybe there are uncommitted changes?

No, I just pushed the wrong branch from the wrong location...
Please have a look at it now.

@cpumaxx (Contributor) commented Apr 29, 2024

> No, I just pushed the wrong branch from the wrong location... Please have a look at it now.

I'm still seeing a 3% t/s slowdown vs. a fresh pull of the main branch with identical settings (forcing the same seed and the automatically chosen number of cores). It may be that in my case using hyperthreaded cores is a net benefit.
However, the larger problem is that it runs at only 25% of the speed of a simple --numa distribute.
It makes sense that my setup is more sensitive to this kind of tuning than most consumer ones.

@mann1x (Author) commented Apr 29, 2024

> I'm still seeing a 3% t/s slowdown vs. a fresh pull of the main branch with identical settings (forcing the same seed and the automatically chosen number of cores). It may be that in my case using hyperthreaded cores is a net benefit. However, the larger problem is that it runs at only 25% of the speed of a simple --numa distribute. It makes sense that my setup is more sensitive to this kind of tuning than most consumer ones.

Thanks. So you can confirm it's now using the whole 128 threads with -atc 1?
I will try to harmonize this with the --numa setting and the patch @bmtwl is working on.
The performance difference could also come from the changes in master.
There have been a lot of them already, and I still couldn't figure out how to merge.

@cpumaxx (Contributor) commented Apr 29, 2024

> Thanks. So you can confirm it's now using the whole 128 threads with -atc 1?

Yes, it does, but with the -atc 1 flag the performance absolutely collapses (only a third of the t/s compared to not using it), despite all 128 cores being pinned at 100% for the duration of inference.

> There have been a lot of them already, and I still couldn't figure out how to merge.

I just went through this with another PR. If you can't auto-merge in the GitHub interface, you may have to peel out your diffs with git diff and re-apply them manually on a new branch.

@mofosyne added the "Review Complexity : Medium" (generally requires more time to grok, but manageable by beginner-to-medium expertise) and "bugfix" (fixes an issue or bug) labels on May 9, 2024