-
Notifications
You must be signed in to change notification settings - Fork 12k
CPUSet support for Windows and Linux #6832
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Some fixes and added new options:
@ggerganov |
common/common.cpp
Outdated
extern "C" | ||
NTSTATUS | ||
NTAPI | ||
NtQuerySystemInformationEx( | ||
_In_ SYSTEM_INFORMATION_CLASS SystemInformationClass, | ||
_In_reads_bytes_(InputBufferLength) PVOID InputBuffer, | ||
_In_ ULONG InputBufferLength, | ||
_Out_writes_bytes_opt_(SystemInformationLength) PVOID SystemInformation, | ||
_In_ ULONG SystemInformationLength, | ||
_Out_opt_ PULONG ReturnLength | ||
); | ||
|
||
|
||
extern "C" | ||
NTSTATUS | ||
NTAPI | ||
NtQueryInformationProcess( | ||
_In_ HANDLE ProcessHandle, | ||
_In_ PROCESSINFOCLASS ProcessInformationClass, | ||
_Out_writes_bytes_opt_(ProcessInformationLength) PVOID ProcessInformation, | ||
_In_ ULONG ProcessInformationLength, | ||
_Out_opt_ PULONG ReturnLength | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These forward declarations don't seem to be used, so you should remove them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These forward declarations don't seem to be used, so you should remove them.
Thanks for noticing it, I was using it for something else that I removed later.
Right now I'm adding support for Linux.
Found out the implementation was bugged; there is a typo in the sys path and the affinity for the process is never set.
I wouldn't be able to support the same as Windows but except the last level cache traversal everything else should be fine.
Got a nice 10% t/s speed up with a 5600G on Debian.
This is quite a lot of code that I'm not familiar with - try to put it in separate This seems targeted for Windows - would be interested in more feedback from Windows users Do you expect any gains on Linux? |
I will try but I'm not really sure if I can do a good job. My knowledge is limited :)
Yes, I started it on Windows because there was no automatic selection of the physical cores. Different but similar, there are some limitations which I'm not yet sure I can overcome;
Otherwise all the other features will be available; the custom core mask, skipping the core 0, including the threaded cores. Feedback on Windows and on Linux too with the next commit would be really appreciated. The patch also fixes an issue with the
Same as for Windows, around 10%. |
@ggerganov |
Looking for testers, I will open an issue to ask for help. Found a limitation with Windows WSL; it reports almost perfectly the topology, except the last level cache, but it doesn't obey at all the affinity. Despite the process and the threads accepting the affinity and reporting it set alright, they are just run on random cores. |
I am testing this branch. What flags would you say provide the best speedup on Linux? |
Ideally, no need to use any flag including The default settings are skipping the first logical Core, ordering the cores from worst to best, skipping the non-physical cores, the E-Cores on Intel and the CCD jump on AMD processors. The first thing you should notice if monitoring with Skipping the Last Level cache doesn't work on Linux, so on AMD all the cores will be used if a 2nd CCD is present. You can test if Compare with and without the patch and post the results if possible. |
common/common.cpp
Outdated
|
||
/** | ||
* Returns number of CPUs on system that are useful for math. | ||
*/ | ||
int get_math_cpu_count() { | ||
#if defined(__x86_64__) && defined(__linux__) | ||
#if defined(__x86_164__) && defined(__linux__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's definitely a typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correcting it will skip the #elif
below that #if
for x86-64 Linux (may or may not be intended).
That's really a lot of CPUs :) Adding support for more than 64 CPUs is doable but a bit more complex, maybe I can add the numa selection if it works. |
Yes, the numa control flags work quite well, but they're mostly for isolating processes to subset of cores. I've documented a few use cases in https://rentry.org/miqumaxx
According to the resource locality map in hwloc's lstopo util, I believe it was successfully targeting the first HT physical cores only:
I'm happy to keep testing this branch. Any speedup on numa systems is very interesting for me! |
@cpumaxx |
I tried a git pull and found I already have the latest version of mannic-win32-cpuset (e5672d3), so my tests from yesterday are with what appears to be the latest code. |
No I just pushed the wrong branch from the wrong location... |
I'm still seeing a 3% T/s slowdown vs. a fresh pull of the main branch with identical settings (forcing the same seed and automatically chosen number of cores). It may be that in my case using hyperthreaded cores is a net benefit |
Thanks, so you can confirm now is using the whole 128 threads with |
Yes, it does, but with the
I just went through this with another PR. If you can't auto-merge in the github interface, you may have to peel out your diffs with "git diff" and re-apply them manually on a new branch |
This patch is a WIP and very likely bugged here and there,
Still it's already functional and seems to do what is supposed to do.
This patch only supports Windows and it's limited to processors from 4 to 64 logical cores.
Problems addressed:
The main goal is to limit the unnecessary system load (eg. over 6 cores there's no scaling up on my 5950X, same performances as with 16 cores, 8 cores on the 2nd CCD are 10% faster with 5W less power consumption and half the system load).
At the same time excluding Core 0 means having always a reactive system and constant throughput from lllama.cpp.
Speed increase with GPU Offloading is minimal, about 1-2 t/s, but the system will be more reactive especially with partial offloading.
Two command line options and parameters for the context have been added:
-bco
: Best Core Order, set to 1 will invert the default order and the cores will be selected from the best to the worst-llct
: Last Level Cache Traversal, set to 1 will allow the core selection to traverse the Last Level cache index