JIT: Fix SIGSEGV issue while running hyperf-skeleton #10658
Conversation
While benchmarking hyperf-skeleton with JIT tracing enabled, the following message is thrown in the output of hyperf:

```
Server::check_worker_exit_status(): worker(pid=70835, id=7) abnormal exit, status=0, signal=11
```

Checking the core dump file, it's seen that:

```
Thread 1 "php" received signal SIGSEGV, Segmentation fault.
0x00005570c716b034 in zend_fetch_ce_from_cache_slot (cache_slot=0x8, type=0x7fa96f368d18)
    at /home/sdp/php/php-src/Zend/zend_execute.c:980
980         if (EXPECTED(HAVE_CACHE_SLOT && *cache_slot)) {
```

`EX(run_time_cache)` is NULL, which directly causes the crash.

With JIT tracing enabled, `run_time_cache` is loaded in `zend_jit_do_fcall()`. For the case where `run_time_cache` is an offset, loading it from `EX:RX->func` rather than `func` resolves the SIGSEGV. So it seems that `EX:RX->func->op_array->run_time_cache__ptr` and `func->op_array->run_time_cache__ptr` do not always stay consistent with each other at runtime. It's safe to load from `EX:RX->func`.

This fix is verified on PHP-8.2.3 and swoole v5.0.2, with benchmark:

```
wrk -t 10 -c 500 -d 300 http://127.0.0.1:9501/
```

With this fix, hyperf-skeleton gains around 9% RPS with JIT tracing enabled.

Signed-off-by: Cathy Zhang <[email protected]>
@bjzhjing Can you please provide a test PHP script? If this is reproducible only with swoole, please give me complete instructions on how to reproduce the failure.
@dstogov Thanks for the quick response! The attachment details how to reproduce this issue.
From the core dump file, it can be seen that the failing call is in `loadFromSwooleRequest()`.
`loadFromSwooleRequest()` can be found in hyperf-skeleton/vendor/hyperf/http-message/src/Server/Request.php.
Both `self::getUriFromGlobals($swooleRequest)` and `self::normalizeParsedBody($swooleRequest->post ?? [], $request)` have shown up as failure points in dump files. I've made another attempt, changing `loadFromSwooleRequest()` by using
@dstogov Hello, are you able to reproduce this issue?
At first, I saw a bug in swoole that is visible with
I'm continuing the investigation.
Thanks for the report and the attempt to fix this. I can reproduce the crash, and I see that your patch fixes it, but it fixes only a side effect of the problem. I need to understand the real reason for the problem. It looks like this is related to some race condition that violates normal synchronization. This may be because of some swoole trick, or it may be a general PHP bug. It looks like different workers inconsistently update `op_cache.run_time_cache__ptr` or something similar. Can I limit the number of workers of hyperf.php?
Thanks for showing the logs! Did you re-compile PHP with -fsanitize=address to get these details? Yes, I've noticed
The number of workers
I changed it to 10 in the above example. If the number is 2, there is no SIGSEGV issue, but if it's 3, the issue happens in my environment. It might be a race condition problem.
The real bug should be fixed by #10719
The fix was merged into PHP-8.2 via 2e3fc8c
@dstogov Verified, the fix works well! Cool! I had also suspected the inheritance cache, but I'm not familiar with that part of the code :-)
The connection between
So there is indeed a race condition issue, right? Is it in swoole or PHP? I tried enabling ZTS, but the issue still exists in that case. Actually, I'm a bit confused here: if there is still a race condition, why do you say the change in the inheritance cache is the real fix for this one? Please forgive me if you think I have too many questions; I'm curious about this but not so familiar with this part of the code.
The error was in PHP. The inheritance cache should keep a single copy of each linked class. It first checks whether we already have a linked class in the cache, then obtains a lock, checks the cache once again (because a different worker might have cached it at the same time), then stores the new linked class in the cache, and releases the lock. Unfortunately, we had a bug in the second check, and this led to the problems.
OK, I see, so the bug mentioned here is what you fixed for this issue. I've got a general idea of the process. Thanks a bunch for the explanation!