Skip to content

JIT: Fix SIGSEGV issue while running hyperf-skeleton #10658

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

bjzhjing
Copy link

@bjzhjing bjzhjing commented Feb 22, 2023

While benchmark hyperf-skeleton with JIT tracing enabled, the following message is thrown in the output of hyperf:

Server::check_worker_exit_status(): worker(pid=70835, id=7) \ abnormal exit, status=0, signal=11

Checking the core dump file, it's seen that:

Thread 1 "php" received signal SIGSEGV, Segmentation fault. 0x00005570c716b034 in zend_fetch_ce_from_cache_slot (cache_slot=0x8,\
 type=0x7fa96f368d18) at /home/sdp/php/php-src/Zend/zend_execute.c:980
980             if (EXPECTED(HAVE_CACHE_SLOT && *cache_slot)) {

EX(run_time_cache) is NULL which directly causes the issue.

With JIT tracing enabled, run_time_cache is loaded in zend_jit_do_fcall(). For the case that run_time_cache is an offset, if we load it from EX:RX->func, rather than func, the SIGSEGV will be resolved. So it seems that EX:RX->func->op_array->run_time_cache__ptr and func->op_array->run_time_cache__ptr does not always keep accordance with each other during runtime. It's safe to load from EX:RX->func.

This fix is verified on PHP-8.2.3 and swoole v5.0.2, benchmark hyperf-skeleton with:
wrk -t 10 -c 500 -d 300 http://127.0.0.1:9501/
With this fix, hyperf-skeleton can gain around 9% RPS gain with JIT tracing enabled.

While benchmark hyperf-skeleton with JIT tracing enabled, the following
message is thrown in the output of hyperf:

Server::check_worker_exit_status(): worker(pid=70835, id=7) \
abnormal exit, status=0, signal=11

Checking the core dump file, it's seen that:

Thread 1 "php" received signal SIGSEGV, Segmentation fault.
0x00005570c716b034 in zend_fetch_ce_from_cache_slot (cache_slot=0x8,\
 type=0x7fa96f368d18) at /home/sdp/php/php-src/Zend/zend_execute.c:980
980             if (EXPECTED(HAVE_CACHE_SLOT && *cache_slot)) {

EX(run_time_cache) is NULL which directly causes the issue.

With JIT tracing enabled, run_time_cache is loaded in
zend_jit_do_fcall(). For the case that run_time_cache is an offset,
if we load it from EX:RX->func, rather than func, the SIGSEGV will be
resolved. So it seems that EX:RX->func->op_array->run_time_cache__ptr
and func->op_array->run_time_cache__ptr does not always keep accordance
with each other during runtime. It's safe to load from EX:RX->func.

This fix is verified on PHP-8.2.3 and swoole v5.0.2, with benchmark:
  wrk -t 10 -c 500 -d 300  http://127.0.0.1:9501/
With this fix, hyperf-skeleton can gain around 9% RPS gain with JIT
tracing enabled.

Signed-off-by: Cathy Zhang <[email protected]>
@dstogov
Copy link
Member

dstogov commented Feb 22, 2023

@bjzhjing can you please provide a test php script?

If this is reproducible only with swoole, please give me the complete instruction, how to reproduce the failure.

@bjzhjing
Copy link
Author

@dstogov Thanks for the quick response! The attachment is a details on how to reproduce this issue.
reproduce-sigsegv.zip
Please let me know if there is any step I do not explain clear.

@bjzhjing
Copy link
Author

bjzhjing commented Feb 23, 2023

From core dump file, it can be seen that the call failure is in loadFromSwooleRequest().

Core was generated by `skeleton.Worker.7                       '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055bbd0da79be in zend_fetch_ce_from_cache_slot (cache_slot=0x8, type=0x55bbd44120a0)
    at /home/sdp/php/php-src/Zend/zend_execute.c:1009
1009            if (EXPECTED(HAVE_CACHE_SLOT && *cache_slot)) {
[Current thread is 1 (Thread 0x7f61b5b51280 (LWP 3029790))]
(gdb) print ((char*)executor_globals.current_execute_data->func->common->function_name->val)
$1 = 0x55bbd31b2c70 "getUriFromGlobals"
(gdb) print ((char*)executor_globals.current_execute_data->prev_execute_data->func->common->function_name->val)
$2 = 0x55bbd31a9cf0 "loadFromSwooleRequest"

loadFromSwooleRequest() can be found in hyperf-skeleton/vendor/hyperf/http-message/src/Server/Request.php.

    public static function loadFromSwooleRequest(Swoole\Http\Request $swooleRequest)
    {
        $server = $swooleRequest->server;
        $method = $server['request_method'] ?? 'GET';
        $headers = $swooleRequest->header ?? [];
        $uri = self::getUriFromGlobals($swooleRequest);
        $body = new SwooleStream((string) $swooleRequest->rawContent());
        $protocol = isset($server['server_protocol']) ? str_replace('HTTP/', '', $server['server_protocol']) : '1.1';
        $request = new Request($method, $uri, $headers, $body, $protocol);
        $request->cookieParams = ($swooleRequest->cookie ?? []);
        $request->queryParams = ($swooleRequest->get ?? []);
        $request->serverParams = ($server ?? []);
        $request->parsedBody = self::normalizeParsedBody($swooleRequest->post ?? [], $request);
        $request->uploadedFiles = self::normalizeFiles($swooleRequest->files ?? []);
        $request->swooleRequest = $swooleRequest;
        return $request;
    }

Both self::getUriFromGlobals($swooleRequest) and self::normalizeParsedBody($swooleRequest->post ?? [], $request) are failure points ever found in dump files.

I've done another try, that make change in loadFromSwooleRequest() by using static:: instead of self:: to call those methods while no change from PHP JIT side, there is no SIGSEGV issue. static:: will refer to the currently running class while self:: will refer to the class in which method is defined. I think it is related to the run_time_cache loading in zend_jit_do_fcall() where I changed in this PR.

@bjzhjing
Copy link
Author

@dstogov Hello, are you able to reproduce this issue?

@dstogov
Copy link
Member

dstogov commented Feb 27, 2023

At first, I see a bug in swoole that is visible with opcache.protect_memory=1. Swoole changes data in opcache shared memory and this may cause unexpected behavior. It must check GC_IMMUTABLE flag and prevent changes e.g. by adding if (!IS_INTERNED(str)) { in zend::String::rtrim().

==1094138==ERROR: AddressSanitizer: SEGV on unknown address 0x000007d91bb0 (pc 0x7fc0139b3b27 bp 0x7fc0091ead70 sp 0x7fc0091ead50 T0)
==1094138==The signal is caused by a WRITE memory access.
    #0 0x7fc0139b3b27 in zend::String::rtrim() /home/dmitry/php/swoole-src/ext-src/php_swoole_cxx.h:219
    #1 0x7fc0139d1848 in operator() /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:340
    #2 0x7fc0139d1fa7 in swoole::http::Context::build_header(swoole::String*, char const*, unsigned long) /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:383
    #3 0x7fc0139d4591 in swoole::http::Context::end(_zval_struct*, _zval_struct*) /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:787
    #4 0x7fc0139d3cb7 in zim_swoole_http_response_end /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:678
    #5 0x21373cd in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER /home/dmitry/php/php8.2/Zend/zend_vm_execute.h:1842
    #6 0x23bf54a in execute_ex /home/dmitry/php/php8.2/Zend/zend_vm_execute.h:56064
    #7 0x7fc01397889a in swoole::PHPCoroutine::main_func(void*) /home/dmitry/php/swoole-src/ext-src/swoole_coroutine.cc:726
    #8 0x7fc013982da3 in void std::__invoke_impl<void, void (*&)(void*), void*>(std::__invoke_other, void (*&)(void*), void*&&) /usr/include/c++/12/bits/invoke.h:61
    #9 0x7fc013981779 in std::enable_if<std::__and_<std::is_void<void>, std::__is_invocable<void (*&)(void*), void*> >::value, void>::type std::__invoke_r<void, void (*&)(void*), void*>(void (*&)(void*), void*&&) /usr/include/c++/12/bits/invoke.h:154
    #10 0x7fc013980892 in std::_Function_handler<void (void*), void (*)(void*)>::_M_invoke(std::_Any_data const&, void*&&) /usr/include/c++/12/bits/std_function.h:290
    #11 0x7fc0139179ae in std::function<void (void*)>::operator()(void*) const /usr/include/c++/12/bits/std_function.h:591
    #12 0x7fc013ab8cd8 in swoole::coroutine::Context::context_func(void*) /home/dmitry/php/swoole-src/src/coroutine/context.cc:142
    #13 0x7fc013baf2b0 in swoole_make_fcontext (/home/dmitry/php/usr/php8.2-debug-64/lib/php/extensions/debug-non-zts-20220829/swoole.so+0x5af2b0)

I'm continue investigation...

@dstogov
Copy link
Member

dstogov commented Feb 27, 2023

Thanks for the report and an attempt to fix this.

I can reproduce the crash, and I see that you patch fixes it, but it fixes only a side effect of the problem. I need to understand the real reason of the problem.

It looks like, this is related to some race condition that violates normal synchronization. This may be because of some swoole trick or may be a general php bug. It looks like different workers inconsistently update op_cache.run_time_cache__ptr or something CG(map_ptr...) related.

Can I limit the number of workes of hyperf.php?

@bjzhjing
Copy link
Author

At first, I see a bug in swoole that is visible with opcache.protect_memory=1. Swoole changes data in opcache shared memory and this may cause unexpected behavior. It must check GC_IMMUTABLE flag and prevent changes e.g. by adding if (!IS_INTERNED(str)) { in zend::String::rtrim().

==1094138==ERROR: AddressSanitizer: SEGV on unknown address 0x000007d91bb0 (pc 0x7fc0139b3b27 bp 0x7fc0091ead70 sp 0x7fc0091ead50 T0)
==1094138==The signal is caused by a WRITE memory access.
    #0 0x7fc0139b3b27 in zend::String::rtrim() /home/dmitry/php/swoole-src/ext-src/php_swoole_cxx.h:219
    #1 0x7fc0139d1848 in operator() /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:340
    #2 0x7fc0139d1fa7 in swoole::http::Context::build_header(swoole::String*, char const*, unsigned long) /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:383
    #3 0x7fc0139d4591 in swoole::http::Context::end(_zval_struct*, _zval_struct*) /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:787
    #4 0x7fc0139d3cb7 in zim_swoole_http_response_end /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:678
    #5 0x21373cd in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER /home/dmitry/php/php8.2/Zend/zend_vm_execute.h:1842
    #6 0x23bf54a in execute_ex /home/dmitry/php/php8.2/Zend/zend_vm_execute.h:56064
    #7 0x7fc01397889a in swoole::PHPCoroutine::main_func(void*) /home/dmitry/php/swoole-src/ext-src/swoole_coroutine.cc:726
    #8 0x7fc013982da3 in void std::__invoke_impl<void, void (*&)(void*), void*>(std::__invoke_other, void (*&)(void*), void*&&) /usr/include/c++/12/bits/invoke.h:61
    #9 0x7fc013981779 in std::enable_if<std::__and_<std::is_void<void>, std::__is_invocable<void (*&)(void*), void*> >::value, void>::type std::__invoke_r<void, void (*&)(void*), void*>(void (*&)(void*), void*&&) /usr/include/c++/12/bits/invoke.h:154
    #10 0x7fc013980892 in std::_Function_handler<void (void*), void (*)(void*)>::_M_invoke(std::_Any_data const&, void*&&) /usr/include/c++/12/bits/std_function.h:290
    #11 0x7fc0139179ae in std::function<void (void*)>::operator()(void*) const /usr/include/c++/12/bits/std_function.h:591
    #12 0x7fc013ab8cd8 in swoole::coroutine::Context::context_func(void*) /home/dmitry/php/swoole-src/src/coroutine/context.cc:142
    #13 0x7fc013baf2b0 in swoole_make_fcontext (/home/dmitry/php/usr/php8.2-debug-64/lib/php/extensions/debug-non-zts-20220829/swoole.so+0x5af2b0)

I'm continue investigation...

Thanks for showing the logs! Do you re-compile php with -fsanitize=address to get these details? Yes, I've noticed opcache.protect_memory=1 issue, but I'm not aware of using AddressSanitizer.

@bjzhjing
Copy link
Author

bjzhjing commented Feb 27, 2023

Thanks for the report and an attempt to fix this.

I can reproduce the crash, and I see that you patch fixes it, but it fixes only a side effect of the problem. I need to understand the real reason of the problem.

It looks like, this is related to some race condition that violates normal synchronization. This may be because of some swoole trick or may be a general php bug. It looks like different workers inconsistently update op_cache.run_time_cache__ptr or something CG(map_ptr...) related.

Can I limit the number of workes of hyperf.php?

The number of workers Constant::OPTION_WORKER_NUM is configured in file hyperf-skeleton/config/autoload/server.php:

    'settings' => [
        Constant::OPTION_ENABLE_COROUTINE => true,
        Constant::OPTION_WORKER_NUM => 10,
        Constant::OPTION_PID_FILE => BASE_PATH . '/runtime/hyperf.pid',
        Constant::OPTION_OPEN_TCP_NODELAY => true,
        Constant::OPTION_MAX_COROUTINE => 100000,
        Constant::OPTION_OPEN_HTTP2_PROTOCOL => true,
        Constant::OPTION_MAX_REQUEST => 100000,
        Constant::OPTION_SOCKET_BUFFER_SIZE => 2 * 1024 * 1024,
        Constant::OPTION_BUFFER_OUTPUT_SIZE => 2 * 1024 * 1024,
        Constant::OPTION_OPEN_CPU_AFFINITY => true,
    ],

I change it as 10 in the above example. If the number is 2, there is no SIGSEGV issue, but if it's 3, the issue happens in my environment. It might be a race condition problem.

@dstogov
Copy link
Member

dstogov commented Feb 27, 2023

The real bug, should be fixed by #10719
Thank you very much for this test case.
I'm going to merge it after additional testing.
If you may check the fix, please confirm it fixes this problem.

@dstogov
Copy link
Member

dstogov commented Feb 27, 2023

The fix is merged into PHP-8.2 via 2e3fc8c
Please test and reopen the issue if you see problem. In my local tests everything works fine now.

@bjzhjing
Copy link
Author

@dstogov Verfied, the fix works well! Cool! I've also doubted the inheritance cache, but I'm not familiar with the code part :-)
Can I ask some questions regarding this? If inheritance cache gets NULL, does it mean func->op_array->run_time_cache__ptr is NULL? and why EX:RX->func->op_array->run_time_cache__ptr is not NULL? I think it's not, otherwise, my change can not avoid this issue.

@dstogov
Copy link
Member

dstogov commented Feb 28, 2023

If inheritance cache gets NULL, does it mean func->op_array->run_time_cache__ptr is NULL? and why EX:RX->func->op_array->run_time_cache__ptr is not NULL? I think it's not, otherwise, my change can not avoid this issue.

The connection between run_time_cache__ptr and inheritance cache is not direct. Actually, because of a bug, on some race condition we might get several copies of the same linked class. Then JIT-ed code compiled by different worker and having different copies in mind, initialized methods of one copy but then expected the other. This is what you had found. op_array(s) used durig JIT compilation and during execution got different values of run_time_cache__ptr.

@bjzhjing
Copy link
Author

If inheritance cache gets NULL, does it mean func->op_array->run_time_cache__ptr is NULL? and why EX:RX->func->op_array->run_time_cache__ptr is not NULL? I think it's not, otherwise, my change can not avoid this issue.

The connection between run_time_cache__ptr and inheritance cache is not direct. Actually, because of a bug, on some race condition we might get several copies of the same linked class. Then JIT-ed code compiled by different worker and having different copies in mind, initialized methods of one copy but then expected the other. This is what you had found. op_array(s) used durig JIT compilation and during execution got different values of run_time_cache__ptr.

So there is indeed a race condition issue, right? Is it in swoole or php? I've ever tried to enable ZTS, but the issue still exists in that case. Acutally, I'm a bit confused here, if there is still race condition, why you say the change in inheritance cache is the real fix for this one? Please forgive if you think I have too many questions, I'm curious about this but I'm not so familiar with the code part.

@dstogov
Copy link
Member

dstogov commented Feb 28, 2023

So there is indeed a race condition issue, right? Is it in swoole or php? I've ever tried to enable ZTS, but the issue still exists in that case. Acutally, I'm a bit confused here, if there is still race condition, why you say the change in inheritance cache is the real fix for this one? Please forgive if you think I have too many questions, I'm curious about this but I'm not so familiar with the code part.

The error was in PHP.

The inheritance cache should keep a single copy of each linked class. It first checks if we already have a linked class in the cache, than obtains a lock, checks the cache once again (because different worker might cache it at the same time), than stores new linked class in the cache, and releases the lock. Unfortunately, we had a bug at the second check and this leaded to problems.

@bjzhjing
Copy link
Author

The inheritance cache should keep a single copy of each linked class. It first checks if we already have a linked class in the cache, than obtains a lock, checks the cache once again (because different worker might cache it at the same time), than stores new linked class in the cache, and releases the lock. Unfortunately, we had a bug at the second check and this leaded to problems.

Ok, I see, so the bug mentioned here is what you fixed for this issue. I' ve got a general idea for the process. Thanks a bunch for the explanation!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants