JIT: Fix SIGSEGV issue while running hyperf-skeleton #10658

bjzhjing · 2023-02-22T04:34:45Z

While benchmark hyperf-skeleton with JIT tracing enabled, the following message is thrown in the output of hyperf:

Server::check_worker_exit_status(): worker(pid=70835, id=7) \ abnormal exit, status=0, signal=11

Checking the core dump file, it's seen that:

Thread 1 "php" received signal SIGSEGV, Segmentation fault. 0x00005570c716b034 in zend_fetch_ce_from_cache_slot (cache_slot=0x8,\
 type=0x7fa96f368d18) at /home/sdp/php/php-src/Zend/zend_execute.c:980
980             if (EXPECTED(HAVE_CACHE_SLOT && *cache_slot)) {

EX(run_time_cache) is NULL which directly causes the issue.

With JIT tracing enabled, run_time_cache is loaded in zend_jit_do_fcall(). For the case that run_time_cache is an offset, if we load it from EX:RX->func, rather than func, the SIGSEGV will be resolved. So it seems that EX:RX->func->op_array->run_time_cache__ptr and func->op_array->run_time_cache__ptr does not always keep accordance with each other during runtime. It's safe to load from EX:RX->func.

This fix is verified on PHP-8.2.3 and swoole v5.0.2, benchmark hyperf-skeleton with:
wrk -t 10 -c 500 -d 300 http://127.0.0.1:9501/
With this fix, hyperf-skeleton can gain around 9% RPS gain with JIT tracing enabled.

While benchmark hyperf-skeleton with JIT tracing enabled, the following message is thrown in the output of hyperf: Server::check_worker_exit_status(): worker(pid=70835, id=7) \ abnormal exit, status=0, signal=11 Checking the core dump file, it's seen that: Thread 1 "php" received signal SIGSEGV, Segmentation fault. 0x00005570c716b034 in zend_fetch_ce_from_cache_slot (cache_slot=0x8,\ type=0x7fa96f368d18) at /home/sdp/php/php-src/Zend/zend_execute.c:980 980 if (EXPECTED(HAVE_CACHE_SLOT && *cache_slot)) { EX(run_time_cache) is NULL which directly causes the issue. With JIT tracing enabled, run_time_cache is loaded in zend_jit_do_fcall(). For the case that run_time_cache is an offset, if we load it from EX:RX->func, rather than func, the SIGSEGV will be resolved. So it seems that EX:RX->func->op_array->run_time_cache__ptr and func->op_array->run_time_cache__ptr does not always keep accordance with each other during runtime. It's safe to load from EX:RX->func. This fix is verified on PHP-8.2.3 and swoole v5.0.2, with benchmark: wrk -t 10 -c 500 -d 300 http://127.0.0.1:9501/ With this fix, hyperf-skeleton can gain around 9% RPS gain with JIT tracing enabled. Signed-off-by: Cathy Zhang <[email protected]>

dstogov · 2023-02-22T14:47:51Z

@bjzhjing can you please provide a test php script?

If this is reproducible only with swoole, please give me the complete instruction, how to reproduce the failure.

bjzhjing · 2023-02-23T04:20:13Z

@dstogov Thanks for the quick response! The attachment is a details on how to reproduce this issue.
reproduce-sigsegv.zip
Please let me know if there is any step I do not explain clear.

bjzhjing · 2023-02-23T05:30:23Z

From core dump file, it can be seen that the call failure is in loadFromSwooleRequest().

Core was generated by `skeleton.Worker.7                       '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055bbd0da79be in zend_fetch_ce_from_cache_slot (cache_slot=0x8, type=0x55bbd44120a0)
    at /home/sdp/php/php-src/Zend/zend_execute.c:1009
1009            if (EXPECTED(HAVE_CACHE_SLOT && *cache_slot)) {
[Current thread is 1 (Thread 0x7f61b5b51280 (LWP 3029790))]
(gdb) print ((char*)executor_globals.current_execute_data->func->common->function_name->val)
$1 = 0x55bbd31b2c70 "getUriFromGlobals"
(gdb) print ((char*)executor_globals.current_execute_data->prev_execute_data->func->common->function_name->val)
$2 = 0x55bbd31a9cf0 "loadFromSwooleRequest"

loadFromSwooleRequest() can be found in hyperf-skeleton/vendor/hyperf/http-message/src/Server/Request.php.

    public static function loadFromSwooleRequest(Swoole\Http\Request $swooleRequest)
    {
        $server = $swooleRequest->server;
        $method = $server['request_method'] ?? 'GET';
        $headers = $swooleRequest->header ?? [];
        $uri = self::getUriFromGlobals($swooleRequest);
        $body = new SwooleStream((string) $swooleRequest->rawContent());
        $protocol = isset($server['server_protocol']) ? str_replace('HTTP/', '', $server['server_protocol']) : '1.1';
        $request = new Request($method, $uri, $headers, $body, $protocol);
        $request->cookieParams = ($swooleRequest->cookie ?? []);
        $request->queryParams = ($swooleRequest->get ?? []);
        $request->serverParams = ($server ?? []);
        $request->parsedBody = self::normalizeParsedBody($swooleRequest->post ?? [], $request);
        $request->uploadedFiles = self::normalizeFiles($swooleRequest->files ?? []);
        $request->swooleRequest = $swooleRequest;
        return $request;
    }

Both self::getUriFromGlobals($swooleRequest) and self::normalizeParsedBody($swooleRequest->post ?? [], $request) are failure points ever found in dump files.

I've done another try, that make change in loadFromSwooleRequest() by using static:: instead of self:: to call those methods while no change from PHP JIT side, there is no SIGSEGV issue. static:: will refer to the currently running class while self:: will refer to the class in which method is defined. I think it is related to the run_time_cache loading in zend_jit_do_fcall() where I changed in this PR.

bjzhjing · 2023-02-27T02:14:11Z

@dstogov Hello, are you able to reproduce this issue?

dstogov · 2023-02-27T10:53:24Z

At first, I see a bug in swoole that is visible with opcache.protect_memory=1. Swoole changes data in opcache shared memory and this may cause unexpected behavior. It must check GC_IMMUTABLE flag and prevent changes e.g. by adding if (!IS_INTERNED(str)) { in zend::String::rtrim().

==1094138==ERROR: AddressSanitizer: SEGV on unknown address 0x000007d91bb0 (pc 0x7fc0139b3b27 bp 0x7fc0091ead70 sp 0x7fc0091ead50 T0)
==1094138==The signal is caused by a WRITE memory access.
    #0 0x7fc0139b3b27 in zend::String::rtrim() /home/dmitry/php/swoole-src/ext-src/php_swoole_cxx.h:219
    #1 0x7fc0139d1848 in operator() /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:340
    #2 0x7fc0139d1fa7 in swoole::http::Context::build_header(swoole::String*, char const*, unsigned long) /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:383
    #3 0x7fc0139d4591 in swoole::http::Context::end(_zval_struct*, _zval_struct*) /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:787
    #4 0x7fc0139d3cb7 in zim_swoole_http_response_end /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:678
    #5 0x21373cd in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER /home/dmitry/php/php8.2/Zend/zend_vm_execute.h:1842
    #6 0x23bf54a in execute_ex /home/dmitry/php/php8.2/Zend/zend_vm_execute.h:56064
    #7 0x7fc01397889a in swoole::PHPCoroutine::main_func(void*) /home/dmitry/php/swoole-src/ext-src/swoole_coroutine.cc:726
    #8 0x7fc013982da3 in void std::__invoke_impl<void, void (*&)(void*), void*>(std::__invoke_other, void (*&)(void*), void*&&) /usr/include/c++/12/bits/invoke.h:61
    #9 0x7fc013981779 in std::enable_if<std::__and_<std::is_void<void>, std::__is_invocable<void (*&)(void*), void*> >::value, void>::type std::__invoke_r<void, void (*&)(void*), void*>(void (*&)(void*), void*&&) /usr/include/c++/12/bits/invoke.h:154
    #10 0x7fc013980892 in std::_Function_handler<void (void*), void (*)(void*)>::_M_invoke(std::_Any_data const&, void*&&) /usr/include/c++/12/bits/std_function.h:290
    #11 0x7fc0139179ae in std::function<void (void*)>::operator()(void*) const /usr/include/c++/12/bits/std_function.h:591
    #12 0x7fc013ab8cd8 in swoole::coroutine::Context::context_func(void*) /home/dmitry/php/swoole-src/src/coroutine/context.cc:142
    #13 0x7fc013baf2b0 in swoole_make_fcontext (/home/dmitry/php/usr/php8.2-debug-64/lib/php/extensions/debug-non-zts-20220829/swoole.so+0x5af2b0)

I'm continue investigation...

dstogov · 2023-02-27T14:57:40Z

Thanks for the report and an attempt to fix this.

I can reproduce the crash, and I see that you patch fixes it, but it fixes only a side effect of the problem. I need to understand the real reason of the problem.

It looks like, this is related to some race condition that violates normal synchronization. This may be because of some swoole trick or may be a general php bug. It looks like different workers inconsistently update op_cache.run_time_cache__ptr or something CG(map_ptr...) related.

Can I limit the number of workes of hyperf.php?

bjzhjing · 2023-02-27T15:26:26Z

At first, I see a bug in swoole that is visible with opcache.protect_memory=1. Swoole changes data in opcache shared memory and this may cause unexpected behavior. It must check GC_IMMUTABLE flag and prevent changes e.g. by adding if (!IS_INTERNED(str)) { in zend::String::rtrim().

==1094138==ERROR: AddressSanitizer: SEGV on unknown address 0x000007d91bb0 (pc 0x7fc0139b3b27 bp 0x7fc0091ead70 sp 0x7fc0091ead50 T0)
==1094138==The signal is caused by a WRITE memory access.
    #0 0x7fc0139b3b27 in zend::String::rtrim() /home/dmitry/php/swoole-src/ext-src/php_swoole_cxx.h:219
    #1 0x7fc0139d1848 in operator() /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:340
    #2 0x7fc0139d1fa7 in swoole::http::Context::build_header(swoole::String*, char const*, unsigned long) /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:383
    #3 0x7fc0139d4591 in swoole::http::Context::end(_zval_struct*, _zval_struct*) /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:787
    #4 0x7fc0139d3cb7 in zim_swoole_http_response_end /home/dmitry/php/swoole-src/ext-src/swoole_http_response.cc:678
    #5 0x21373cd in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER /home/dmitry/php/php8.2/Zend/zend_vm_execute.h:1842
    #6 0x23bf54a in execute_ex /home/dmitry/php/php8.2/Zend/zend_vm_execute.h:56064
    #7 0x7fc01397889a in swoole::PHPCoroutine::main_func(void*) /home/dmitry/php/swoole-src/ext-src/swoole_coroutine.cc:726
    #8 0x7fc013982da3 in void std::__invoke_impl<void, void (*&)(void*), void*>(std::__invoke_other, void (*&)(void*), void*&&) /usr/include/c++/12/bits/invoke.h:61
    #9 0x7fc013981779 in std::enable_if<std::__and_<std::is_void<void>, std::__is_invocable<void (*&)(void*), void*> >::value, void>::type std::__invoke_r<void, void (*&)(void*), void*>(void (*&)(void*), void*&&) /usr/include/c++/12/bits/invoke.h:154
    #10 0x7fc013980892 in std::_Function_handler<void (void*), void (*)(void*)>::_M_invoke(std::_Any_data const&, void*&&) /usr/include/c++/12/bits/std_function.h:290
    #11 0x7fc0139179ae in std::function<void (void*)>::operator()(void*) const /usr/include/c++/12/bits/std_function.h:591
    #12 0x7fc013ab8cd8 in swoole::coroutine::Context::context_func(void*) /home/dmitry/php/swoole-src/src/coroutine/context.cc:142
    #13 0x7fc013baf2b0 in swoole_make_fcontext (/home/dmitry/php/usr/php8.2-debug-64/lib/php/extensions/debug-non-zts-20220829/swoole.so+0x5af2b0)

I'm continue investigation...

Thanks for showing the logs! Do you re-compile php with -fsanitize=address to get these details? Yes, I've noticed opcache.protect_memory=1 issue, but I'm not aware of using AddressSanitizer.

bjzhjing · 2023-02-27T15:34:35Z

Thanks for the report and an attempt to fix this.

I can reproduce the crash, and I see that you patch fixes it, but it fixes only a side effect of the problem. I need to understand the real reason of the problem.

It looks like, this is related to some race condition that violates normal synchronization. This may be because of some swoole trick or may be a general php bug. It looks like different workers inconsistently update op_cache.run_time_cache__ptr or something CG(map_ptr...) related.

Can I limit the number of workes of hyperf.php?

The number of workers Constant::OPTION_WORKER_NUM is configured in file hyperf-skeleton/config/autoload/server.php:

    'settings' => [
        Constant::OPTION_ENABLE_COROUTINE => true,
        Constant::OPTION_WORKER_NUM => 10,
        Constant::OPTION_PID_FILE => BASE_PATH . '/runtime/hyperf.pid',
        Constant::OPTION_OPEN_TCP_NODELAY => true,
        Constant::OPTION_MAX_COROUTINE => 100000,
        Constant::OPTION_OPEN_HTTP2_PROTOCOL => true,
        Constant::OPTION_MAX_REQUEST => 100000,
        Constant::OPTION_SOCKET_BUFFER_SIZE => 2 * 1024 * 1024,
        Constant::OPTION_BUFFER_OUTPUT_SIZE => 2 * 1024 * 1024,
        Constant::OPTION_OPEN_CPU_AFFINITY => true,
    ],

I change it as 10 in the above example. If the number is 2, there is no SIGSEGV issue, but if it's 3, the issue happens in my environment. It might be a race condition problem.

dstogov · 2023-02-27T17:15:27Z

The real bug, should be fixed by #10719
Thank you very much for this test case.
I'm going to merge it after additional testing.
If you may check the fix, please confirm it fixes this problem.

dstogov · 2023-02-27T18:42:41Z

The fix is merged into PHP-8.2 via 2e3fc8c
Please test and reopen the issue if you see problem. In my local tests everything works fine now.

bjzhjing · 2023-02-28T03:18:32Z

@dstogov Verfied, the fix works well! Cool! I've also doubted the inheritance cache, but I'm not familiar with the code part :-)
Can I ask some questions regarding this? If inheritance cache gets NULL, does it mean func->op_array->run_time_cache__ptr is NULL? and why EX:RX->func->op_array->run_time_cache__ptr is not NULL? I think it's not, otherwise, my change can not avoid this issue.

dstogov · 2023-02-28T05:57:17Z

If inheritance cache gets NULL, does it mean func->op_array->run_time_cache__ptr is NULL? and why EX:RX->func->op_array->run_time_cache__ptr is not NULL? I think it's not, otherwise, my change can not avoid this issue.

The connection between run_time_cache__ptr and inheritance cache is not direct. Actually, because of a bug, on some race condition we might get several copies of the same linked class. Then JIT-ed code compiled by different worker and having different copies in mind, initialized methods of one copy but then expected the other. This is what you had found. op_array(s) used durig JIT compilation and during execution got different values of run_time_cache__ptr.

bjzhjing · 2023-02-28T06:51:05Z

If inheritance cache gets NULL, does it mean func->op_array->run_time_cache__ptr is NULL? and why EX:RX->func->op_array->run_time_cache__ptr is not NULL? I think it's not, otherwise, my change can not avoid this issue.

The connection between run_time_cache__ptr and inheritance cache is not direct. Actually, because of a bug, on some race condition we might get several copies of the same linked class. Then JIT-ed code compiled by different worker and having different copies in mind, initialized methods of one copy but then expected the other. This is what you had found. op_array(s) used durig JIT compilation and during execution got different values of run_time_cache__ptr.

So there is indeed a race condition issue, right? Is it in swoole or php? I've ever tried to enable ZTS, but the issue still exists in that case. Acutally, I'm a bit confused here, if there is still race condition, why you say the change in inheritance cache is the real fix for this one? Please forgive if you think I have too many questions, I'm curious about this but I'm not so familiar with the code part.

dstogov · 2023-02-28T07:28:32Z

So there is indeed a race condition issue, right? Is it in swoole or php? I've ever tried to enable ZTS, but the issue still exists in that case. Acutally, I'm a bit confused here, if there is still race condition, why you say the change in inheritance cache is the real fix for this one? Please forgive if you think I have too many questions, I'm curious about this but I'm not so familiar with the code part.

The error was in PHP.

The inheritance cache should keep a single copy of each linked class. It first checks if we already have a linked class in the cache, than obtains a lock, checks the cache once again (because different worker might cache it at the same time), than stores new linked class in the cache, and releases the lock. Unfortunately, we had a bug at the second check and this leaded to problems.

bjzhjing · 2023-02-28T08:08:11Z

The inheritance cache should keep a single copy of each linked class. It first checks if we already have a linked class in the cache, than obtains a lock, checks the cache once again (because different worker might cache it at the same time), than stores new linked class in the cache, and releases the lock. Unfortunately, we had a bug at the second check and this leaded to problems.

Ok, I see, so the bug mentioned here is what you fixed for this issue. I' ve got a general idea for the process. Thanks a bunch for the explanation!

github-actions bot added the Extension: opcache label Feb 22, 2023

Girgias added the Category: JIT label Feb 22, 2023

Girgias requested a review from dstogov February 22, 2023 14:28

dstogov mentioned this pull request Feb 27, 2023

JIT segmentation fault in PHP 8.1 #7817

Open

dstogov closed this Feb 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

JIT: Fix SIGSEGV issue while running hyperf-skeleton #10658

JIT: Fix SIGSEGV issue while running hyperf-skeleton #10658

Uh oh!

bjzhjing commented Feb 22, 2023 •

edited

Loading

Uh oh!

dstogov commented Feb 22, 2023

Uh oh!

bjzhjing commented Feb 23, 2023

Uh oh!

bjzhjing commented Feb 23, 2023 •

edited

Loading

Uh oh!

bjzhjing commented Feb 27, 2023

Uh oh!

dstogov commented Feb 27, 2023

Uh oh!

dstogov commented Feb 27, 2023

Uh oh!

bjzhjing commented Feb 27, 2023

Uh oh!

bjzhjing commented Feb 27, 2023 •

edited

Loading

Uh oh!

dstogov commented Feb 27, 2023

Uh oh!

dstogov commented Feb 27, 2023

Uh oh!

bjzhjing commented Feb 28, 2023

Uh oh!

dstogov commented Feb 28, 2023 •

edited

Loading

Uh oh!

bjzhjing commented Feb 28, 2023

Uh oh!

dstogov commented Feb 28, 2023

Uh oh!

bjzhjing commented Feb 28, 2023

Uh oh!

Uh oh!

JIT: Fix SIGSEGV issue while running hyperf-skeleton #10658

JIT: Fix SIGSEGV issue while running hyperf-skeleton #10658

Uh oh!

Conversation

bjzhjing commented Feb 22, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dstogov commented Feb 22, 2023

Uh oh!

bjzhjing commented Feb 23, 2023

Uh oh!

bjzhjing commented Feb 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bjzhjing commented Feb 27, 2023

Uh oh!

dstogov commented Feb 27, 2023

Uh oh!

dstogov commented Feb 27, 2023

Uh oh!

bjzhjing commented Feb 27, 2023

Uh oh!

bjzhjing commented Feb 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dstogov commented Feb 27, 2023

Uh oh!

dstogov commented Feb 27, 2023

Uh oh!

bjzhjing commented Feb 28, 2023

Uh oh!

dstogov commented Feb 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bjzhjing commented Feb 28, 2023

Uh oh!

dstogov commented Feb 28, 2023

Uh oh!

bjzhjing commented Feb 28, 2023

Uh oh!

Uh oh!

bjzhjing commented Feb 22, 2023 •

edited

Loading

bjzhjing commented Feb 23, 2023 •

edited

Loading

bjzhjing commented Feb 27, 2023 •

edited

Loading

dstogov commented Feb 28, 2023 •

edited

Loading