ext/bcmath: Use SIMD for trailing zero counts during conversion #14166

Merged: SakiTakamachi merged 7 commits into php:master from refactor_bcmath_str2num on May 9, 2024

Conversation

SakiTakamachi
Member

benchmark: #14132

before

// 1
Time (mean ± σ):     589.3 ms ±  12.8 ms    [User: 584.2 ms, System: 4.3 ms]
Range (min … max):   574.5 ms … 614.4 ms    10 runs
 
// 2
Time (mean ± σ):     638.1 ms ±   9.2 ms    [User: 634.1 ms, System: 3.2 ms]
Range (min … max):   628.0 ms … 658.0 ms    10 runs
 
// 3
Time (mean ± σ):     713.0 ms ±   6.3 ms    [User: 709.0 ms, System: 3.2 ms]
Range (min … max):   704.6 ms … 724.2 ms    10 runs

Final state (after removing unnecessary code)

// 1
Time (mean ± σ):     566.5 ms ±   4.7 ms    [User: 563.4 ms, System: 2.4 ms]
Range (min … max):   558.6 ms … 572.7 ms    10 runs
 
// 2
Time (mean ± σ):     603.4 ms ±   6.1 ms    [User: 599.2 ms, System: 3.5 ms]
Range (min … max):   594.0 ms … 613.6 ms    10 runs
 
// 3
Time (mean ± σ):     583.3 ms ±   8.0 ms    [User: 579.3 ms, System: 3.3 ms]
Range (min … max):   568.0 ms … 595.1 ms    10 runs

after SIMD

// 1
Time (mean ± σ):     591.4 ms ±   7.6 ms    [User: 587.5 ms, System: 3.1 ms]
Range (min … max):   579.6 ms … 605.6 ms    10 runs
 
// 2
Time (mean ± σ):     650.3 ms ±   7.5 ms    [User: 644.8 ms, System: 4.7 ms]
Range (min … max):   642.0 ms … 667.1 ms    10 runs
 
// 3
Time (mean ± σ):     618.2 ms ±  14.3 ms    [User: 614.2 ms, System: 3.1 ms]
Range (min … max):   602.2 ms … 642.1 ms    10 runs

after UNEXPECTED

// 1
Time (mean ± σ):     572.0 ms ±   8.0 ms    [User: 567.5 ms, System: 3.7 ms]
Range (min … max):   560.7 ms … 581.6 ms    10 runs
 
// 2
Time (mean ± σ):     603.6 ms ±   7.0 ms    [User: 599.9 ms, System: 3.1 ms]
Range (min … max):   594.1 ms … 616.5 ms    10 runs
 
// 3
Time (mean ± σ):     584.8 ms ±  13.2 ms    [User: 580.2 ms, System: 3.8 ms]
Range (min … max):   574.2 ms … 615.4 ms    10 runs

FYI: without SIMD

// 1
Time (mean ± σ):     604.4 ms ±  23.9 ms    [User: 599.6 ms, System: 4.0 ms]
Range (min … max):   588.7 ms … 669.7 ms    10 runs
 
// 2
Time (mean ± σ):     658.2 ms ±  13.6 ms    [User: 654.8 ms, System: 2.7 ms]
Range (min … max):   644.0 ms … 692.3 ms    10 runs
 
// 3
Time (mean ± σ):     789.0 ms ±   8.6 ms    [User: 784.5 ms, System: 3.7 ms]
Range (min … max):   779.7 ms … 809.5 ms    10 runs

@SakiTakamachi SakiTakamachi force-pushed the refactor_bcmath_str2num branch from e55e0e2 to fc7f7cb Compare May 7, 2024 13:44
Comment on lines 92 to 96
if (EXPECTED(mask != 0xffff)) {
/* Move the pointer back and check each character in loop. */
str += sizeof(__m128i);
break;
}
Member Author

I could also use code like the following, but in my measurements the while loop was always faster. This may be because the bit-scan version needs one extra calculation.

return str + sizeof(__m128i) - __builtin_clz(~mask);
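As a rough illustration of the two options being compared, the sketch below counts the trailing '0' characters inside one 16-byte block, once with a byte loop and once with a bit scan over the comparison mask. All helper names are hypothetical and not taken from the patch. The mask layout (bit i set when byte i equals '0') is assumed to follow `_mm_movemask_epi8`, and the sketch assumes a GCC/Clang-style `__builtin_clz` and a 32-bit `unsigned`. Because the mask only occupies the low 16 bits, this sketch shifts the inverted mask into the top 16 bits before scanning.

```c
#include <assert.h>

/* Hypothetical helper: builds the 16-bit mask that
 * _mm_movemask_epi8(_mm_cmpeq_epi8(block, zeros)) is assumed to produce,
 * i.e. bit i is set when block[i] == '0'. */
static unsigned zero_mask16(const char block[16])
{
    unsigned mask = 0;
    for (int i = 0; i < 16; i++) {
        if (block[i] == '0') {
            mask |= 1u << i;
        }
    }
    return mask;
}

/* Variant A: walk backwards one byte at a time. */
static int trailing_zero_chars_loop(const char block[16])
{
    int n = 0;
    while (n < 16 && block[15 - n] == '0') {
        n++;
    }
    return n;
}

/* Variant B: bit scan. Only valid when the block contains at least one
 * non-'0' byte (mask != 0xffff), because __builtin_clz(0) is undefined. */
static int trailing_zero_chars_clz(unsigned mask)
{
    assert(mask != 0xffffu);
    unsigned non_zero = ~mask & 0xffffu;  /* bit i set when byte i is not '0' */
    return __builtin_clz(non_zero << 16); /* leading clear bits == trailing '0' bytes */
}
```

Both variants return the same count for any block that contains a non-'0' byte. Which one is faster depends on how long the runs of zeros typically are, which is what the measurements in this thread are about.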

@SakiTakamachi SakiTakamachi marked this pull request as ready for review May 7, 2024 13:54
Member

@nielsdos nielsdos left a comment

It feels a bit weird to optimize for something that (hopefully) shouldn't happen a lot. I see a slight performance decrease for benchmark 2, a small performance increase in benchmark 1, and a huge performance increase in benchmark 3. Do we think trailing zeros are common?
Note though that I am completely fine with removing the ineffective code and using UNEXPECTED.
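For readers unfamiliar with the hint: `UNEXPECTED` is a branch-prediction annotation. A minimal sketch follows, assuming a GCC/Clang-style compiler. The macro names `SKETCH_EXPECTED` and `SKETCH_UNEXPECTED` and the function `count_trailing_zeros` are hypothetical stand-ins, not PHP's own definitions. The hint can influence code layout, but it never changes the result.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
# define SKETCH_EXPECTED(cond)   __builtin_expect(!!(cond), 1)
# define SKETCH_UNEXPECTED(cond) __builtin_expect(!!(cond), 0)
#else
# define SKETCH_EXPECTED(cond)   (cond)
# define SKETCH_UNEXPECTED(cond) (cond)
#endif

/* Counts trailing '0' characters in str[0..len).
 * The "this character is still a zero" branch is marked as the
 * unlikely one, on the assumption that long runs of zeros are rare. */
static size_t count_trailing_zeros(const char *str, size_t len)
{
    assert(str != NULL || len == 0);
    size_t n = 0;
    while (n < len) {
        if (SKETCH_UNEXPECTED(str[len - 1 - n] == '0')) {
            n++;
            continue;
        }
        break;
    }
    return n;
}
```

Swapping `SKETCH_UNEXPECTED` for `SKETCH_EXPECTED` would leave every return value unchanged, so the only way to judge the hint is by benchmarking, as is done here.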

{
/* Check in bulk */
#ifdef __SSE2__
const __m128i c_zero_repeat = _mm_set1_epi8((signed char) '0');
Member

Casting this to signed char shouldn't be necessary.

@@ -76,6 +76,35 @@ static const char *bc_count_digits(const char *str, const char *end)
return str;
}

static inline const char *bc_skip_zero_reverse(const char *str, const char *end)
Member

The argument names are swapped, which makes it very confusing.

@SakiTakamachi
Member Author

SakiTakamachi commented May 7, 2024

@nielsdos

Does this patch mean that Benchmark 2 is a bit slower in your environment?

Could it be that the speedup in my measurements with this patch is due to the use of UNEXPECTED and the removal of unnecessary code, and that SIMD has a negative effect on benchmark 2?
(I am concerned that, as you said before, measurements may be faster or slower immediately after compilation.)

(edit)

Or maybe the order of the measurements in the description is confusing? The first is before applying the patch, the second is the final state, and the rest are measured per commit, so you should compare the first and second.

@SakiTakamachi
Member Author

Do we think trailing zeros are common?

Trailing zeros are probably quite common given the use cases for BCMath, but values with 16 decimal digits are probably quite rare.

I opened this PR because it improved performance in all cases in my environment, but if that does not hold elsewhere, I won't insist on using SIMD here.
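The relevance of 16 digits comes from the block size: a 128-bit SSE2 register holds 16 bytes, so a bulk check can only run while at least 16 characters remain to be scanned. Below is a minimal sketch of that shape. It is not the merged code, and the function name `trim_trailing_zeros` and its `[begin, end)` convention are assumptions made for illustration. Without SSE2 it falls back to the scalar loop alone.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Returns a pointer one past the last character in [begin, end) that is
 * not '0'. If every character is '0', returns begin. */
static const char *trim_trailing_zeros(const char *begin, const char *end)
{
    assert(begin <= end);
#ifdef __SSE2__
    const __m128i zeros = _mm_set1_epi8('0');
    /* Bulk path: only entered while 16 or more bytes remain. */
    while (end - begin >= (ptrdiff_t) sizeof(__m128i)) {
        const char *block = end - sizeof(__m128i);
        __m128i chunk = _mm_loadu_si128((const __m128i *) block);
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zeros));
        if (mask != 0xffff) {
            /* At least one non-'0' byte: let the scalar loop find it. */
            break;
        }
        end = block; /* all 16 bytes were '0', skip the whole block */
    }
#endif
    /* Scalar path: handles short inputs and the final partial run. */
    while (end > begin && end[-1] == '0') {
        end--;
    }
    return end;
}
```

For an input such as "1.2300" the bulk loop never runs, so any gain has to come from inputs with long runs of trailing zeros. That would be consistent with benchmark 3 showing the largest change.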

@nielsdos
Member

nielsdos commented May 8, 2024

These are the results I'm getting:

Benchmark 1: ./sapi/cli/php 1.php
  Time (mean ± σ):     468.3 ms ±   9.7 ms    [User: 463.6 ms, System: 1.9 ms]
  Range (min … max):   457.7 ms … 486.1 ms    10 runs
 
Benchmark 2: ./sapi/cli/php_old 1.php
  Time (mean ± σ):     450.9 ms ±   3.6 ms    [User: 448.1 ms, System: 2.5 ms]
  Range (min … max):   446.3 ms … 457.0 ms    10 runs
 
Summary
  ./sapi/cli/php_old 1.php ran
    1.04 ± 0.02 times faster than ./sapi/cli/php 1.php
Benchmark 1: ./sapi/cli/php 2.php
  Time (mean ± σ):     535.1 ms ±  18.0 ms    [User: 531.7 ms, System: 2.8 ms]
  Range (min … max):   517.8 ms … 578.0 ms    10 runs
 
Benchmark 2: ./sapi/cli/php_old 2.php
  Time (mean ± σ):     527.9 ms ±  12.1 ms    [User: 525.9 ms, System: 1.5 ms]
  Range (min … max):   517.1 ms … 552.8 ms    10 runs
 
Summary
  ./sapi/cli/php_old 2.php ran
    1.01 ± 0.04 times faster than ./sapi/cli/php 2.php
Benchmark 1: ./sapi/cli/php 3.php
  Time (mean ± σ):     496.5 ms ±   8.1 ms    [User: 493.8 ms, System: 2.2 ms]
  Range (min … max):   490.5 ms … 515.1 ms    10 runs
 
Benchmark 2: ./sapi/cli/php_old 3.php
  Time (mean ± σ):     613.2 ms ±  19.1 ms    [User: 610.5 ms, System: 2.2 ms]
  Range (min … max):   602.9 ms … 666.7 ms    10 runs

Summary
  ./sapi/cli/php 3.php ran
    1.24 ± 0.04 times faster than ./sapi/cli/php_old 3.php
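
The summary figures can be reproduced from the means and standard deviations printed above. The worked example below uses the 3.php numbers and assumes the ± value comes from first-order error propagation for a ratio of two independent measurements. That assumption is not taken from hyperfine's documentation, but it agrees with all three summaries in this output.

```latex
\[
r = \frac{\mu_{\text{old}}}{\mu_{\text{new}}} = \frac{613.2}{496.5} \approx 1.235
\]
\[
\sigma_r \approx r \sqrt{\left(\frac{\sigma_{\text{old}}}{\mu_{\text{old}}}\right)^{2}
                       + \left(\frac{\sigma_{\text{new}}}{\mu_{\text{new}}}\right)^{2}}
         = 1.235 \sqrt{\left(\frac{19.1}{613.2}\right)^{2}
                     + \left(\frac{8.1}{496.5}\right)^{2}}
         \approx 0.043
\]
```

Rounded to two decimals this gives 1.24 ± 0.04. The same calculation gives 1.04 ± 0.02 for 1.php and 1.01 ± 0.04 for 2.php, so for 2.php the gap between the two builds is smaller than the measurement noise.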

Member

@nielsdos nielsdos left a comment

I'm fine with accepting this: the degradation for 2.php isn't severe and there are improvements for the other cases.
On second thought, I'm fine with the argument names, but please remove the redundant cast upon merging. Thanks.

@SakiTakamachi
Member Author

Thanks. As with #14180, I will prepare a more stable benchmark environment and try measuring again.

@SakiTakamachi
Member Author

SakiTakamachi commented May 9, 2024

@nielsdos
I also merged the latest master into this branch and compared the two on EC2.

master:

hyperfine "php 1.php" --warmup 10
Time (mean ± σ):     654.7 ms ±   3.0 ms    [User: 650.5 ms, System: 2.6 ms]
Range (min … max):   650.1 ms … 659.2 ms    10 runs

hyperfine "php 2.php" --warmup 10
Time (mean ± σ):     769.4 ms ±   5.4 ms    [User: 765.8 ms, System: 2.0 ms]
Range (min … max):   762.6 ms … 781.8 ms    10 runs

hyperfine "php 3.php" --warmup 10
Time (mean ± σ):     910.3 ms ±  13.3 ms    [User: 905.6 ms, System: 2.8 ms]
Range (min … max):   896.8 ms … 934.6 ms    10 runs

php old.php // my old bench
1.6298861503601
1.9048039913177
2.2188358306885

this branch:

hyperfine "php 1.php" --warmup 10
Time (mean ± σ):     643.6 ms ±   6.5 ms    [User: 638.7 ms, System: 3.2 ms]
Range (min … max):   637.0 ms … 656.7 ms    10 runs

hyperfine "php 2.php" --warmup 10
Time (mean ± σ):     749.7 ms ±   6.4 ms    [User: 745.4 ms, System: 2.4 ms]
Range (min … max):   742.3 ms … 766.6 ms    10 runs

hyperfine "php 3.php" --warmup 10
Time (mean ± σ):     684.7 ms ±  10.7 ms    [User: 680.8 ms, System: 2.5 ms]
Range (min … max):   673.4 ms … 707.8 ms    10 runs

php old.php // my old bench
1.5792031288147
1.8460278511047
1.6792199611664

@SakiTakamachi
Member Author

I removed the unnecessary cast and changed the variable name slightly. If the variable names are okay, merge this.

@SakiTakamachi SakiTakamachi force-pushed the refactor_bcmath_str2num branch from 03bc6bb to 323e144 Compare May 9, 2024 00:23
@nielsdos
Member

nielsdos commented May 9, 2024

Scanner is written with double n

@SakiTakamachi
Member Author

Thanks, I didn't notice at all

@SakiTakamachi SakiTakamachi merged commit 1a3d870 into php:master May 9, 2024
10 checks passed
@SakiTakamachi SakiTakamachi deleted the refactor_bcmath_str2num branch May 9, 2024 10:24