Skip to content

Refactor BCMath _bc_do_sub #14132

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 7, 2024

Conversation

SakiTakamachi
Copy link
Member

@SakiTakamachi SakiTakamachi commented May 4, 2024

I thought about separating SIMD and other changes into separate PRs, but it would be slow to measure them individually, so I combined them into one. There seems to be a synergistic effect.

Measure using hyperfine. Warm-up is 10 times.

Cases

1:

$num1 = '1.2345678901234567890';
$num2 = '-2.12345678901234567890';

for ($i = 0; $i < 4000000; $i++) {
    bcadd($num1, $num2, 20);
}

2:

$num1 = '12345678901234567890.12345678901234567890';
$num2 = '-212345678901234567890.12345678901234567890';

for ($i = 0; $i < 4000000; $i++) {
    bcadd($num1, $num2, 20);
}

3:

$num1 = '12345678901234567890.00000000000000000000000000000000000000000000000000';
$num2 = '-212345678901234567890.00000000000000000000000000000000000000000000000000';

for ($i = 0; $i < 4000000; $i++) {
    bcadd($num1, $num2, 20);
}

compare

before results

// 1
Time (mean ± σ):     600.0 ms ±  11.6 ms    [User: 594.6 ms, System: 4.7 ms]
Range (min … max):   587.2 ms … 620.7 ms    10 runs

// 2
Time (mean ± σ):     718.3 ms ±   3.9 ms    [User: 713.8 ms, System: 3.6 ms]
Range (min … max):   713.4 ms … 722.7 ms    10 runs

// 3
Time (mean ± σ):     762.9 ms ±  16.6 ms    [User: 759.0 ms, System: 3.1 ms]
Range (min … max):   750.9 ms … 808.1 ms    10 runs

after calculate min_len + min_scale first (final state)

// 1
Time (mean ± σ):     585.4 ms ±   7.7 ms    [User: 580.7 ms, System: 3.9 ms]
Range (min … max):   573.7 ms … 598.4 ms    10 runs
 
// 2
Time (mean ± σ):     626.0 ms ±   5.3 ms    [User: 621.1 ms, System: 4.3 ms]
Range (min … max):   619.5 ms … 634.2 ms    10 runs
 
// 3
Time (mean ± σ):     713.3 ms ±   6.5 ms    [User: 708.6 ms, System: 3.9 ms]
Range (min … max):   700.9 ms … 720.5 ms    10 runs

Measurement results for each commit

after SIMD

// 1
Time (mean ± σ):     610.2 ms ±  18.5 ms    [User: 606.1 ms, System: 3.3 ms]
Range (min … max):   589.2 ms … 635.1 ms    10 runs
 
// 2
Time (mean ± σ):     657.8 ms ±   8.9 ms    [User: 653.6 ms, System: 3.3 ms]
Range (min … max):   646.3 ms … 675.9 ms    10 runs
 
// 3
Time (mean ± σ):     772.0 ms ±   8.8 ms    [User: 767.9 ms, System: 3.3 ms]
Range (min … max):   758.4 ms … 787.5 ms    10 runs

after refactor initialization

// 1
Time (mean ± σ):     592.6 ms ±   3.5 ms    [User: 588.3 ms, System: 3.5 ms]
Range (min … max):   589.3 ms … 600.7 ms    10 runs
 
// 2
Time (mean ± σ):     691.1 ms ±  55.8 ms    [User: 687.3 ms, System: 3.0 ms]
Range (min … max):   647.6 ms … 812.9 ms    10 runs
 
// 3
Time (mean ± σ):     775.6 ms ±  11.2 ms    [User: 771.6 ms, System: 3.2 ms]
Range (min … max):   762.8 ms … 801.7 ms    10 runs

after use EXPECTED

// 1
Time (mean ± σ):     595.0 ms ±   4.4 ms    [User: 590.6 ms, System: 3.6 ms]
Range (min … max):   588.1 ms … 603.0 ms    10 runs
 
// 2
Time (mean ± σ):     660.8 ms ±  16.2 ms    [User: 656.3 ms, System: 3.7 ms]
Range (min … max):   642.5 ms … 701.4 ms    10 runs
 
// 3
Time (mean ± σ):     770.1 ms ±   4.3 ms    [User: 766.1 ms, System: 3.2 ms]
Range (min … max):   763.9 ms … 777.5 ms    10 runs

FYI: without SIMD

// 1
Time (mean ± σ):     669.6 ms ±   5.0 ms    [User: 665.3 ms, System: 3.4 ms]
Range (min … max):   664.6 ms … 679.9 ms    10 runs
 
// 2
Time (mean ± σ):     774.0 ms ±  10.1 ms    [User: 770.0 ms, System: 3.2 ms]
Range (min … max):   761.1 ms … 792.2 ms    10 runs

// 3
Time (mean ± σ):     802.5 ms ±  13.4 ms    [User: 797.5 ms, System: 4.2 ms]
Range (min … max):   789.2 ms … 832.6 ms    10 runs

@nielsdos nielsdos self-requested a review May 4, 2024 14:17
Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good.
The approach is quite smart. I tried something similar for addition as I said in private chat.
Something similar to this can be implemented for addition, and then we can compare our approaches to see which one is fastest.
I just have some nits / questions.

if (diff_len != min_len) {
for (count = diff_len - min_len; count > 0; count--) {
if (n1->n_len != n2->n_len) {
for (count = n1->n_len - n2->n_len; count > 0; count--) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we had

diff_len = MAX(n1->n_len, n2->n_len);
min_len = MIN(n1->n_len, n2->n_len);

If n1->n_len was the larger of the two, then diff_len - min_len == n1->n_len - n2->n_len; otherwise diff_len - min_len == n2->n_len - n1->n_len.
So the change on line 230 is right, but I don't think that count = n1->n_len - n2->n_len; is right? Am I missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If look at the caller of this, we can see that n1 is guaranteed to be the larger value :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, n1 is assumed to be larger than n2.
Please put a ZEND_ASSERT in any case, just to be safe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ZEND_ASSERT is slower, so I combined EXPECTED with comparison.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can ZEND_ASSERT be slower? It is only a thing in a debug build

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so too, but in my environment it actually slows down in release builds....

@@ -182,7 +171,51 @@ bc_num _bc_do_sub(bc_num n1, bc_num n2, size_t scale_min)
}

/* Now do the equal length scale and integer parts. */
for (count = 0; count < min_len + min_scale; count++) {
count = 0;
if (min_bytes > 8) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (min_bytes > 8) {
if (min_bytes >= 8) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it was exactly 8 bytes, I didn't include the equal sign because I thought the overhead would be large for parallel operations. Do you think the overhead is small?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we should measure that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including equals was often slightly faster. (Almost unchanged)

diffptr++;
n1ptr++;
n2ptr++;
while (count + sizeof(uint64_t) < min_bytes) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
while (count + sizeof(uint64_t) < min_bytes) {
while (count + sizeof(uint64_t) <= min_bytes) {

@nielsdos
Copy link
Member

nielsdos commented May 4, 2024

Travis failure is legit: src/doaddsub.c:207: undefined reference to BSWAP64'`
Also, I think the code only works on little endian, so you should probably guard that.

@SakiTakamachi
Copy link
Member Author

Travis failure is legit: src/doaddsub.c:207: undefined reference to BSWAP64'`
Also, I think the code only works on little endian, so you should probably guard that.

You're right, I haven't yet gotten that far with the code, but I am aware of the problem.

Copy link
Member

@Girgias Girgias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will just say that I really do not understand any of these byte manipulation tricks. I don't mind these things to be committed, but one of you should mark themselves on the CODEOWNER file for BCMath, as I will be unable to review any such changes to the extension.

Comment on lines 94 to 103
# elif defined(__GNUC__) && ( \
__GNUC__ > 4 || ( \
__GNUC__ == 4 && ( \
__GNUC_MINOR__ >= 3 \
) \
) \
)
# define BSWAP32(u) __builtin_bswap32(u)
# define BSWAP64(u) __builtin_bswap64(u)
# endif // __has_builtin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really worthwhile? GCC 4.4.3 is old (January 2010, over 14y ago), and convoluting the code like this for an optimization doesn't look amazing to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't check it properly because I copied it directly from ext/hash. thank you.

@SakiTakamachi SakiTakamachi force-pushed the refactor_bcmath_add_sub branch 2 times, most recently from c05b269 to 294fbbd Compare May 5, 2024 12:16
Moved SWAR_ONES and SWAR_REPEAT to header, and defined bytes swap etc.
@SakiTakamachi SakiTakamachi force-pushed the refactor_bcmath_add_sub branch 4 times, most recently from 077989c to 9824755 Compare May 5, 2024 13:03
@SakiTakamachi SakiTakamachi changed the title [WIP] Refactor BCMath sub Refactor BCMath _bc_do_sub May 5, 2024
@SakiTakamachi SakiTakamachi marked this pull request as ready for review May 5, 2024 13:11
Comment on lines 129 to 131
size_t diff_len = EXPECTED(n1->n_len >= n2->n_len) ? n1->n_len : n2->n_len;
size_t min_scale = MIN(n1->n_scale, n2->n_scale);
size_t min_len = n1->n_len >= n2->n_len ? n2->n_len : n1->n_len;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, using EXPECTED in both places is slow.

@@ -166,7 +163,56 @@ bc_num _bc_do_sub(bc_num n1, bc_num n2, size_t scale_min)
}

/* Now do the equal length scale and integer parts. */
for (count = 0; count < min_len + min_scale; count++) {
size_t sub_count = 0;
Copy link
Member Author

@SakiTakamachi SakiTakamachi May 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Analyzing the measurement results, it seems that sharing the count here and resetting it once is more expensive than using a separate variable.

Perhaps it's a scope issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems odd, did you check via hyperfine? Perhaps that's just coincidental noise, hyperfine can filter that out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I measured it using hyperfine, but I'll separate the commits and measure them again individually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not much changed. In fact, it was a little late. Therefore, I will not make this change.

_bc_do_sub has been modified to use SIMD to perform faster calculations.
@SakiTakamachi SakiTakamachi force-pushed the refactor_bcmath_add_sub branch from 9824755 to 670eada Compare May 5, 2024 13:34
@SakiTakamachi SakiTakamachi requested a review from nielsdos May 5, 2024 13:35
Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some general remarks:

  • Please don't force push, it makes it more difficult to follow the changes. It's fine to have fixup commits because the history can be cleaned up after approval before merging.
  • We should start benchmarking using hyperfine to make sure we don't measure noise. With the runtime becoming much lower noise becomes a greater problem. To do that I recommend removing the timing code from your benchmark script and also reduce the number of iterations so that it executes in less than a second.

@@ -179,7 +226,7 @@ bc_num _bc_do_sub(bc_num n1, bc_num n2, size_t scale_min)

/* If n1 has more digits than n2, we now do that subtract. */
if (diff_len != min_len) {
for (count = diff_len - min_len; count > 0; count--) {
for (size_t count = diff_len - min_len; count > 0; count--) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change this back to what it was previously? Did the change have negative performance impact?
In any case, I don't think changes like these belong to a SIMD PR anyway and should be done separately.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, indeed. Separate the commits and measure each one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change this back to what it was previously? Did the change have negative performance impact?

Combining EXPECTED with a comparison was faster than using ZEND_ASSERT.

n2bytes = BC_BSWAP(n2bytes);
#endif

n1bytes -= (n2bytes + borrow);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the parentheses here

Suggested change
n1bytes -= (n2bytes + borrow);
n1bytes -= n2bytes + borrow;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed it


n1bytes -= (n2bytes + borrow);
/* If the most significant 4 bits of the 8 bytes are not 0, a carry-down has occurred. */
bool tmp_borrow = n1bytes >= ((BC_UINT_T) 0x10 << (8 * (sizeof(BC_UINT_T) - 1)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On first glance this can be simplified I think. Isn't it sufficient to only check the most significant bit instead? If that's correct, then it would be a faster check:

Suggested change
bool tmp_borrow = n1bytes >= ((BC_UINT_T) 0x10 << (8 * (sizeof(BC_UINT_T) - 1)));
bool tmp_borrow = n1bytes & ((BC_UINT_T) 1 << (8 * sizeof(BC_UINT_T) - 1));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good idea. Fixed it

* calculation, so 6 is subtracted.
* Also, set all upper 4 bits to 0.
*/
BC_UINT_T borrow_mask = (((n1bytes | (n1bytes >> 1) | (n1bytes >> 2) | (n1bytes >> 3)) & SWAR_REPEAT(0x10)) * 0x06) >> 4;
Copy link
Member

@nielsdos nielsdos May 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the following would be faster in practice, this should be measured, but what do you think of this. It avoids an additional shift.

Suggested change
BC_UINT_T borrow_mask = (((n1bytes | (n1bytes >> 1) | (n1bytes >> 2) | (n1bytes >> 3)) & SWAR_REPEAT(0x10)) * 0x06) >> 4;
BC_UINT_T borrow_mask = n1bytes | (n1bytes >> 1);
borrow_mask |= borrow_mask >> 2;
borrow_mask = ((borrow_mask & SWAR_REPEAT(0x10)) * 0x06) >> 4;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm and actually, doesn't the same thing hold true here. Why do we have to check the top 4 bits, isn't just the top bit enough?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I didn't notice that, thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed it

@@ -166,7 +163,56 @@ bc_num _bc_do_sub(bc_num n1, bc_num n2, size_t scale_min)
}

/* Now do the equal length scale and integer parts. */
for (count = 0; count < min_len + min_scale; count++) {
size_t sub_count = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems odd, did you check via hyperfine? Perhaps that's just coincidental noise, hyperfine can filter that out.

@@ -78,6 +78,66 @@ typedef struct bc_struct {
#define LONG_MAX 0x7ffffff
#endif

/* This will be 0x01010101 for 32-bit and 0x0101010101010101 for 64-bit */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this swar and byte magic stuff should be in a separate header, since it's really internal stuff. I'm not entirely sure though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to duplicate this in each file, or should I have a separate header file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a new internal BC Math header is better, such headers exist for various extensions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, do you have a good name idea?

my ideas:

  • bcmath_utils.h
  • bcmath_ex.h
  • internal_bcmath.h

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PDO drivers and GMP use the *_int.h convention, Phar and ext/com use the *_internal.h convention, MySQLi and MySQLnd use the *_priv.h convention, and ext/filter and ext/curl use the *_private.h convention.

I think thus either using bcmath_internal.h or bcmath_private.h for the name is the best idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it.
I just noticed that private.h already exists in bcmath, is it reasonable to rename it to bcmath_private.h and use this? Or do you think it's better to create a new bcmath_internal.h since that is just for libbcmath?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these new routines just used within libbcmath? In which case reusing the existing private.h header makes sense to me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thx!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved some constants and macros to private.h

@SakiTakamachi
Copy link
Member Author

I divided it into several commits and wrote the measurement results for each in the explanation.

@SakiTakamachi SakiTakamachi requested a review from nielsdos May 6, 2024 01:18
@SakiTakamachi SakiTakamachi force-pushed the refactor_bcmath_add_sub branch from ec062b2 to 2b6cbad Compare May 6, 2024 03:53
@SakiTakamachi
Copy link
Member Author

I forgot to edit some of my comments, I'll do that later.

@SakiTakamachi
Copy link
Member Author

done

Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!
I'm not quite convinced about the strange behaviour around EXPECTED making things slower... I think what you're seeing is noise by recompilation: sometimes when you recompile you get a small performance difference (both positive and negative) even if nothing really changed in the assembly. This is often because of code layout changes.
Anyway, your changes on that code don't really hurt and the SIMD makes a measurable improvement.
It is possible to do something similar for the addition.

@SakiTakamachi SakiTakamachi merged commit 0273200 into php:master May 7, 2024
9 of 10 checks passed
@SakiTakamachi SakiTakamachi deleted the refactor_bcmath_add_sub branch May 7, 2024 02:39
@kocsismate
Copy link
Member

kocsismate commented May 16, 2024

I managed to perf test this just now with the following results: https://gist.github.com/kocsismate/30116461d7d2c3cd7363254771626c38

Saki's code is consistently faster than the commit in master which her branch is based on :) The benchmarked code is basically the same one as mentioned above (I only measured the 1st and 2nd cases)

For transparency, here are the 2 scripts (I made their output compatible with bench.php and micro_bench.php so that my tooling can parse them):

<?php

// 1st case

$num1 = '1.2345678901234567890';
$num2 = '-2.12345678901234567890';

$hrtime1 = hrtime();

for ($i = 0; $i < 4000000; $i++) {
    bcadd($num1, $num2, 20);
}

$hrtime2 = hrtime();

$total = (($hrtime2[0]*1000000000 + $hrtime2[1]) / 1000000000) - (($hrtime1[0]*1000000000 + $hrtime1[1]) / 1000000000);

$pad = str_repeat("-", 27);
echo $pad."\n";
$num = number_format($total,6);
$pad = str_repeat(" ", 27-strlen("Total")-strlen($num));
echo "Total".$pad.$num."\n";

// 2nd case

<?php

$num1 = '12345678901234567890.12345678901234567890';
$num2 = '-212345678901234567890.12345678901234567890';

$hrtime1 = hrtime();

for ($i = 0; $i < 4000000; $i++) {
    bcadd($num1, $num2, 20);
}

$hrtime2 = hrtime();

$total = (($hrtime2[0]*1000000000 + $hrtime2[1]) / 1000000000) - (($hrtime1[0]*1000000000 + $hrtime1[1]) / 1000000000);

$pad = str_repeat("-", 27);
echo $pad."\n";
$num = number_format($total,6);
$pad = str_repeat(" ", 27-strlen("Total")-strlen($num));
echo "Total".$pad.$num."\n";

@SakiTakamachi
Copy link
Member Author

Thanks!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants