Refactor BCMath `_bc_do_sub` #14132

SakiTakamachi · 2024-05-04T14:09:51Z

I thought about separating SIMD and other changes into separate PRs, but it would be slow to measure them individually, so I combined them into one. There seems to be a synergistic effect.

Measure using hyperfine. Warm-up is 10 times.

Cases

1:

$num1 = '1.2345678901234567890';
$num2 = '-2.12345678901234567890';

for ($i = 0; $i < 4000000; $i++) {
    bcadd($num1, $num2, 20);
}

2:

$num1 = '12345678901234567890.12345678901234567890';
$num2 = '-212345678901234567890.12345678901234567890';

for ($i = 0; $i < 4000000; $i++) {
    bcadd($num1, $num2, 20);
}

3:

$num1 = '12345678901234567890.00000000000000000000000000000000000000000000000000';
$num2 = '-212345678901234567890.00000000000000000000000000000000000000000000000000';

for ($i = 0; $i < 4000000; $i++) {
    bcadd($num1, $num2, 20);
}

compare

before results

// 1
Time (mean ± σ):     600.0 ms ±  11.6 ms    [User: 594.6 ms, System: 4.7 ms]
Range (min … max):   587.2 ms … 620.7 ms    10 runs

// 2
Time (mean ± σ):     718.3 ms ±   3.9 ms    [User: 713.8 ms, System: 3.6 ms]
Range (min … max):   713.4 ms … 722.7 ms    10 runs

// 3
Time (mean ± σ):     762.9 ms ±  16.6 ms    [User: 759.0 ms, System: 3.1 ms]
Range (min … max):   750.9 ms … 808.1 ms    10 runs

after calculate `min_len + min_scale first` (final state)

// 1
Time (mean ± σ):     585.4 ms ±   7.7 ms    [User: 580.7 ms, System: 3.9 ms]
Range (min … max):   573.7 ms … 598.4 ms    10 runs
 
// 2
Time (mean ± σ):     626.0 ms ±   5.3 ms    [User: 621.1 ms, System: 4.3 ms]
Range (min … max):   619.5 ms … 634.2 ms    10 runs
 
// 3
Time (mean ± σ):     713.3 ms ±   6.5 ms    [User: 708.6 ms, System: 3.9 ms]
Range (min … max):   700.9 ms … 720.5 ms    10 runs

Measurement results for each commit

after SIMD

// 1
Time (mean ± σ):     610.2 ms ±  18.5 ms    [User: 606.1 ms, System: 3.3 ms]
Range (min … max):   589.2 ms … 635.1 ms    10 runs
 
// 2
Time (mean ± σ):     657.8 ms ±   8.9 ms    [User: 653.6 ms, System: 3.3 ms]
Range (min … max):   646.3 ms … 675.9 ms    10 runs
 
// 3
Time (mean ± σ):     772.0 ms ±   8.8 ms    [User: 767.9 ms, System: 3.3 ms]
Range (min … max):   758.4 ms … 787.5 ms    10 runs

after refactor initialization

// 1
Time (mean ± σ):     592.6 ms ±   3.5 ms    [User: 588.3 ms, System: 3.5 ms]
Range (min … max):   589.3 ms … 600.7 ms    10 runs
 
// 2
Time (mean ± σ):     691.1 ms ±  55.8 ms    [User: 687.3 ms, System: 3.0 ms]
Range (min … max):   647.6 ms … 812.9 ms    10 runs
 
// 3
Time (mean ± σ):     775.6 ms ±  11.2 ms    [User: 771.6 ms, System: 3.2 ms]
Range (min … max):   762.8 ms … 801.7 ms    10 runs

after use EXPECTED

// 1
Time (mean ± σ):     595.0 ms ±   4.4 ms    [User: 590.6 ms, System: 3.6 ms]
Range (min … max):   588.1 ms … 603.0 ms    10 runs
 
// 2
Time (mean ± σ):     660.8 ms ±  16.2 ms    [User: 656.3 ms, System: 3.7 ms]
Range (min … max):   642.5 ms … 701.4 ms    10 runs
 
// 3
Time (mean ± σ):     770.1 ms ±   4.3 ms    [User: 766.1 ms, System: 3.2 ms]
Range (min … max):   763.9 ms … 777.5 ms    10 runs

FYI: without SIMD

// 1
Time (mean ± σ):     669.6 ms ±   5.0 ms    [User: 665.3 ms, System: 3.4 ms]
Range (min … max):   664.6 ms … 679.9 ms    10 runs
 
// 2
Time (mean ± σ):     774.0 ms ±  10.1 ms    [User: 770.0 ms, System: 3.2 ms]
Range (min … max):   761.1 ms … 792.2 ms    10 runs

// 3
Time (mean ± σ):     802.5 ms ±  13.4 ms    [User: 797.5 ms, System: 4.2 ms]
Range (min … max):   789.2 ms … 832.6 ms    10 runs

nielsdos

This looks pretty good.
The approach is quite smart. I tried something similar for addition as I said in private chat.
Something similar to this can be implemented for addition, and then we can compare our approaches to see which one is fastest.
I just have some nits / questions.

nielsdos · 2024-05-04T14:58:55Z

ext/bcmath/libbcmath/src/doaddsub.c

-	if (diff_len != min_len) {
-		for (count = diff_len - min_len; count > 0; count--) {
+	if (n1->n_len != n2->n_len) {
+		for (count = n1->n_len - n2->n_len; count > 0; count--) {


Previously we had

diff_len = MAX(n1->n_len, n2->n_len); min_len = MIN(n1->n_len, n2->n_len);

If n1->n_len was the larger of the two, then diff_len - min_len == n1->n_len - n2->n_len; otherwise diff_len - min_len == n2->n_len - n1->n_len.
So the change on line 230 is right, but I don't think that count = n1->n_len - n2->n_len; is right? Am I missing something?

If look at the caller of this, we can see that n1 is guaranteed to be the larger value :)

I see, n1 is assumed to be larger than n2.
Please put a ZEND_ASSERT in any case, just to be safe.

ZEND_ASSERT is slower, so I combined EXPECTED with comparison.

How can ZEND_ASSERT be slower? It is only a thing in a debug build

I think so too, but in my environment it actually slows down in release builds....

nielsdos · 2024-05-04T14:59:12Z

ext/bcmath/libbcmath/src/doaddsub.c

@@ -182,7 +171,51 @@ bc_num _bc_do_sub(bc_num n1, bc_num n2, size_t scale_min)
 	}

 	/* Now do the equal length scale and integer parts. */
-	for (count = 0; count < min_len + min_scale; count++) {
+	count = 0;
+	if (min_bytes > 8) {


Suggested change

if (min_bytes > 8) {

if (min_bytes >= 8) {

When it was exactly 8 bytes, I didn't include the equal sign because I thought the overhead would be large for parallel operations. Do you think the overhead is small?

Hmm, we should measure that.

Including equals was often slightly faster. (Almost unchanged)

nielsdos · 2024-05-04T14:59:17Z

ext/bcmath/libbcmath/src/doaddsub.c

+		diffptr++;
+		n1ptr++;
+		n2ptr++;
+		while (count + sizeof(uint64_t) < min_bytes) {


Suggested change

while (count + sizeof(uint64_t) < min_bytes) {

while (count + sizeof(uint64_t) <= min_bytes) {

nielsdos · 2024-05-04T15:14:50Z

Travis failure is legit: src/doaddsub.c:207: undefined reference to BSWAP64'`
Also, I think the code only works on little endian, so you should probably guard that.

SakiTakamachi · 2024-05-04T15:21:25Z

Travis failure is legit: src/doaddsub.c:207: undefined reference to BSWAP64'`
Also, I think the code only works on little endian, so you should probably guard that.

You're right, I haven't yet gotten that far with the code, but I am aware of the problem.

Girgias

I will just say that I really do not understand any of these byte manipulation tricks. I don't mind these things to be committed, but one of you should mark themselves on the CODEOWNER file for BCMath, as I will be unable to review any such changes to the extension.

Girgias · 2024-05-05T00:06:03Z

ext/bcmath/libbcmath/src/bcmath.h

+#  elif defined(__GNUC__) && ( \
+                    __GNUC__ > 4 || ( \
+                      __GNUC__ == 4 && ( \
+                        __GNUC_MINOR__ >= 3 \
+                      ) \
+                    ) \
+                  )
+#    define BSWAP32(u) __builtin_bswap32(u)
+#    define BSWAP64(u) __builtin_bswap64(u)
+#  endif // __has_builtin


Is this really worthwhile? GCC 4.4.3 is old (January 2010, over 14y ago), and convoluting the code like this for an optimization doesn't look amazing to me.

I didn't check it properly because I copied it directly from ext/hash. thank you.

ext/bcmath/libbcmath/src/bcmath.h

ext/bcmath/libbcmath/src/doaddsub.c

Moved SWAR_ONES and SWAR_REPEAT to header, and defined bytes swap etc.

SakiTakamachi · 2024-05-05T13:19:15Z

ext/bcmath/libbcmath/src/doaddsub.c

+	size_t diff_len = EXPECTED(n1->n_len >= n2->n_len) ? n1->n_len : n2->n_len;
+	size_t min_scale = MIN(n1->n_scale, n2->n_scale);
+	size_t min_len = n1->n_len >= n2->n_len ? n2->n_len : n1->n_len;


Interestingly, using EXPECTED in both places is slow.

SakiTakamachi · 2024-05-05T13:20:40Z

ext/bcmath/libbcmath/src/doaddsub.c

@@ -166,7 +163,56 @@ bc_num _bc_do_sub(bc_num n1, bc_num n2, size_t scale_min)
 	}

 	/* Now do the equal length scale and integer parts. */
-	for (count = 0; count < min_len + min_scale; count++) {
+	size_t sub_count = 0;


Analyzing the measurement results, it seems that sharing the count here and resetting it once is more expensive than using a separate variable.

Perhaps it's a scope issue?

That seems odd, did you check via hyperfine? Perhaps that's just coincidental noise, hyperfine can filter that out.

I think I measured it using hyperfine, but I'll separate the commits and measure them again individually.

Not much changed. In fact, it was a little late. Therefore, I will not make this change.

_bc_do_sub has been modified to use SIMD to perform faster calculations.

nielsdos

Some general remarks:

Please don't force push, it makes it more difficult to follow the changes. It's fine to have fixup commits because the history can be cleaned up after approval before merging.
We should start benchmarking using hyperfine to make sure we don't measure noise. With the runtime becoming much lower noise becomes a greater problem. To do that I recommend removing the timing code from your benchmark script and also reduce the number of iterations so that it executes in less than a second.

nielsdos · 2024-05-05T19:13:30Z

ext/bcmath/libbcmath/src/doaddsub.c

@@ -179,7 +226,7 @@ bc_num _bc_do_sub(bc_num n1, bc_num n2, size_t scale_min)

 	/* If n1 has more digits than n2, we now do that subtract. */
 	if (diff_len != min_len) {
-		for (count = diff_len - min_len; count > 0; count--) {
+		for (size_t count = diff_len - min_len; count > 0; count--) {


Why did you change this back to what it was previously? Did the change have negative performance impact?
In any case, I don't think changes like these belong to a SIMD PR anyway and should be done separately.

Ah, indeed. Separate the commits and measure each one.

Why did you change this back to what it was previously? Did the change have negative performance impact?

Combining EXPECTED with a comparison was faster than using ZEND_ASSERT.

nielsdos · 2024-05-05T19:13:51Z

ext/bcmath/libbcmath/src/doaddsub.c

+			n2bytes = BC_BSWAP(n2bytes);
+#endif
+
+			n1bytes -= (n2bytes + borrow);


You don't need the parentheses here

Suggested change

n1bytes -= (n2bytes + borrow);

n1bytes -= n2bytes + borrow;

nielsdos · 2024-05-05T19:18:20Z

ext/bcmath/libbcmath/src/doaddsub.c

+
+			n1bytes -= (n2bytes + borrow);
+			/* If the most significant 4 bits of the 8 bytes are not 0, a carry-down has occurred. */
+			bool tmp_borrow = n1bytes >= ((BC_UINT_T) 0x10 << (8 * (sizeof(BC_UINT_T) - 1)));


On first glance this can be simplified I think. Isn't it sufficient to only check the most significant bit instead? If that's correct, then it would be a faster check:

Suggested change

bool tmp_borrow = n1bytes >= ((BC_UINT_T) 0x10 << (8 * (sizeof(BC_UINT_T) - 1)));

bool tmp_borrow = n1bytes & ((BC_UINT_T) 1 << (8 * sizeof(BC_UINT_T) - 1));

This is good idea. Fixed it

nielsdos · 2024-05-05T19:19:45Z

ext/bcmath/libbcmath/src/doaddsub.c

+			 * calculation, so 6 is subtracted.
+			 * Also, set all upper 4 bits to 0.
+			 */
+			BC_UINT_T borrow_mask = (((n1bytes | (n1bytes >> 1) | (n1bytes >> 2) | (n1bytes >> 3)) & SWAR_REPEAT(0x10)) * 0x06) >> 4;


I'm not sure the following would be faster in practice, this should be measured, but what do you think of this. It avoids an additional shift.

Suggested change

BC_UINT_T borrow_mask = (((n1bytes | (n1bytes >> 1) | (n1bytes >> 2) | (n1bytes >> 3)) & SWAR_REPEAT(0x10)) * 0x06) >> 4;

BC_UINT_T borrow_mask = n1bytes | (n1bytes >> 1);

borrow_mask |= borrow_mask >> 2;

borrow_mask = ((borrow_mask & SWAR_REPEAT(0x10)) * 0x06) >> 4;

Hmm and actually, doesn't the same thing hold true here. Why do we have to check the top 4 bits, isn't just the top bit enough?

Indeed. I didn't notice that, thanks!

nielsdos · 2024-05-05T19:20:29Z

ext/bcmath/libbcmath/src/doaddsub.c

@@ -166,7 +163,56 @@ bc_num _bc_do_sub(bc_num n1, bc_num n2, size_t scale_min)
 	}

 	/* Now do the equal length scale and integer parts. */
-	for (count = 0; count < min_len + min_scale; count++) {
+	size_t sub_count = 0;


That seems odd, did you check via hyperfine? Perhaps that's just coincidental noise, hyperfine can filter that out.

nielsdos · 2024-05-05T19:21:08Z

ext/bcmath/libbcmath/src/bcmath.h

@@ -78,6 +78,66 @@ typedef struct bc_struct {
 #define LONG_MAX 0x7ffffff
 #endif

+/* This will be 0x01010101 for 32-bit and 0x0101010101010101 for 64-bit */


Perhaps this swar and byte magic stuff should be in a separate header, since it's really internal stuff. I'm not entirely sure though.

Is it ok to duplicate this in each file, or should I have a separate header file?

I think a new internal BC Math header is better, such headers exist for various extensions.

Thx, do you have a good name idea?

my ideas:

bcmath_utils.h

bcmath_ex.h

internal_bcmath.h

The PDO drivers and GMP use the *_int.h convention, Phar and ext/com use the *_internal.h convention, MySQLi and MySQLnd use the *_priv.h convention, and ext/filter and ext/curl use the *_private.h convention.

I think thus either using bcmath_internal.h or bcmath_private.h for the name is the best idea.

I got it.
I just noticed that private.h already exists in bcmath, is it reasonable to rename it to bcmath_private.h and use this? Or do you think it's better to create a new bcmath_internal.h since that is just for libbcmath?

Aren't these new routines just used within libbcmath? In which case reusing the existing private.h header makes sense to me

I moved some constants and macros to private.h

SakiTakamachi · 2024-05-06T01:18:31Z

I divided it into several commits and wrote the measurement results for each in the explanation.

SakiTakamachi · 2024-05-06T04:39:48Z

I forgot to edit some of my comments, I'll do that later.

SakiTakamachi · 2024-05-06T11:04:52Z

done

nielsdos

Nice work!
I'm not quite convinced about the strange behaviour around EXPECTED making things slower... I think what you're seeing is noise by recompilation: sometimes when you recompile you get a small performance difference (both positive and negative) even if nothing really changed in the assembly. This is often because of code layout changes.
Anyway, your changes on that code don't really hurt and the SIMD makes a measurable improvement.
It is possible to do something similar for the addition.

kocsismate · 2024-05-16T21:55:46Z

I managed to perf test this just now with the following results: https://gist.github.com/kocsismate/30116461d7d2c3cd7363254771626c38

Saki's code is consistently faster than the commit in master which her branch is based on :) The benchmarked code is basically the same one as mentioned above (I only measured the 1st and 2nd cases)

For transparency, here are the 2 scripts (I made their output compatible with bench.php and micro_bench.php so that my tooling can parse them):

<?php

// 1st case

$num1 = '1.2345678901234567890';
$num2 = '-2.12345678901234567890';

$hrtime1 = hrtime();

for ($i = 0; $i < 4000000; $i++) {
    bcadd($num1, $num2, 20);
}

$hrtime2 = hrtime();

$total = (($hrtime2[0]*1000000000 + $hrtime2[1]) / 1000000000) - (($hrtime1[0]*1000000000 + $hrtime1[1]) / 1000000000);

$pad = str_repeat("-", 27);
echo $pad."\n";
$num = number_format($total,6);
$pad = str_repeat(" ", 27-strlen("Total")-strlen($num));
echo "Total".$pad.$num."\n";

// 2nd case

<?php

$num1 = '12345678901234567890.12345678901234567890';
$num2 = '-212345678901234567890.12345678901234567890';

$hrtime1 = hrtime();

for ($i = 0; $i < 4000000; $i++) {
    bcadd($num1, $num2, 20);
}

$hrtime2 = hrtime();

$total = (($hrtime2[0]*1000000000 + $hrtime2[1]) / 1000000000) - (($hrtime1[0]*1000000000 + $hrtime1[1]) / 1000000000);

$pad = str_repeat("-", 27);
echo $pad."\n";
$num = number_format($total,6);
$pad = str_repeat(" ", 27-strlen("Total")-strlen($num));
echo "Total".$pad.$num."\n";

SakiTakamachi · 2024-05-16T22:53:44Z

Thanks!!

github-actions bot added the Extension: bcmath label May 4, 2024

nielsdos self-requested a review May 4, 2024 14:17

nielsdos reviewed May 4, 2024

View reviewed changes

Girgias reviewed May 5, 2024

View reviewed changes

SakiTakamachi force-pushed the refactor_bcmath_add_sub branch 2 times, most recently from c05b269 to 294fbbd Compare May 5, 2024 12:16

Added some definitions to bcmath.h

536d2f7

Moved SWAR_ONES and SWAR_REPEAT to header, and defined bytes swap etc.

SakiTakamachi force-pushed the refactor_bcmath_add_sub branch 4 times, most recently from 077989c to 9824755 Compare May 5, 2024 13:03

SakiTakamachi changed the title ~~[WIP] Refactor BCMath sub~~ Refactor BCMath _bc_do_sub May 5, 2024

SakiTakamachi marked this pull request as ready for review May 5, 2024 13:11

SakiTakamachi commented May 5, 2024

View reviewed changes

Refactor _bc_do_sub

670eada

_bc_do_sub has been modified to use SIMD to perform faster calculations.

SakiTakamachi force-pushed the refactor_bcmath_add_sub branch from 9824755 to 670eada Compare May 5, 2024 13:34

SakiTakamachi requested a review from nielsdos May 5, 2024 13:35

nielsdos requested changes May 5, 2024

View reviewed changes

SakiTakamachi added 5 commits May 6, 2024 07:20

restore it once

9a715bb

use SIMD

63f6968

refactor initialization

1dede5a

use EXPECTED

6a29fb7

Calculate min_len + min_scale first

ce78b14

SakiTakamachi requested a review from nielsdos May 6, 2024 01:18

Moved some constants and macros to private.h

2b6cbad

SakiTakamachi force-pushed the refactor_bcmath_add_sub branch from ec062b2 to 2b6cbad Compare May 6, 2024 03:53

Added and fixed comments

318601c

fixed comment

86a0084

nielsdos approved these changes May 6, 2024

View reviewed changes

SakiTakamachi merged commit 0273200 into php:master May 7, 2024
9 of 10 checks passed

SakiTakamachi deleted the refactor_bcmath_add_sub branch May 7, 2024 02:39

SakiTakamachi mentioned this pull request May 7, 2024

ext/bcmath: Use SIMD for trailing zero counts during conversion #14166

Merged

nielsdos mentioned this pull request May 7, 2024

BCMath: Avoid using the heap for temporary objects by using arena allocations #14170

Merged

SakiTakamachi mentioned this pull request May 8, 2024

ext/bcmath: Avoid unnecessary memset #14180

Merged

	while (count + sizeof(uint64_t) < min_bytes) {
	while (count + sizeof(uint64_t) <= min_bytes) {

	bool tmp_borrow = n1bytes >= ((BC_UINT_T) 0x10 << (8 * (sizeof(BC_UINT_T) - 1)));
	bool tmp_borrow = n1bytes & ((BC_UINT_T) 1 << (8 * sizeof(BC_UINT_T) - 1));

-			BC_UINT_T borrow_mask = (((n1bytes | (n1bytes >> 1) | (n1bytes >> 2) | (n1bytes >> 3)) & SWAR_REPEAT(0x10)) * 0x06) >> 4;
+			BC_UINT_T borrow_mask = n1bytes | (n1bytes >> 1);
+			borrow_mask |= borrow_mask >> 2;
+			borrow_mask = ((borrow_mask & SWAR_REPEAT(0x10)) * 0x06) >> 4;

Refactor BCMath _bc_do_sub #14132

Refactor BCMath _bc_do_sub #14132

Uh oh!

Conversation

SakiTakamachi commented May 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

compare

before results

after calculate min_len + min_scale first (final state)

Measurement results for each commit

after SIMD

after refactor initialization

after use EXPECTED

FYI: without SIMD

Uh oh!

nielsdos left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nielsdos commented May 4, 2024

Uh oh!

SakiTakamachi commented May 4, 2024

Uh oh!

Girgias left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SakiTakamachi May 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nielsdos left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Refactor BCMath `_bc_do_sub` #14132

Refactor BCMath `_bc_do_sub` #14132

SakiTakamachi commented May 4, 2024 •

edited

Loading

after calculate `min_len + min_scale first` (final state)

SakiTakamachi May 5, 2024 •

edited

Loading

nielsdos May 5, 2024 •

edited

Loading