-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Refactor BCMath _bc_do_sub
#14132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor BCMath _bc_do_sub
#14132
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good.
The approach is quite smart. I tried something similar for addition as I said in private chat.
Something similar to this can be implemented for addition, and then we can compare our approaches to see which one is fastest.
I just have some nits / questions.
ext/bcmath/libbcmath/src/doaddsub.c
Outdated
if (diff_len != min_len) { | ||
for (count = diff_len - min_len; count > 0; count--) { | ||
if (n1->n_len != n2->n_len) { | ||
for (count = n1->n_len - n2->n_len; count > 0; count--) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously we had
diff_len = MAX(n1->n_len, n2->n_len);
min_len = MIN(n1->n_len, n2->n_len);
If n1->n_len
was the larger of the two, then diff_len - min_len == n1->n_len - n2->n_len
; otherwise diff_len - min_len == n2->n_len - n1->n_len
.
So the change on line 230 is right, but I don't think that count = n1->n_len - n2->n_len;
is right? Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If look at the caller of this, we can see that n1 is guaranteed to be the larger value :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, n1 is assumed to be larger than n2.
Please put a ZEND_ASSERT in any case, just to be safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ZEND_ASSERT is slower, so I combined EXPECTED with comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can ZEND_ASSERT be slower? It is only a thing in a debug build
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so too, but in my environment it actually slows down in release builds....
ext/bcmath/libbcmath/src/doaddsub.c
Outdated
@@ -182,7 +171,51 @@ bc_num _bc_do_sub(bc_num n1, bc_num n2, size_t scale_min) | |||
} | |||
|
|||
/* Now do the equal length scale and integer parts. */ | |||
for (count = 0; count < min_len + min_scale; count++) { | |||
count = 0; | |||
if (min_bytes > 8) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (min_bytes > 8) { | |
if (min_bytes >= 8) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When it was exactly 8 bytes, I didn't include the equal sign because I thought the overhead would be large for parallel operations. Do you think the overhead is small?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we should measure that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Including equals was often slightly faster. (Almost unchanged)
ext/bcmath/libbcmath/src/doaddsub.c
Outdated
diffptr++; | ||
n1ptr++; | ||
n2ptr++; | ||
while (count + sizeof(uint64_t) < min_bytes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while (count + sizeof(uint64_t) < min_bytes) { | |
while (count + sizeof(uint64_t) <= min_bytes) { |
Travis failure is legit: |
You're right, I haven't yet gotten that far with the code, but I am aware of the problem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will just say that I really do not understand any of these byte manipulation tricks. I don't mind these things to be committed, but one of you should mark themselves on the CODEOWNER file for BCMath, as I will be unable to review any such changes to the extension.
ext/bcmath/libbcmath/src/bcmath.h
Outdated
# elif defined(__GNUC__) && ( \ | ||
__GNUC__ > 4 || ( \ | ||
__GNUC__ == 4 && ( \ | ||
__GNUC_MINOR__ >= 3 \ | ||
) \ | ||
) \ | ||
) | ||
# define BSWAP32(u) __builtin_bswap32(u) | ||
# define BSWAP64(u) __builtin_bswap64(u) | ||
# endif // __has_builtin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really worthwhile? GCC 4.4.3 is old (January 2010, over 14y ago), and convoluting the code like this for an optimization doesn't look amazing to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't check it properly because I copied it directly from ext/hash. thank you.
c05b269
to
294fbbd
Compare
Moved SWAR_ONES and SWAR_REPEAT to header, and defined bytes swap etc.
077989c
to
9824755
Compare
ext/bcmath/libbcmath/src/doaddsub.c
Outdated
size_t diff_len = EXPECTED(n1->n_len >= n2->n_len) ? n1->n_len : n2->n_len; | ||
size_t min_scale = MIN(n1->n_scale, n2->n_scale); | ||
size_t min_len = n1->n_len >= n2->n_len ? n2->n_len : n1->n_len; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, using EXPECTED
in both places is slow.
ext/bcmath/libbcmath/src/doaddsub.c
Outdated
@@ -166,7 +163,56 @@ bc_num _bc_do_sub(bc_num n1, bc_num n2, size_t scale_min) | |||
} | |||
|
|||
/* Now do the equal length scale and integer parts. */ | |||
for (count = 0; count < min_len + min_scale; count++) { | |||
size_t sub_count = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Analyzing the measurement results, it seems that sharing the count here and resetting it once is more expensive than using a separate variable.
Perhaps it's a scope issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems odd, did you check via hyperfine? Perhaps that's just coincidental noise, hyperfine can filter that out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I measured it using hyperfine, but I'll separate the commits and measure them again individually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not much changed. In fact, it was a little late. Therefore, I will not make this change.
_bc_do_sub has been modified to use SIMD to perform faster calculations.
9824755
to
670eada
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some general remarks:
- Please don't force push, it makes it more difficult to follow the changes. It's fine to have fixup commits because the history can be cleaned up after approval before merging.
- We should start benchmarking using hyperfine to make sure we don't measure noise. With the runtime becoming much lower noise becomes a greater problem. To do that I recommend removing the timing code from your benchmark script and also reduce the number of iterations so that it executes in less than a second.
ext/bcmath/libbcmath/src/doaddsub.c
Outdated
@@ -179,7 +226,7 @@ bc_num _bc_do_sub(bc_num n1, bc_num n2, size_t scale_min) | |||
|
|||
/* If n1 has more digits than n2, we now do that subtract. */ | |||
if (diff_len != min_len) { | |||
for (count = diff_len - min_len; count > 0; count--) { | |||
for (size_t count = diff_len - min_len; count > 0; count--) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you change this back to what it was previously? Did the change have negative performance impact?
In any case, I don't think changes like these belong to a SIMD PR anyway and should be done separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, indeed. Separate the commits and measure each one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you change this back to what it was previously? Did the change have negative performance impact?
Combining EXPECTED with a comparison was faster than using ZEND_ASSERT.
ext/bcmath/libbcmath/src/doaddsub.c
Outdated
n2bytes = BC_BSWAP(n2bytes); | ||
#endif | ||
|
||
n1bytes -= (n2bytes + borrow); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need the parentheses here
n1bytes -= (n2bytes + borrow); | |
n1bytes -= n2bytes + borrow; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed it
ext/bcmath/libbcmath/src/doaddsub.c
Outdated
|
||
n1bytes -= (n2bytes + borrow); | ||
/* If the most significant 4 bits of the 8 bytes are not 0, a carry-down has occurred. */ | ||
bool tmp_borrow = n1bytes >= ((BC_UINT_T) 0x10 << (8 * (sizeof(BC_UINT_T) - 1))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On first glance this can be simplified I think. Isn't it sufficient to only check the most significant bit instead? If that's correct, then it would be a faster check:
bool tmp_borrow = n1bytes >= ((BC_UINT_T) 0x10 << (8 * (sizeof(BC_UINT_T) - 1))); | |
bool tmp_borrow = n1bytes & ((BC_UINT_T) 1 << (8 * sizeof(BC_UINT_T) - 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good idea. Fixed it
ext/bcmath/libbcmath/src/doaddsub.c
Outdated
* calculation, so 6 is subtracted. | ||
* Also, set all upper 4 bits to 0. | ||
*/ | ||
BC_UINT_T borrow_mask = (((n1bytes | (n1bytes >> 1) | (n1bytes >> 2) | (n1bytes >> 3)) & SWAR_REPEAT(0x10)) * 0x06) >> 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the following would be faster in practice, this should be measured, but what do you think of this. It avoids an additional shift.
BC_UINT_T borrow_mask = (((n1bytes | (n1bytes >> 1) | (n1bytes >> 2) | (n1bytes >> 3)) & SWAR_REPEAT(0x10)) * 0x06) >> 4; | |
BC_UINT_T borrow_mask = n1bytes | (n1bytes >> 1); | |
borrow_mask |= borrow_mask >> 2; | |
borrow_mask = ((borrow_mask & SWAR_REPEAT(0x10)) * 0x06) >> 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm and actually, doesn't the same thing hold true here. Why do we have to check the top 4 bits, isn't just the top bit enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. I didn't notice that, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed it
ext/bcmath/libbcmath/src/doaddsub.c
Outdated
@@ -166,7 +163,56 @@ bc_num _bc_do_sub(bc_num n1, bc_num n2, size_t scale_min) | |||
} | |||
|
|||
/* Now do the equal length scale and integer parts. */ | |||
for (count = 0; count < min_len + min_scale; count++) { | |||
size_t sub_count = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems odd, did you check via hyperfine? Perhaps that's just coincidental noise, hyperfine can filter that out.
ext/bcmath/libbcmath/src/bcmath.h
Outdated
@@ -78,6 +78,66 @@ typedef struct bc_struct { | |||
#define LONG_MAX 0x7ffffff | |||
#endif | |||
|
|||
/* This will be 0x01010101 for 32-bit and 0x0101010101010101 for 64-bit */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this swar and byte magic stuff should be in a separate header, since it's really internal stuff. I'm not entirely sure though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok to duplicate this in each file, or should I have a separate header file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a new internal BC Math header is better, such headers exist for various extensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, do you have a good name idea?
my ideas:
- bcmath_utils.h
- bcmath_ex.h
- internal_bcmath.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PDO drivers and GMP use the *_int.h
convention, Phar and ext/com use the *_internal.h
convention, MySQLi and MySQLnd use the *_priv.h
convention, and ext/filter and ext/curl use the *_private.h
convention.
I think thus either using bcmath_internal.h
or bcmath_private.h
for the name is the best idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got it.
I just noticed that private.h
already exists in bcmath, is it reasonable to rename it to bcmath_private.h
and use this? Or do you think it's better to create a new bcmath_internal.h
since that is just for libbcmath?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't these new routines just used within libbcmath? In which case reusing the existing private.h
header makes sense to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thx!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved some constants and macros to private.h
I divided it into several commits and wrote the measurement results for each in the explanation. |
ec062b2
to
2b6cbad
Compare
I forgot to edit some of my comments, I'll do that later. |
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
I'm not quite convinced about the strange behaviour around EXPECTED making things slower... I think what you're seeing is noise by recompilation: sometimes when you recompile you get a small performance difference (both positive and negative) even if nothing really changed in the assembly. This is often because of code layout changes.
Anyway, your changes on that code don't really hurt and the SIMD makes a measurable improvement.
It is possible to do something similar for the addition.
I managed to perf test this just now with the following results: https://gist.github.com/kocsismate/30116461d7d2c3cd7363254771626c38 Saki's code is consistently faster than the commit in master which her branch is based on :) The benchmarked code is basically the same one as mentioned above (I only measured the 1st and 2nd cases) For transparency, here are the 2 scripts (I made their output compatible with bench.php and micro_bench.php so that my tooling can parse them):
|
Thanks!! |
I thought about separating SIMD and other changes into separate PRs, but it would be slow to measure them individually, so I combined them into one. There seems to be a synergistic effect.
Measure using hyperfine. Warm-up is 10 times.
Cases
1:
2:
3:
compare
before results
after calculate
min_len + min_scale first
(final state)Measurement results for each commit
after SIMD
after refactor initialization
after use EXPECTED
FYI: without SIMD