-
Notifications
You must be signed in to change notification settings - Fork 7.9k
ext/bcmath: Changed the bcmul calculation method #14213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
CI is red |
Multiplication is performed after converting to unsigned long, resulting in faster calculations.
7a944c7
to
899bb39
Compare
I need to change some of the constant names and comments. |
done! |
The red one in CI is openssl. |
There were some corrections that were omitted, so I corrected the variable names and comments. I think everything else is probably fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work. Some comments:
BC_UINT_T num = 0; | ||
BC_UINT_T base = 1; | ||
|
||
for (size_t i = 0; i < len; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I once shared a fast way of doing this that I had implemented in a WIP commit: nielsdos@20783c3
Original source: https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html
The article claims it is faster than a regular for loop that multiplies by 10.
(note: I wouldn't use the SSE SIMD approach from the article, just the 4/8 byte fast non-SIMD approach)
However, let's not do that in this PR because this PR is large enough as-is.
ext/bcmath/libbcmath/src/recmul.c
Outdated
Let u = u0 + u1*(b^n) | ||
Let v = v0 + v1*(b^n) | ||
Then uv = (B^2n+B^n)*u1*v1 + B^n*(u1-u0)*(v0-v1) + (B^n+1)*u0*v0 | ||
BC_UINT_T *buf = emalloc((n1_arr_size + n2_arr_size + prod_arr_size) * sizeof(BC_UINT_T)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allocations with multiplications shouldn't be open-coded, because they risk getting overflown.
Use safe_emalloc here please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it
} | ||
*pvptr-- = sum % BASE; | ||
sum = sum / BASE; | ||
while (pend >= pptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also had a faster way than this of writing the BCD representation back. See nielsdos@20783c3#diff-5550a7de511402b266a0fc932bd11298d2e02ef720879507b89651add38f380dR177
But we should keep that as a follow-up imo.
ext/bcmath/libbcmath/src/recmul.c
Outdated
_bc_rec_mul(u1, u1->n_len, v1, v1->n_len, &m1); | ||
/* Convert n2 to uint[] */ | ||
i = 0; | ||
while (n2len > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, we're doing the same for n1, would be great if this was factored out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed it
char *pend = pptr + prodlen - 1; | ||
i = 0; | ||
while (i < prod_arr_size - 1) { | ||
for (size_t j = 0; j < BC_MUL_UINT_DIGITS; j++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also an opportunity for the faster BCD write method for follow-up.
|
||
/* Do the multiply */ | ||
_bc_rec_mul(n1, len1, n2, len2, &pval); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing we're going to lose here now is the divide-and-conquer approach for very large numbers, and I think we should probably keep that. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Sorry, I should have explained this in the PR description.
The existing implementation uses divide-and-conquer when the number of digits is 80 or more.
This is compared in test case 2, and clearly this PR is faster, so it's not worth keeping the original implementation as is.
However, combining this PR approach with a divide and conquer approach has shown, in my local testing, to probably be nearly twice as effective.
Therefore, we should use the divide and conquer method (commonly called the Karatsuba algorithm in Japanese), but probably separate PRs.
Co-authored-by: Niels Dossche <[email protected]>
Co-authored-by: Gina Peter Banyard <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it's a good thing I allocated time for this, took me a while to think about this.
It looks right except for some small nits.
ext/bcmath/libbcmath/src/recmul.c
Outdated
*/ | ||
static void _bc_rec_mul(bc_num u, size_t ulen, bc_num v, size_t vlen, bc_num *prod) | ||
/* | ||
* If the n_values of n1 and n2 are both 4 (32-bit) or 8 (64-bit) digits or less, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see in my IDE there's zero-width-spaces on this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that Google Translate mixed it up. Fixed.
ext/bcmath/libbcmath/src/recmul.c
Outdated
* e.g. 12345678901234567890 => {34567890, 56789012, 1234} | ||
* | ||
* Multiply and add these groups of numbers to perform multiplication fast. | ||
* How much to shift the digits when adding values can be calculated from the index of the array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also zero-width-spaces on this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
|
||
bc_num d1 = bc_sub(u1, u0, 0); | ||
bc_num d2 = bc_sub(v0, v1, 0); | ||
BC_UINT_T *buf = safe_emalloc(n1_arr_size + n2_arr_size + prod_arr_size, sizeof(BC_UINT_T), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this deserves a comment to explain why the addition can't overflow in practice.
I.e. let's say that N
is the max of n1len
and n2len
(and a multiple of BC_MUL_UINT_DIGITS for simplicity), then this sum is <= N/BC_MUL_UINT_DIGITS + N/BC_MUL_UINT_DIGITS + N/BC_MUL_UINT_DIGITS + N/BC_MUL_UINT_DIGITS - 1
which is equal to N - 1
if BC_MUL_UINT_DIGITS
is 4, and N/2 - 1
if BC_MUL_UINT_DIGITS
is 8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, thx!
Co-authored-by: Niels Dossche <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems right. Maybe @Girgias wants to have a final look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me
This broken 32-bit builds, see https://github.com/php/php-src/actions/runs/9183895160/job/25255358600. @SakiTakamachi Please have a look. |
Okay, thanks! |
There are still ways to make it faster, but since it would complicate the PR, I will split it into a follow-up PR (
#14229).The difference is quite large, so I will only post rough measurements.
Bench code:
before
after