-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Bcmath multiply speedup #14276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bcmath multiply speedup #14276
Conversation
ext/bcmath/libbcmath/src/recmul.c
Outdated
|
||
/* This LUT encodes the decimal representation of numbers 0-100 | ||
* such that we can avoid taking modulos and divisions which would be slow. */ | ||
static const unsigned short LUT[100] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I'm wondering too is if it makes sense to make the table smaller. I.e. encode every entry in 8 bits and expand it on writing. This makes the code slightly slower but reduces cache pressure. Not sure what's best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current implementation, the size of the LUT is about 200 bytes? Even if change to 8bit, we only save 100 bytes, so I feel like we can leave it as is.
How do you think?
@@ -82,7 +138,7 @@ static inline void bc_convert_to_uint(BC_UINT_T *n_uint, const char *nend, size_ | |||
* If the n_values of n1 and n2 are both 4 (32-bit) or 8 (64-bit) digits or less, | |||
* the calculation will be performed at high speed without using an array. | |||
*/ | |||
static inline void bc_fast_mul(bc_num n1, size_t n1len, bc_num n2, int n2len, bc_num *prod) | |||
static inline void bc_fast_mul(bc_num n1, size_t n1len, bc_num n2, size_t n2len, bc_num *prod) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx!
ext/bcmath/libbcmath/src/recmul.c
Outdated
|
||
/* This LUT encodes the decimal representation of numbers 0-100 | ||
* such that we can avoid taking modulos and divisions which would be slow. */ | ||
static const unsigned short LUT[100] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current implementation, the size of the LUT is about 200 bytes? Even if change to 8bit, we only save 100 bytes, so I feel like we can leave it as is.
How do you think?
About the LUT: I don't know if it's worth it to spend 100 bytes or 200 bytes, I'd have to measure it, but we must take into account that cache space spent for bcmath is maybe good for code that uses a lot of bcmath, and otherwise wastes useful space for applications that only only occasionally use bcmath. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand commit 2 and the whole modulo thing.
But commit 1 and 3 are ok from what I understand :)
/* This is based on the technique described in https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html. | ||
* This function transforms AABBCCDD into 1000 * AA + 100 * BB + 10 * CC + DD, | ||
* with the caveat that all components must be in the interval [0, 25] to prevent overflow | ||
* due to the multiplication by power of 10 (10 * 25 = 250 is the largest number that fits in a byte). | ||
* The advantage of this method instead of using shifts + 3 multiplications is that this is cheaper | ||
* due to its divide-and-conquer nature. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a rather insightful post, and truly magic to me :)
Are you planning on using the SIMD intrisics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not atm, the SIMD handles 16 characters at once while we only do the byte tricks on 4/8 characters; so that doesn't line up.
Furthermore, SIMD would add even more complexity which isn't always great.
I'll rebase this after Saki's fix PR gets into master. I fixed the code style issue that Saki pointed out. I also checked at what reducing the LUT does for performance. Before LUT compression:
After LUT compression:
So on cases with a lot of iterations and a lot of different digits we see a decrease in performance, and performance has a tiny increase/stays steady for other cases. I'm tempted to keep the LUT compressed to reduce the cache pressure. |
If you profited from running the benchmarks with my framework, just reach out to me :) |
Thanks for the offer. I don't think it matters here too much because even without using fancy measurements we can see the performance benefit and what difference some nuances make. If we get down to more fine-grained optimizations we might use your framework 🙂 |
Apart from the conflicting parts, it looks very good for me. I was relieved to see that there didn't seem to be any big changes in performance when it came to LUT :) |
a27f5e8
to
ae19471
Compare
Using the benchmark from #14213
Marking as draft to run in CI as I'm not sure I got the endianness right in all places.
Before:
After "Faster writing of BCD representation":
After "Faster BCD into integer parsing":