Bcmath multiply speedup #14276


Merged: 3 commits into php:master on May 23, 2024

Conversation

nielsdos (Member):

Using the benchmark from #14213

Marking as draft to run in CI as I'm not sure I got the endianness right in all places.

Before:

1.316251039505
0.10821485519409
2.3945400714874
0.81185293197632
2.3537800312042

After "Faster writing of BCD representation":

1.1644039154053
0.094422101974487
1.7769658565521
0.66193199157715
2.0062921047211

After "Faster BCD into integer parsing":

1.0797188282013
0.08667516708374
1.4775879383087
0.42826008796692
1.6225161552429


/* This LUT encodes the decimal representation of numbers 0-100
* such that we can avoid taking modulos and divisions which would be slow. */
static const unsigned short LUT[100] = {
nielsdos (Member, PR author):

One thing I'm wondering too is if it makes sense to make the table smaller. I.e. encode every entry in 8 bits and expand it on writing. This makes the code slightly slower but reduces cache pressure. Not sure what's best.

Member:

In the current implementation, the LUT is about 200 bytes, right? Even if we change to 8-bit entries, we would only save 100 bytes, so I feel we can leave it as is.

What do you think?
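For readers following the thread, the idea behind the LUT can be sketched like this (a minimal illustration, not the PR's actual table or function names; the real code packs each entry into an `unsigned short` and writes both digits with a single 2-byte store, which is where the endianness question above comes from):

```c
#include <string.h>
#include <assert.h>

/* Minimal sketch of the LUT idea: precompute the two ASCII digits of
 * every value 0..99, so writing a digit pair needs one table lookup
 * instead of a division and a modulo. */
static char digit_pairs[100][2];

static void build_digit_pairs(void)
{
    for (int i = 0; i < 100; i++) {
        digit_pairs[i][0] = (char) ('0' + i / 10); /* tens digit */
        digit_pairs[i][1] = (char) ('0' + i % 10); /* ones digit */
    }
}

/* Write a value in [0, 99] as two ASCII digits without div/mod. */
static void write_two_digits(char *out, unsigned v)
{
    memcpy(out, digit_pairs[v], 2);
}
```

Storing the entry as two separate bytes, as here, sidesteps endianness entirely; packing it into a 16-bit integer and doing a single store (as the PR does) is faster but requires getting the byte order right.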

nielsdos marked this pull request as ready for review on May 20, 2024 at 16:54
@@ -82,7 +138,7 @@ static inline void bc_convert_to_uint(BC_UINT_T *n_uint, const char *nend, size_
* If the n_values of n1 and n2 are both 4 (32-bit) or 8 (64-bit) digits or less,
* the calculation will be performed at high speed without using an array.
*/
-static inline void bc_fast_mul(bc_num n1, size_t n1len, bc_num n2, int n2len, bc_num *prod)
+static inline void bc_fast_mul(bc_num n1, size_t n1len, bc_num n2, size_t n2len, bc_num *prod)
Member:

Thx!




nielsdos commented May 21, 2024

About the LUT: I don't know whether it's worth spending 100 bytes or 200 bytes; I'd have to measure it. But we must take into account that cache space spent on bcmath is only a good trade-off for code that uses a lot of bcmath, and otherwise wastes useful cache space for applications that use bcmath only occasionally.

Girgias (Member) left a comment:

I don't really understand commit 2 and the whole modulo thing.

But commits 1 and 3 look OK from what I understand :)

Comment on lines +51 to +67
/* This is based on the technique described in https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html.
* This function transforms AABBCCDD into 1000 * AA + 100 * BB + 10 * CC + DD,
* with the caveat that all components must be in the interval [0, 25] to prevent overflow
* due to the multiplication by power of 10 (10 * 25 = 250 is the largest number that fits in a byte).
* The advantage of this method over shifts + 3 multiplications is that it is cheaper
* due to its divide-and-conquer nature.
*/
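For context, the trick described in that comment can be sketched in a self-contained form (adapted here: bcmath stores one digit value 0..9 per byte, so no ASCII mask is needed; the function names and the `pack4` helper are illustrative, not the PR's actual code, and the byte layout mimics a little-endian load):

```c
#include <stdint.h>
#include <assert.h>

/* Divide-and-conquer parse of four BCD digit bytes, after the technique
 * in the linked article. `chunk` holds the digits as a little-endian
 * 32-bit load of {d0, d1, d2, d3} would: the most significant digit d0
 * in the lowest byte. */
static uint32_t bcd_parse4(uint32_t chunk)
{
    /* Step 1: combine adjacent digits; each 16-bit half now holds a
     * value in [0, 99]. */
    uint32_t pairs = ((chunk & 0x0f000f00) >> 8) + (chunk & 0x000f000f) * 10;
    /* Step 2: combine the two pairs into the final value in [0, 9999]. */
    return ((pairs & 0x00ff0000) >> 16) + (pairs & 0x000000ff) * 100;
}

/* Pack four digit values the way a little-endian 32-bit load of the
 * byte sequence {d0, d1, d2, d3} would, so the sketch stays portable. */
static uint32_t pack4(uint32_t d0, uint32_t d1, uint32_t d2, uint32_t d3)
{
    return d0 | (d1 << 8) | (d2 << 16) | (d3 << 24);
}
```

Two shift-add-multiply steps replace three per-digit multiplications, which is the divide-and-conquer saving the comment refers to.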
Member:

This was a rather insightful post, and truly magic to me :)

Are you planning on using the SIMD intrinsics?

nielsdos (Member, PR author):

Not at the moment: the SIMD version handles 16 characters at once, while we only apply the byte tricks to 4/8 characters, so that doesn't line up.
Furthermore, SIMD would add even more complexity which isn't always great.

nielsdos (Member, PR author):

I'll rebase this after Saki's fix PR gets into master.

I fixed the code style issue that Saki pointed out.

I also checked what reducing the LUT size does for performance.

Before LUT compression:

1.084851026535
0.089287042617798
1.5064918994904
0.43651294708252
1.648561000824

After LUT compression:

1.0662579536438
0.091488838195801
1.5670568943024
0.46629309654236
1.7126259803772

So in cases with many iterations and many distinct digits we see a decrease in performance, while performance stays steady or increases slightly in the other cases. I'm tempted to keep the LUT compressed to reduce the cache pressure.
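The compressed-LUT variant being weighed here can be sketched as follows (hypothetical names, not the PR's code; each entry packs two digits into one byte, halving the table to 100 bytes at the cost of two extra shift/mask operations when writing):

```c
#include <assert.h>

/* Sketch of a "compressed" LUT: one byte per entry, high nibble = tens
 * digit, low nibble = ones digit. Halves the table size versus 16-bit
 * entries, trading a little write-side work for less cache pressure. */
static unsigned char packed_lut[100];

static void build_packed_lut(void)
{
    for (int i = 0; i < 100; i++) {
        packed_lut[i] = (unsigned char) (((i / 10) << 4) | (i % 10));
    }
}

/* Expand a packed entry into two ASCII digits on write. */
static void write_two_digits_packed(char *out, unsigned v)
{
    unsigned char e = packed_lut[v];
    out[0] = (char) ('0' + (e >> 4));
    out[1] = (char) ('0' + (e & 0x0f));
}
```

The benchmark numbers above reflect exactly this trade-off: the expansion step costs a little in digit-heavy loops, but the smaller table leaves more cache for the rest of the application.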

kocsismate commented May 22, 2024

If you'd profit from running the benchmarks with my framework, just reach out to me :)

nielsdos (Member, PR author):

> If you'd profit from running the benchmarks with my framework, just reach out to me :)

Thanks for the offer. I don't think it matters here too much because even without using fancy measurements we can see the performance benefit and what difference some nuances make. If we get down to more fine-grained optimizations we might use your framework 🙂

SakiTakamachi (Member):

Apart from the conflicting parts, it looks very good to me.

I was relieved to see that there didn't seem to be any big changes in performance with the LUT :)

nielsdos force-pushed the bcmath-mul-speedup branch from a27f5e8 to ae19471 on May 23, 2024 at 20:38
nielsdos merged commit 4e99bb5 into php:master on May 23, 2024
6 of 10 checks passed