Bcmath multiply speedup #14276


Merged: 3 commits into php:master on May 23, 2024

Conversation

nielsdos (Member):

Using the benchmark from #14213

Marking as draft to run in CI as I'm not sure I got the endianness right in all places.

Before:

1.316251039505
0.10821485519409
2.3945400714874
0.81185293197632
2.3537800312042

After "Faster writing of BCD representation":

1.1644039154053
0.094422101974487
1.7769658565521
0.66193199157715
2.0062921047211

After "Faster BCD into integer parsing":

1.0797188282013
0.08667516708374
1.4775879383087
0.42826008796692
1.6225161552429


/* This LUT encodes the decimal representation of numbers 0-100
* such that we can avoid taking modulos and divisions which would be slow. */
static const unsigned short LUT[100] = {
nielsdos (Member, PR author):

One thing I'm wondering too is if it makes sense to make the table smaller. I.e. encode every entry in 8 bits and expand it on writing. This makes the code slightly slower but reduces cache pressure. Not sure what's best.

Member:

In the current implementation, the LUT is about 200 bytes, right? Even if we change to 8-bit entries, we would only save 100 bytes, so I feel we can leave it as is.

What do you think?
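For readers following the thread, the idea behind the LUT can be sketched like this (a minimal illustration, not the PR's actual table or function names; the real code packs each entry into an `unsigned short` and writes both digits with a single 2-byte store, which is where the endianness question above comes from):

```c
#include <string.h>
#include <assert.h>

/* Minimal sketch of the LUT idea: precompute the two ASCII digits of
 * every value 0..99, so writing a digit pair needs one table lookup
 * instead of a division and a modulo. */
static char digit_pairs[100][2];

static void build_digit_pairs(void)
{
    for (int i = 0; i < 100; i++) {
        digit_pairs[i][0] = (char) ('0' + i / 10); /* tens digit */
        digit_pairs[i][1] = (char) ('0' + i % 10); /* ones digit */
    }
}

/* Write a value in [0, 99] as two ASCII digits without div/mod. */
static void write_two_digits(char *out, unsigned v)
{
    memcpy(out, digit_pairs[v], 2);
}
```

Storing the entry as two separate bytes, as here, sidesteps endianness entirely; packing it into a 16-bit integer and doing a single store (as the PR does) is faster but requires getting the byte order right.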

nielsdos marked this pull request as ready for review on May 20, 2024 at 16:54
@@ -82,7 +138,7 @@ static inline void bc_convert_to_uint(BC_UINT_T *n_uint, const char *nend, size_
* If the n_values of n1 and n2 are both 4 (32-bit) or 8 (64-bit) digits or less,
* the calculation will be performed at high speed without using an array.
*/
-static inline void bc_fast_mul(bc_num n1, size_t n1len, bc_num n2, int n2len, bc_num *prod)
+static inline void bc_fast_mul(bc_num n1, size_t n1len, bc_num n2, size_t n2len, bc_num *prod)
Member:

Thx!




nielsdos commented May 21, 2024

About the LUT: I don't know whether it's worth spending 100 bytes or 200 bytes; I'd have to measure it. But we must take into account that cache space spent on bcmath is only a good trade-off for code that uses a lot of bcmath, and otherwise wastes useful cache space for applications that use bcmath only occasionally.

Girgias (Member) left a comment:

I don't really understand commit 2 and the whole modulo thing.

But commits 1 and 3 look OK from what I understand :)

Comment on lines +51 to +67
/* This is based on the technique described in https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html.
* This function transforms AABBCCDD into 1000 * AA + 100 * BB + 10 * CC + DD,
* with the caveat that all components must be in the interval [0, 25] to prevent overflow
* due to the multiplication by power of 10 (10 * 25 = 250 is the largest number that fits in a byte).
* The advantage of this method over shifts + 3 multiplications is that it is cheaper
* due to its divide-and-conquer nature.
*/
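For context, the trick described in that comment can be sketched in a self-contained form (adapted here: bcmath stores one digit value 0..9 per byte, so no ASCII mask is needed; the function names and the `pack4` helper are illustrative, not the PR's actual code, and the byte layout mimics a little-endian load):

```c
#include <stdint.h>
#include <assert.h>

/* Divide-and-conquer parse of four BCD digit bytes, after the technique
 * in the linked article. `chunk` holds the digits as a little-endian
 * 32-bit load of {d0, d1, d2, d3} would: the most significant digit d0
 * in the lowest byte. */
static uint32_t bcd_parse4(uint32_t chunk)
{
    /* Step 1: combine adjacent digits; each 16-bit half now holds a
     * value in [0, 99]. */
    uint32_t pairs = ((chunk & 0x0f000f00) >> 8) + (chunk & 0x000f000f) * 10;
    /* Step 2: combine the two pairs into the final value in [0, 9999]. */
    return ((pairs & 0x00ff0000) >> 16) + (pairs & 0x000000ff) * 100;
}

/* Pack four digit values the way a little-endian 32-bit load of the
 * byte sequence {d0, d1, d2, d3} would, so the sketch stays portable. */
static uint32_t pack4(uint32_t d0, uint32_t d1, uint32_t d2, uint32_t d3)
{
    return d0 | (d1 << 8) | (d2 << 16) | (d3 << 24);
}
```

Two shift-add-multiply steps replace three per-digit multiplications, which is the divide-and-conquer saving the comment refers to.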
Member:

This was a rather insightful post, and truly magic to me :)

Are you planning on using the SIMD intrinsics?

nielsdos (Member, PR author):

Not at the moment: the SIMD version handles 16 characters at once, while we only apply the byte tricks to 4/8 characters, so that doesn't line up.
Furthermore, SIMD would add even more complexity which isn't always great.

nielsdos (Member, PR author):

I'll rebase this after Saki's fix PR gets into master.

I fixed the code style issue that Saki pointed out.

I also checked what reducing the LUT size does for performance.

Before LUT compression:

1.084851026535
0.089287042617798
1.5064918994904
0.43651294708252
1.648561000824

After LUT compression:

1.0662579536438
0.091488838195801
1.5670568943024
0.46629309654236
1.7126259803772

So in cases with many iterations and many distinct digits we see a decrease in performance, while performance stays steady or increases slightly in the other cases. I'm tempted to keep the LUT compressed to reduce the cache pressure.
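The compressed-LUT variant being weighed here can be sketched as follows (hypothetical names, not the PR's code; each entry packs two digits into one byte, halving the table to 100 bytes at the cost of two extra shift/mask operations when writing):

```c
#include <assert.h>

/* Sketch of a "compressed" LUT: one byte per entry, high nibble = tens
 * digit, low nibble = ones digit. Halves the table size versus 16-bit
 * entries, trading a little write-side work for less cache pressure. */
static unsigned char packed_lut[100];

static void build_packed_lut(void)
{
    for (int i = 0; i < 100; i++) {
        packed_lut[i] = (unsigned char) (((i / 10) << 4) | (i % 10));
    }
}

/* Expand a packed entry into two ASCII digits on write. */
static void write_two_digits_packed(char *out, unsigned v)
{
    unsigned char e = packed_lut[v];
    out[0] = (char) ('0' + (e >> 4));
    out[1] = (char) ('0' + (e & 0x0f));
}
```

The benchmark numbers above reflect exactly this trade-off: the expansion step costs a little in digit-heavy loops, but the smaller table leaves more cache for the rest of the application.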

kocsismate commented May 22, 2024

If you'd profit from running the benchmarks with my framework, just reach out to me :)

nielsdos (Member, PR author):

> If you'd profit from running the benchmarks with my framework, just reach out to me :)

Thanks for the offer. I don't think it matters here too much because even without using fancy measurements we can see the performance benefit and what difference some nuances make. If we get down to more fine-grained optimizations we might use your framework 🙂

SakiTakamachi (Member):

Apart from the conflicting parts, it looks very good to me.

I was relieved to see that there didn't seem to be any big changes in performance with the LUT :)

nielsdos force-pushed the bcmath-mul-speedup branch from a27f5e8 to ae19471 on May 23, 2024 at 20:38
nielsdos merged commit 4e99bb5 into php:master on May 23, 2024
6 of 10 checks passed