UTF-8 Validation optimization for NEON using mb_check_encoding #11076
Conversation
However, it does not seem to give a speed acceleration on M1 macOS...
@youkidearitai Thanks very much!!

@cmb69 @Girgias Can we import MIT-licensed code into php-src? Should we?

@youkidearitai So it looks like this code validates 16 bytes at a time. If so, a 1.7x speedup does seem less than expected. I don't have a Mac, so I can't build this code and test it. I wonder, what happens if you build the benchmark code from cyb70289/utf8 and run it? Do you get results which are similar to theirs? For x86 SIMD (SSE2, AVX, etc.), I have found that one possible performance drain is constantly reloading constant values into an SIMD register. So I wonder if that is happening in this code or not. I can see that code like
@iluuu1994 I think you are currently working on pulling my x86 SIMD UTF-8 validation routine into Zend core... please take note of what @youkidearitai is doing here as well. |
@alexdowad Thank you very much. Let me do it slowly and carefully. When I was implementing it, I found a similar source and used it as a reference. I'm trying to read the assembly code. If the license is a problem or the implementation feels too difficult, I will close this PR.
It is harder than I thought. I'm closing this PR. Thanks for the advice, @alexdowad and @Girgias.
@youkidearitai If you are no longer interested in this problem, that is fine; this is OSS and no developer is 'forced' to work on something they don't want to work on. However, if you are still interested in this problem, but are closing the PR because figuring out what is going on seems difficult, I would encourage you to give it more time. Technical challenges which seem overwhelming can often be overcome if one is persistent and tries different approaches until something works. And if you do so, you may find that you learn a lot in the process. Up to you either way. |
@alexdowad Thank you very much. I don't want to give up.
Thanks again. Please give me a little more time.
@youkidearitai I'm glad to hear you are hoping to give this PR more time. I think it will be very valuable. As mentioned above, I can't directly help you on this, but I can give you ideas of what to try. I think a good first step would be to compile @cyb70289's benchmarking code on your machine, run it, and see if you get results comparable to theirs or not. |
@alexdowad Thanks for the advice! I took the benchmark below. Base code: https://gist.github.com/youkidearitai/7cd8771f6f6e40e21708129707b40204 (master is 5823955). I think the M1 Mac is fast to begin with; on the Raspberry Pi SoC it is particularly effective (2.4x faster).

[Benchmark output elided: average timings for the master and neonutf8 branches on M1 macOS and on Raspbian (Raspberry Pi 4B+)]
I'm reading the Arm Neon Intrinsics Reference (PDF file) to study it further.
@youkidearitai So if I have this right, it looks like your Mac M1 was able to process 1.4GB (14000 byte file repeated 100,000 times) of UTF-8 text in 461ms. Is that right? If so, your computer was able to process 2896MB/sec (1024 * 1024 bytes / MB). Looks like your computer is a lot faster than @cyb70289's. @cyb70289 found that his 'range2' NEON code was 2.8 times faster than his 'naive' code. But maybe our scalar validation function might be faster than that 'naive' one. When I have a bit of time I may try benchmarking to see whether that is true. If so, it would explain the difference between your results and @cyb70289's results. |
Apple M1 is indeed much faster than the machines I used. |
Indeed, thanks for the comment. |
I'm not sure of your use case. Just want to mention that if the strings to be verified contain mostly ASCII chars, with few multi-byte chars, this library is not good for that condition. It may hurt performance.
@cyb70289 Thanks for the comment. I used your code as a reference. Thank you again.
I will consider logic to determine whether the input is ASCII.
@alexdowad Okay, I'll try the benchmark.
if all registers (16 bytes) lower than 0x7F, assumed to be ASCII.
ASCII logic is included, but it is 2:30 JST. I will benchmark further after some sleep.
If all bytes in the SIMD register are 0x7F or lower, reset the previous struct.
@cyb70289 From the git commit logs in your repository, it looks like you and @easyaspi314 are the authors of the NEON-accelerated range2 UTF-8 validation implementation. A question, please... in order for more people to benefit from your work, would you be willing to give permission for your code to be incorporated in the PHP codebase and distributed under the PHP license? I don't expect that you will allow this, but if you do, it would be appreciated. (Of course, code comments would be included identifying you as the authors and pointing readers to the original code repository.) |
@alexdowad , I'm glad you find my utf8 library useful. It's okay to use it in php under php license. |
Investigation and my opinion. One of use case of
Therefore, I want to use |
Just FYI, simdjson (also from Lemire) implements a utf-8 validation said to be much faster than other libraries. |
@cyb70289 Thanks for pointing us to simdjson. This is the actual implementation of UTF-8 validation in that library: https://github.com/simdjson/simdjson/blob/d4ac1b51d0aeb2d4f792136fe7792de709006afa/src/generic/stage1/utf8_lookup4_algorithm.h It's using Lemire's algorithm, same as simdutf. If someone is interested in benchmarking it, that might be interesting, but (at the moment) I don't see any reason to suspect that it will be faster than your implementation of Lemire's algorithm or your range/range2 algorithms. |
Thanks for those good points. Please note that you can easily add an @cyb70289 has kindly given permission for his code to be distributed under the PHP license. Both the range and range2 code includes contributions from @easyaspi314, so it would be nice to hear from him/her as well. In the meantime, @youkidearitai, I would suggest you try importing the range/range2 NEON code (whichever you choose) and start testing. There is another important issue which needs to be addressed here, but first let's just confirm that everything works fine when range/range2 is imported into mbstring. |
I honestly forgot I wrote this code lol 😅 I give my permission to use it. |
@cyb70289 @easyaspi314 Thanks for approving use of the algorithm.
fixed: I tried compiling on Raspberry Pi 1, which uses
M1 Mac benchmark (neonutf8 branch vs. master branch):
48542 / 6167 = 7.8712502026x faster

Raspberry Pi 4B+ benchmark (neonutf8 branch vs. master branch):
102887 / 58592 = 1.7559905789x faster
On second look at this code (and the original), there is a major problem: there is no short-circuit. If there is an error at the beginning of a very long string, it would still go through the entire string, forcing a full O(n) check. Perhaps the
@easyaspi314 Thank you very much!
I read https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/UMAXV--Unsigned-Maximum-across-Vector-?lang=en. Certainly it seems that it cannot be decided unless it is compiled. For example, can we use
Well, come to think of it, if we are going to be checking it frequently in the main loop, this could be more optimal. The timing for the reducing instructions is pretty bad, so instead of vmaxvq:

```c
/* Merge the error vectors */
uint8x16_t error = vorrq_u8(error1, error2);
/*
 * Take the max of each adjacent element, selecting the errors (0xFF) into
 * the low 8 elements of the vector. The upper bits are ignored.
 */
uint8x16_t error_paired = vpmaxq_u8(error, error);
/* Extract the raw bit pattern of the low 8 elements. */
uint64_t error_raw = vgetq_lane_u64(vreinterpretq_u64_u8(error_paired), 0);
/* If any bits are nonzero, there is an error. */
if (error_raw != 0) {
    return false;
}
```

This avoids the pipeline-stalling umaxv:

```asm
orr   v0.16b, v0.16b, v1.16b
umaxp v0.16b, v0.16b, v0.16b
fmov  x0, d0
cbnz  x0, .Lfalse
```

Edit: also
Use vpmaxq instead of vmaxvq and extract the low 64 bits, which has better timing and avoids the pipeline-stalling umaxv.
@easyaspi314 Thank you very much for advice. I pushed your code.
Thanks again. I fixed what I had missed 😂
I did some on-device benchmarking, and checking every 64 bytes is about 15% faster on my Tensor G1 (Cortex-X1) with
It also fixes an endianness bug, because some people like to see things burn. I will make a PR for the original repo later today if you want to take it from there, but this is the gist:

```c
#define PROCESS_NEON(num_bytes) \
    do { \
        /* Avoid a dependency on other iterations */ \
        uint8x16_t error1 = vdupq_n_u8(0); \
        uint8x16_t error2 = vdupq_n_u8(0); \
        size_t num_iters = num_bytes / sizeof(uint8x16_t); \
        /* Parse a block of data, marking any errors in error1 and error2 */ \
        for (size_t i = 0; i < num_iters; i++) { \
            (parsing code) \
        } \
        /* Check the error flags */ \
        (Test error flags) \
    } while (0)

/* How much data to process before checking the error flag. */
size_t block_size = 4 * sizeof(uint8x16_t); /* 64 bytes */

/* Process 64 bytes at a time */
while (len >= block_size) {
    PROCESS_NEON(block_size);
}

/* Process the remaining data */
if (len >= sizeof(uint8x16_t)) {
    PROCESS_NEON(len);
}

/* Check if in the middle of a sequence */
if (len) {
    const int8_t *token = (const int8_t *)(data - 3);
    size_t lookahead = 0;
    if (token[2] > (int8_t)0xBF) {
        lookahead = 1;
    } else if (token[1] > (int8_t)0xBF) {
        lookahead = 2;
    } else if (token[0] > (int8_t)0xBF) {
        lookahead = 3;
    }
    data -= lookahead;
    len += lookahead;
}
```
...
Something tells me that is a better option... 😅 Although simdjson is apache 2.0 and written in C++ so that might be a problem. It also doesn't seem to short circuit but I don't think it needs to at that speed. |
Wow!! I guess we need to look more carefully at simdjson and figure out what their secret is. I gave it a cursory look-over, but it appeared to just be an implementation of the same Lemire algorithm. Not sure what I missed. |
wow...! |
memo: Running simdjson under GDB, it seems to use multiple chunks; possibly that is the reason why it is fast?
Almost 1.48 times faster with this improvement on Raspberry Pi 4B+, but maybe this is the limit of this approach.
I took a benchmark of https://github.com/simdutf/simdutf on Raspberry Pi 4B+ with the range2 algorithm. Maybe
is_utf8, simdutf, and simdjson all use the same code for UTF-8 validation. Just with a different namespace. Also yes there is a check to determine if a block is entirely ASCII which lets the code fly by all the twiddling and stuff. That is the reason it gets 9 GB/s (or 37 GB/s in my case) on an ASCII only file. |
@easyaspi314 Thanks for the advice. I tried to make the ASCII check more efficient, which brought a small speedup. I took a benchmark using simdutf on Raspberry Pi 4B+. If we want more speed, I need some ideas 🙇
Add UTF-8 validation optimization for NEON.
However, it does not seem to give a speed acceleration on M1 macOS...
On macOS, compiling with CPPFLAGS='-g -O3 -Wall' set in configure makes it maybe 1.7x faster. The Ubuntu (GCC) build is also maybe 1.7x faster.
This pull request is mostly copied from other open-source code (MIT License).
Referred from below:
Possibly there may be an omission; I would be happy if you could find it.
I made it because I thought it would be good to discuss whether NEON should be included.
FYA @alexdowad @Girgias @pakutoma