UTF-8 Validation optimization for NEON using mb_check_encoding #11076

Open
wants to merge 12 commits into
base: master

youkidearitai
Contributor

Add UTF-8 Validation optimization for NEON.

However, it does not seem to give a speedup on M1 macOS with the default build.
On macOS, when configured with CPPFLAGS='-g -O3 -Wall', it is maybe 1.7x faster.
On Ubuntu (GCC), it is maybe 1.7x faster.

This pull request is almost entirely copied from another open-source project (MIT License).

Referenced from:

There may be omissions. I would be happy if you could point them out.

I opened this PR because I thought it would be good to discuss whether NEON support should be included.

FYA @alexdowad @Girgias @pakutoma

@alexdowad
Contributor

@youkidearitai Thanks very much!!

@cmb69 @Girgias Can we import MIT-licensed code into php-src? Should we?

@youkidearitai So it looks like this code validates 16 bytes at a time. If so, a 1.7x speedup does seem less than expected.

I don't have a Mac, so I can't build this code and test it. I wonder, what happens if you build the benchmark code from cyb70289/utf8 and run it? Do you get results which are similar to theirs?

For x86 SIMD (SSE2, AVX, etc), I have found that one possible performance drain is constantly reloading constant values into an SIMD register. So I wonder if that is happening in this code or not. I can see that code like vdupq_n_u8(0xF4) is used to get an SIMD register with a constant value... does it run faster if you assign that to a variable outside the loop and then use the variable instead? (Alternatively, if you are able to read assembly code, you could look at the disassembly and see if the compiler is pulling that operation outside of the loop or not.)
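
As a rough illustration of that suggestion, here is a minimal sketch that builds the splatted 0xF4 constant once, before the loop, instead of on every iteration. The function name `has_byte_above_f4` is hypothetical and is not taken from the PR; the NEON path is only compiled on AArch64, and a scalar loop handles the tail (and every other platform).

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the PR): returns 1 if any byte in buf is
 * greater than 0xF4. Such bytes can never appear in valid UTF-8. */
static int has_byte_above_f4(const uint8_t *buf, size_t len)
{
    size_t i = 0;
#if defined(__aarch64__) && defined(__ARM_NEON)
    /* Create the constant vector once, outside the loop. */
    const uint8x16_t max_byte = vdupq_n_u8(0xF4);
    uint8x16_t acc = vdupq_n_u8(0);
    for (; i + 16 <= len; i += 16) {
        uint8x16_t chunk = vld1q_u8(buf + i);
        /* Saturating subtract is non-zero only in lanes where chunk > 0xF4. */
        acc = vorrq_u8(acc, vqsubq_u8(chunk, max_byte));
    }
    if (vmaxvq_u8(acc) != 0) {
        return 1;
    }
#endif
    /* Scalar tail; also the whole loop when NEON is unavailable. */
    for (; i < len; i++) {
        if (buf[i] > 0xF4) {
            return 1;
        }
    }
    return 0;
}
```

An optimizing compiler will often hoist the constant on its own, so the disassembly of an optimized build is the place to check whether writing it this way changes anything.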

@alexdowad
Contributor

@iluuu1994 I think you are currently working on pulling my x86 SIMD UTF-8 validation routine into Zend core... please take note of what @youkidearitai is doing here as well.

@youkidearitai
Contributor Author

@alexdowad Thank you very much. Let me take it slowly and carefully. While implementing this, I found similar source code and used it as a reference. I'm trying to read the assembly code.

If the license is a problem or the implementation turns out to be too difficult, I will close this PR.

@Girgias
Member

Girgias commented Apr 14, 2023

I think bundling MIT code is fine, but @bukka or @derickr would know better.

Also I'm currently on holiday for 3 weeks so won't be doing any reviews for the time being (except if I somehow become very bored).

@youkidearitai
Contributor Author

It is harder than I thought. I'm closing this PR. Thanks for the advice, @alexdowad and @Girgias.

When the --enable-debug option is enabled, NEON seems to be disabled. This seems difficult.

php`neon_check_utf8_bytes:
    0x100221df8 <+0>:    stp    x28, x27, [sp, #-0x10]!
    0x100221dfc <+4>:    sub    sp, sp, #0xac0
    0x100221e00 <+8>:    str    q0, [sp, #0x2c0]
    0x100221e04 <+12>:   str    x0, [sp, #0x2b8]
    0x100221e08 <+16>:   str    x1, [sp, #0x2b0]
->  0x100221e0c <+20>:   ldr    q0, [sp, #0x2c0]
    0x100221e10 <+24>:   str    q0, [sp, #0x280]
    0x100221e14 <+28>:   ldr    q0, [sp, #0x2c0]
    0x100221e18 <+32>:   str    q0, [sp, #0x750]
    0x100221e1c <+36>:   ldr    q0, [sp, #0x750]
    0x100221e20 <+40>:   str    q0, [sp, #0x740]
    0x100221e24 <+44>:   ldr    q0, [sp, #0x740]
    0x100221e28 <+48>:   str    q0, [sp, #0x260]
    0x100221e2c <+52>:   ldr    q0, [sp, #0x260]
    0x100221e30 <+56>:   ushr.16b v0, v0, #0x4
    0x100221e34 <+60>:   str    q0, [sp, #0x270]
    0x100221e38 <+64>:   ldr    q0, [sp, #0x270]
    0x100221e3c <+68>:   str    q0, [sp, #0x250]
    0x100221e40 <+72>:   ldr    q0, [sp, #0x250]
    0x100221e44 <+76>:   str    q0, [sp, #0x550]
    0x100221e48 <+80>:   ldr    q0, [sp, #0x550]
    0x100221e4c <+84>:   str    q0, [sp, #0x540]
    0x100221e50 <+88>:   ldr    q0, [sp, #0x540]
    0x100221e54 <+92>:   str    q0, [sp, #0x290]
    0x100221e58 <+96>:   ldr    x8, [sp, #0x2b0]
    0x100221e5c <+100>:  ldr    q1, [x8]
    0x100221e60 <+104>:  ldr    q0, [sp, #0x2c0]
    0x100221e64 <+108>:  str    q0, [sp, #0x730]
    0x100221e68 <+112>:  ldr    q0, [sp, #0x730]
    0x100221e6c <+116>:  str    q0, [sp, #0x720]
    0x100221e70 <+120>:  ldr    q2, [sp, #0x720]
    0x100221e74 <+124>:  mov    w8, #0xf4
    0x100221e78 <+128>:  strb   w8, [sp, #0x87f]
    0x100221e7c <+132>:  add    x9, sp, #0x87f
    0x100221e80 <+136>:  ld1r.16b { v0 }, [x9]
    0x100221e84 <+140>:  str    q0, [sp, #0x850]
    0x100221e88 <+144>:  ldr    q0, [sp, #0x850]
    0x100221e8c <+148>:  str    q0, [sp, #0x860]
    0x100221e90 <+152>:  ldr    q0, [sp, #0x860]
    0x100221e94 <+156>:  str    q2, [sp, #0x7e0]
    0x100221e98 <+160>:  str    q0, [sp, #0x7d0]
    0x100221e9c <+164>:  ldr    q0, [sp, #0x7e0]
    0x100221ea0 <+168>:  ldr    q2, [sp, #0x7d0]
    0x100221ea4 <+172>:  uqsub.16b v0, v0, v2
    0x100221ea8 <+176>:  str    q0, [sp, #0x7c0]
    0x100221eac <+180>:  ldr    q0, [sp, #0x7c0]
    0x100221eb0 <+184>:  str    q0, [sp, #0x530]
    0x100221eb4 <+188>:  ldr    q0, [sp, #0x530]
    0x100221eb8 <+192>:  str    q0, [sp, #0x520]
    0x100221ebc <+196>:  ldr    q0, [sp, #0x520]
    0x100221ec0 <+200>:  str    q1, [sp, #0x470]
    0x100221ec4 <+204>:  str    q0, [sp, #0x460]
    0x100221ec8 <+208>:  ldr    q0, [sp, #0x470]
    0x100221ecc <+212>:  ldr    q1, [sp, #0x460]
    0x100221ed0 <+216>:  orr.16b v0, v0, v1
    0x100221ed4 <+220>:  str    q0, [sp, #0x450]
    0x100221ed8 <+224>:  ldr    q0, [sp, #0x450]
    0x100221edc <+228>:  ldr    x9, [sp, #0x2b0]
    0x100221ee0 <+232>:  str    q0, [x9]
    0x100221ee4 <+236>:  adrp   x9, 3566
    0x100221ee8 <+240>:  add    x9, x9, #0x270            ; neon_check_utf8_bytes._nibbles
    0x100221eec <+244>:  ldr    q0, [x9]
    0x100221ef0 <+248>:  str    q0, [sp, #0x230]
    0x100221ef4 <+252>:  ldr    q0, [sp, #0x230]
    0x100221ef8 <+256>:  str    q0, [sp, #0x220]
    0x100221efc <+260>:  ldr    q1, [sp, #0x220]
    0x100221f00 <+264>:  ldr    q0, [sp, #0x290]
    0x100221f04 <+268>:  str    q0, [sp, #0x710]
    0x100221f08 <+272>:  ldr    q0, [sp, #0x710]
    0x100221f0c <+276>:  str    q0, [sp, #0x700]
    0x100221f10 <+280>:  ldr    q0, [sp, #0x700]
    0x100221f14 <+284>:  str    q1, [sp, #0x900]
    0x100221f18 <+288>:  str    q0, [sp, #0x8f0]
    0x100221f1c <+292>:  ldr    q0, [sp, #0x900]
    0x100221f20 <+296>:  ldr    q1, [sp, #0x8f0]
    0x100221f24 <+300>:  tbl.16b v0, { v0 }, v1
    0x100221f28 <+304>:  str    q0, [sp, #0x8e0]
    0x100221f2c <+308>:  ldr    q0, [sp, #0x8e0]
    0x100221f30 <+312>:  str    q0, [sp, #0x240]
    0x100221f34 <+316>:  ldr    x9, [sp, #0x2b8]
    0x100221f38 <+320>:  ldr    q0, [x9, #0x20]
    0x100221f3c <+324>:  str    q0, [sp, #0x1f0]
    0x100221f40 <+328>:  ldr    q0, [sp, #0x240]
    0x100221f44 <+332>:  str    q0, [sp, #0x1e0]
    0x100221f48 <+336>:  ldr    q0, [sp, #0x1f0]
    0x100221f4c <+340>:  ldr    q1, [sp, #0x1e0]
    0x100221f50 <+344>:  ext.16b v0, v0, v1, #0xf
    0x100221f54 <+348>:  str    q0, [sp, #0x200]
    0x100221f58 <+352>:  ldr    q0, [sp, #0x200]
    0x100221f5c <+356>:  str    q0, [sp, #0x1d0]
    0x100221f60 <+360>:  ldr    q0, [sp, #0x1d0]
    0x100221f64 <+364>:  str    q0, [sp, #0x6f0]
    0x100221f68 <+368>:  ldr    q0, [sp, #0x6f0]
    0x100221f6c <+372>:  str    q0, [sp, #0x6e0]
    0x100221f70 <+376>:  ldr    q1, [sp, #0x6e0]
    0x100221f74 <+380>:  mov    w9, #0x1
    0x100221f78 <+384>:  strb   w9, [sp, #0x84f]
    0x100221f7c <+388>:  add    x9, sp, #0x84f
    0x100221f80 <+392>:  ld1r.16b { v0 }, [x9]
    0x100221f84 <+396>:  str    q0, [sp, #0x820]
    0x100221f88 <+400>:  ldr    q0, [sp, #0x820]
    0x100221f8c <+404>:  str    q0, [sp, #0x830]
    0x100221f90 <+408>:  ldr    q0, [sp, #0x830]
    0x100221f94 <+412>:  str    q1, [sp, #0x7b0]
    0x100221f98 <+416>:  str    q0, [sp, #0x7a0]
    0x100221f9c <+420>:  ldr    q0, [sp, #0x7b0]
    0x100221fa0 <+424>:  ldr    q1, [sp, #0x7a0]
    0x100221fa4 <+428>:  uqsub.16b v0, v0, v1
    0x100221fa8 <+432>:  str    q0, [sp, #0x790]
    0x100221fac <+436>:  ldr    q0, [sp, #0x790]
    0x100221fb0 <+440>:  str    q0, [sp, #0x510]
    0x100221fb4 <+444>:  ldr    q0, [sp, #0x510]
    0x100221fb8 <+448>:  str    q0, [sp, #0x500]
    0x100221fbc <+452>:  ldr    q0, [sp, #0x500]
    0x100221fc0 <+456>:  str    q0, [sp, #0x210]
    0x100221fc4 <+460>:  ldr    q1, [sp, #0x240]
    0x100221fc8 <+464>:  ldr    q0, [sp, #0x210]
    0x100221fcc <+468>:  str    q1, [sp, #0x960]
    0x100221fd0 <+472>:  str    q0, [sp, #0x950]
    0x100221fd4 <+476>:  ldr    q0, [sp, #0x960]
    0x100221fd8 <+480>:  ldr    q1, [sp, #0x950]
    0x100221fdc <+484>:  add.16b v0, v0, v1
    0x100221fe0 <+488>:  str    q0, [sp, #0x940]
    0x100221fe4 <+492>:  ldr    q0, [sp, #0x940]
    0x100221fe8 <+496>:  str    q0, [sp, #0x1c0]
    0x100221fec <+500>:  ldr    x9, [sp, #0x2b8]
    0x100221ff0 <+504>:  ldr    q0, [x9, #0x20]
    0x100221ff4 <+508>:  str    q0, [sp, #0x190]
    0x100221ff8 <+512>:  ldr    q0, [sp, #0x1c0]
    0x100221ffc <+516>:  str    q0, [sp, #0x180]
    0x100222000 <+520>:  ldr    q0, [sp, #0x190]
    0x100222004 <+524>:  ldr    q1, [sp, #0x180]
    0x100222008 <+528>:  ext.16b v0, v0, v1, #0xe
    0x10022200c <+532>:  str    q0, [sp, #0x1a0]
    0x100222010 <+536>:  ldr    q0, [sp, #0x1a0]
    0x100222014 <+540>:  str    q0, [sp, #0x170]
    0x100222018 <+544>:  ldr    q0, [sp, #0x170]
    0x10022201c <+548>:  str    q0, [sp, #0x6d0]
    0x100222020 <+552>:  ldr    q0, [sp, #0x6d0]
    0x100222024 <+556>:  str    q0, [sp, #0x6c0]
    0x100222028 <+560>:  ldr    q1, [sp, #0x6c0]
    0x10022202c <+564>:  mov    w9, #0x2
    0x100222030 <+568>:  strb   w9, [sp, #0x81f]
    0x100222034 <+572>:  add    x9, sp, #0x81f
    0x100222038 <+576>:  ld1r.16b { v0 }, [x9]
    0x10022203c <+580>:  str    q0, [sp, #0x7f0]
    0x100222040 <+584>:  ldr    q0, [sp, #0x7f0]
    0x100222044 <+588>:  str    q0, [sp, #0x800]
    0x100222048 <+592>:  ldr    q0, [sp, #0x800]
    0x10022204c <+596>:  str    q1, [sp, #0x780]
    0x100222050 <+600>:  str    q0, [sp, #0x770]
    0x100222054 <+604>:  ldr    q0, [sp, #0x780]
    0x100222058 <+608>:  ldr    q1, [sp, #0x770]
    0x10022205c <+612>:  uqsub.16b v0, v0, v1
    0x100222060 <+616>:  str    q0, [sp, #0x760]
    0x100222064 <+620>:  ldr    q0, [sp, #0x760]
    0x100222068 <+624>:  str    q0, [sp, #0x4f0]
    0x10022206c <+628>:  ldr    q0, [sp, #0x4f0]
    0x100222070 <+632>:  str    q0, [sp, #0x4e0]
    0x100222074 <+636>:  ldr    q0, [sp, #0x4e0]
    0x100222078 <+640>:  str    q0, [sp, #0x1b0]
    0x10022207c <+644>:  ldr    q1, [sp, #0x1c0]
    0x100222080 <+648>:  ldr    q0, [sp, #0x1b0]
    0x100222084 <+652>:  str    q1, [sp, #0x930]
    0x100222088 <+656>:  str    q0, [sp, #0x920]
    0x10022208c <+660>:  ldr    q0, [sp, #0x930]
    0x100222090 <+664>:  ldr    q1, [sp, #0x920]
    0x100222094 <+668>:  add.16b v0, v0, v1
    0x100222098 <+672>:  str    q0, [sp, #0x910]
    0x10022209c <+676>:  ldr    q0, [sp, #0x910]
    0x1002220a0 <+680>:  str    q0, [sp, #0x2a0]
    0x1002220a4 <+684>:  ldr    q1, [sp, #0x2a0]
    0x1002220a8 <+688>:  ldr    q0, [sp, #0x240]
    0x1002220ac <+692>:  str    q1, [sp, #0x670]
    0x1002220b0 <+696>:  str    q0, [sp, #0x660]
    0x1002220b4 <+700>:  ldr    q0, [sp, #0x670]
    0x1002220b8 <+704>:  ldr    q1, [sp, #0x660]
    0x1002220bc <+708>:  cmgt.16b v0, v0, v1
    0x1002220c0 <+712>:  str    q0, [sp, #0x650]
    0x1002220c4 <+716>:  ldr    q1, [sp, #0x650]
    0x1002220c8 <+720>:  ldr    q2, [sp, #0x240]
    0x1002220cc <+724>:  mov    w9, #0x0
    0x1002220d0 <+728>:  strb   w9, [sp, #0x3bf]
    0x1002220d4 <+732>:  add    x9, sp, #0x3bf
    0x1002220d8 <+736>:  ld1r.16b { v0 }, [x9]
    0x1002220dc <+740>:  str    q0, [sp, #0x390]
    0x1002220e0 <+744>:  ldr    q0, [sp, #0x390]
    0x1002220e4 <+748>:  str    q0, [sp, #0x3a0]
    0x1002220e8 <+752>:  ldr    q0, [sp, #0x3a0]
    0x1002220ec <+756>:  str    q2, [sp, #0x640]
    0x1002220f0 <+760>:  str    q0, [sp, #0x630]
    0x1002220f4 <+764>:  ldr    q0, [sp, #0x640]
    0x1002220f8 <+768>:  ldr    q2, [sp, #0x630]
    0x1002220fc <+772>:  cmgt.16b v0, v0, v2
    0x100222100 <+776>:  str    q0, [sp, #0x620]
    0x100222104 <+780>:  ldr    q0, [sp, #0x620]
    0x100222108 <+784>:  str    q1, [sp, #0x990]
    0x10022210c <+788>:  str    q0, [sp, #0x980]
    0x100222110 <+792>:  ldr    q0, [sp, #0x990]
    0x100222114 <+796>:  ldr    q1, [sp, #0x980]
    0x100222118 <+800>:  cmeq.16b v0, v0, v1
    0x10022211c <+804>:  str    q0, [sp, #0x970]
    0x100222120 <+808>:  ldr    q0, [sp, #0x970]
    0x100222124 <+812>:  str    q0, [sp, #0x160]
    0x100222128 <+816>:  ldr    x9, [sp, #0x2b0]
    0x10022212c <+820>:  ldr    q1, [x9]
    0x100222130 <+824>:  ldr    q0, [sp, #0x160]
    0x100222134 <+828>:  str    q0, [sp, #0x4d0]
    0x100222138 <+832>:  ldr    q0, [sp, #0x4d0]
    0x10022213c <+836>:  str    q0, [sp, #0x4c0]
    0x100222140 <+840>:  ldr    q0, [sp, #0x4c0]
    0x100222144 <+844>:  str    q1, [sp, #0x440]
    0x100222148 <+848>:  str    q0, [sp, #0x430]
    0x10022214c <+852>:  ldr    q0, [sp, #0x440]
    0x100222150 <+856>:  ldr    q1, [sp, #0x430]
    0x100222154 <+860>:  orr.16b v0, v0, v1
    0x100222158 <+864>:  str    q0, [sp, #0x420]
    0x10022215c <+868>:  ldr    q0, [sp, #0x420]
    0x100222160 <+872>:  ldr    x9, [sp, #0x2b0]
    0x100222164 <+876>:  str    q0, [x9]
    0x100222168 <+880>:  ldr    x9, [sp, #0x2b8]
    0x10022216c <+884>:  ldr    q0, [x9]
    0x100222170 <+888>:  str    q0, [sp, #0x130]
    0x100222174 <+892>:  ldr    q0, [sp, #0x280]
    0x100222178 <+896>:  str    q0, [sp, #0x120]
    0x10022217c <+900>:  ldr    q0, [sp, #0x130]
    0x100222180 <+904>:  ldr    q1, [sp, #0x120]
    0x100222184 <+908>:  ext.16b v0, v0, v1, #0xf
    0x100222188 <+912>:  str    q0, [sp, #0x140]
    0x10022218c <+916>:  ldr    q0, [sp, #0x140]
    0x100222190 <+920>:  str    q0, [sp, #0x110]
    0x100222194 <+924>:  ldr    q0, [sp, #0x110]
    0x100222198 <+928>:  str    q0, [sp, #0x150]
    0x10022219c <+932>:  ldr    q1, [sp, #0x150]
    0x1002221a0 <+936>:  mov    w9, #0xed
    0x1002221a4 <+940>:  strb   w9, [sp, #0x38f]
    0x1002221a8 <+944>:  add    x9, sp, #0x38f
    0x1002221ac <+948>:  ld1r.16b { v0 }, [x9]
    0x1002221b0 <+952>:  str    q0, [sp, #0x360]
    0x1002221b4 <+956>:  ldr    q0, [sp, #0x360]
    0x1002221b8 <+960>:  str    q0, [sp, #0x370]
    0x1002221bc <+964>:  ldr    q0, [sp, #0x370]
    0x1002221c0 <+968>:  str    q1, [sp, #0x9f0]
    0x1002221c4 <+972>:  str    q0, [sp, #0x9e0]
    0x1002221c8 <+976>:  ldr    q0, [sp, #0x9f0]
    0x1002221cc <+980>:  ldr    q1, [sp, #0x9e0]
    0x1002221d0 <+984>:  cmeq.16b v0, v0, v1
    0x1002221d4 <+988>:  str    q0, [sp, #0x9d0]
    0x1002221d8 <+992>:  ldr    q0, [sp, #0x9d0]
    0x1002221dc <+996>:  str    q0, [sp, #0x100]
    0x1002221e0 <+1000>: ldr    q1, [sp, #0x150]
    0x1002221e4 <+1004>: strb   w8, [sp, #0x35f]
    0x1002221e8 <+1008>: add    x8, sp, #0x35f
    0x1002221ec <+1012>: ld1r.16b { v0 }, [x8]
    0x1002221f0 <+1016>: str    q0, [sp, #0x330]
    0x1002221f4 <+1020>: ldr    q0, [sp, #0x330]
    0x1002221f8 <+1024>: str    q0, [sp, #0x340]
    0x1002221fc <+1028>: ldr    q0, [sp, #0x340]
    0x100222200 <+1032>: str    q1, [sp, #0x9c0]
    0x100222204 <+1036>: str    q0, [sp, #0x9b0]
    0x100222208 <+1040>: ldr    q0, [sp, #0x9c0]
    0x10022220c <+1044>: ldr    q1, [sp, #0x9b0]
    0x100222210 <+1048>: cmeq.16b v0, v0, v1
    0x100222214 <+1052>: str    q0, [sp, #0x9a0]
    0x100222218 <+1056>: ldr    q0, [sp, #0x9a0]
    0x10022221c <+1060>: str    q0, [sp, #0xf0]
    0x100222220 <+1064>: ldr    q1, [sp, #0x2c0]
    0x100222224 <+1068>: mov    w8, #0x9f
    0x100222228 <+1072>: strb   w8, [sp, #0x32f]
    0x10022222c <+1076>: add    x8, sp, #0x32f
    0x100222230 <+1080>: ld1r.16b { v0 }, [x8]
    0x100222234 <+1084>: str    q0, [sp, #0x300]
    0x100222238 <+1088>: ldr    q0, [sp, #0x300]
    0x10022223c <+1092>: str    q0, [sp, #0x310]
    0x100222240 <+1096>: ldr    q0, [sp, #0x310]
    0x100222244 <+1100>: str    q1, [sp, #0x610]
    0x100222248 <+1104>: str    q0, [sp, #0x600]
    0x10022224c <+1108>: ldr    q0, [sp, #0x610]
    0x100222250 <+1112>: ldr    q1, [sp, #0x600]
    0x100222254 <+1116>: cmgt.16b v0, v0, v1
    0x100222258 <+1120>: str    q0, [sp, #0x5f0]
    0x10022225c <+1124>: ldr    q1, [sp, #0x5f0]
    0x100222260 <+1128>: ldr    q0, [sp, #0x100]
    0x100222264 <+1132>: str    q1, [sp, #0xa80]
    0x100222268 <+1136>: str    q0, [sp, #0xa70]
    0x10022226c <+1140>: ldr    q0, [sp, #0xa80]
    0x100222270 <+1144>: ldr    q1, [sp, #0xa70]
    0x100222274 <+1148>: and.16b v0, v0, v1
    0x100222278 <+1152>: str    q0, [sp, #0xa60]
    0x10022227c <+1156>: ldr    q0, [sp, #0xa60]
    0x100222280 <+1160>: str    q0, [sp, #0xe0]
    0x100222284 <+1164>: ldr    q1, [sp, #0x2c0]
    0x100222288 <+1168>: mov    w8, #0x8f
    0x10022228c <+1172>: strb   w8, [sp, #0x2ff]
    0x100222290 <+1176>: add    x8, sp, #0x2ff
    0x100222294 <+1180>: ld1r.16b { v0 }, [x8]
    0x100222298 <+1184>: str    q0, [sp, #0x2d0]
    0x10022229c <+1188>: ldr    q0, [sp, #0x2d0]
    0x1002222a0 <+1192>: str    q0, [sp, #0x2e0]
    0x1002222a4 <+1196>: ldr    q0, [sp, #0x2e0]
    0x1002222a8 <+1200>: str    q1, [sp, #0x5e0]
    0x1002222ac <+1204>: str    q0, [sp, #0x5d0]
    0x1002222b0 <+1208>: ldr    q0, [sp, #0x5e0]
    0x1002222b4 <+1212>: ldr    q1, [sp, #0x5d0]
    0x1002222b8 <+1216>: cmgt.16b v0, v0, v1
    0x1002222bc <+1220>: str    q0, [sp, #0x5c0]
    0x1002222c0 <+1224>: ldr    q1, [sp, #0x5c0]
    0x1002222c4 <+1228>: ldr    q0, [sp, #0xf0]
    0x1002222c8 <+1232>: str    q1, [sp, #0xa50]
    0x1002222cc <+1236>: str    q0, [sp, #0xa40]
    0x1002222d0 <+1240>: ldr    q0, [sp, #0xa50]
    0x1002222d4 <+1244>: ldr    q1, [sp, #0xa40]
    0x1002222d8 <+1248>: and.16b v0, v0, v1
    0x1002222dc <+1252>: str    q0, [sp, #0xa30]
    0x1002222e0 <+1256>: ldr    q0, [sp, #0xa30]
    0x1002222e4 <+1260>: str    q0, [sp, #0xd0]
    0x1002222e8 <+1264>: ldr    x8, [sp, #0x2b0]
    0x1002222ec <+1268>: ldr    q1, [x8]
    0x1002222f0 <+1272>: ldr    q2, [sp, #0xe0]
    0x1002222f4 <+1276>: ldr    q0, [sp, #0xd0]
    0x1002222f8 <+1280>: str    q2, [sp, #0xab0]
    0x1002222fc <+1284>: str    q0, [sp, #0xaa0]
    0x100222300 <+1288>: ldr    q0, [sp, #0xab0]
    0x100222304 <+1292>: ldr    q2, [sp, #0xaa0]
    0x100222308 <+1296>: orr.16b v0, v0, v2
    0x10022230c <+1300>: str    q0, [sp, #0xa90]
    0x100222310 <+1304>: ldr    q0, [sp, #0xa90]
    0x100222314 <+1308>: str    q0, [sp, #0x4b0]
    0x100222318 <+1312>: ldr    q0, [sp, #0x4b0]
    0x10022231c <+1316>: str    q0, [sp, #0x4a0]
    0x100222320 <+1320>: ldr    q0, [sp, #0x4a0]
    0x100222324 <+1324>: str    q1, [sp, #0x410]
    0x100222328 <+1328>: str    q0, [sp, #0x400]
    0x10022232c <+1332>: ldr    q0, [sp, #0x410]
    0x100222330 <+1336>: ldr    q1, [sp, #0x400]
    0x100222334 <+1340>: orr.16b v0, v0, v1
    0x100222338 <+1344>: str    q0, [sp, #0x3f0]
    0x10022233c <+1348>: ldr    q0, [sp, #0x3f0]
    0x100222340 <+1352>: ldr    x8, [sp, #0x2b0]
    0x100222344 <+1356>: str    q0, [x8]
    0x100222348 <+1360>: ldr    x8, [sp, #0x2b8]
    0x10022234c <+1364>: ldr    q0, [x8, #0x10]
    0x100222350 <+1368>: str    q0, [sp, #0xa0]
    0x100222354 <+1372>: ldr    q0, [sp, #0x290]
    0x100222358 <+1376>: str    q0, [sp, #0x90]
    0x10022235c <+1380>: ldr    q0, [sp, #0xa0]
    0x100222360 <+1384>: ldr    q1, [sp, #0x90]
    0x100222364 <+1388>: ext.16b v0, v0, v1, #0xf
    0x100222368 <+1392>: str    q0, [sp, #0xb0]
    0x10022236c <+1396>: ldr    q0, [sp, #0xb0]
    0x100222370 <+1400>: str    q0, [sp, #0x80]
    0x100222374 <+1404>: ldr    q0, [sp, #0x80]
    0x100222378 <+1408>: str    q0, [sp, #0xc0]
    0x10022237c <+1412>: adrp   x8, 3565
    0x100222380 <+1416>: add    x8, x8, #0x280            ; neon_check_utf8_bytes._initial_mins
    0x100222384 <+1420>: ldr    q0, [x8]
    0x100222388 <+1424>: str    q0, [sp, #0x60]
    0x10022238c <+1428>: ldr    q0, [sp, #0x60]
    0x100222390 <+1432>: str    q0, [sp, #0x50]
    0x100222394 <+1436>: ldr    q1, [sp, #0x50]
    0x100222398 <+1440>: ldr    q0, [sp, #0xc0]
    0x10022239c <+1444>: str    q0, [sp, #0x6b0]
    0x1002223a0 <+1448>: ldr    q0, [sp, #0x6b0]
    0x1002223a4 <+1452>: str    q0, [sp, #0x6a0]
    0x1002223a8 <+1456>: ldr    q0, [sp, #0x6a0]
    0x1002223ac <+1460>: str    q1, [sp, #0x8d0]
    0x1002223b0 <+1464>: str    q0, [sp, #0x8c0]
    0x1002223b4 <+1468>: ldr    q0, [sp, #0x8d0]
    0x1002223b8 <+1472>: ldr    q1, [sp, #0x8c0]
    0x1002223bc <+1476>: tbl.16b v0, { v0 }, v1
    0x1002223c0 <+1480>: str    q0, [sp, #0x8b0]
    0x1002223c4 <+1484>: ldr    q0, [sp, #0x8b0]
    0x1002223c8 <+1488>: str    q0, [sp, #0x70]
    0x1002223cc <+1492>: ldr    q1, [sp, #0x70]
    0x1002223d0 <+1496>: ldr    q0, [sp, #0x150]
    0x1002223d4 <+1500>: str    q1, [sp, #0x5b0]
    0x1002223d8 <+1504>: str    q0, [sp, #0x5a0]
    0x1002223dc <+1508>: ldr    q0, [sp, #0x5b0]
    0x1002223e0 <+1512>: ldr    q1, [sp, #0x5a0]
    0x1002223e4 <+1516>: cmgt.16b v0, v0, v1
    0x1002223e8 <+1520>: str    q0, [sp, #0x590]
    0x1002223ec <+1524>: ldr    q0, [sp, #0x590]
    0x1002223f0 <+1528>: str    q0, [sp, #0x40]
    0x1002223f4 <+1532>: adrp   x8, 3565
    0x1002223f8 <+1536>: add    x8, x8, #0x290            ; neon_check_utf8_bytes._second_mins
    0x1002223fc <+1540>: ldr    q0, [x8]
    0x100222400 <+1544>: str    q0, [sp, #0x20]
    0x100222404 <+1548>: ldr    q0, [sp, #0x20]
    0x100222408 <+1552>: str    q0, [sp, #0x10]
    0x10022240c <+1556>: ldr    q1, [sp, #0x10]
    0x100222410 <+1560>: ldr    q0, [sp, #0xc0]
    0x100222414 <+1564>: str    q0, [sp, #0x690]
    0x100222418 <+1568>: ldr    q0, [sp, #0x690]
    0x10022241c <+1572>: str    q0, [sp, #0x680]
    0x100222420 <+1576>: ldr    q0, [sp, #0x680]
    0x100222424 <+1580>: str    q1, [sp, #0x8a0]
    0x100222428 <+1584>: str    q0, [sp, #0x890]
    0x10022242c <+1588>: ldr    q0, [sp, #0x8a0]
    0x100222430 <+1592>: ldr    q1, [sp, #0x890]
    0x100222434 <+1596>: tbl.16b v0, { v0 }, v1
    0x100222438 <+1600>: str    q0, [sp, #0x880]
    0x10022243c <+1604>: ldr    q0, [sp, #0x880]
    0x100222440 <+1608>: str    q0, [sp, #0x30]
    0x100222444 <+1612>: ldr    q1, [sp, #0x30]
    0x100222448 <+1616>: ldr    q0, [sp, #0x2c0]
    0x10022244c <+1620>: str    q1, [sp, #0x580]
    0x100222450 <+1624>: str    q0, [sp, #0x570]
    0x100222454 <+1628>: ldr    q0, [sp, #0x580]
    0x100222458 <+1632>: ldr    q1, [sp, #0x570]
    0x10022245c <+1636>: cmgt.16b v0, v0, v1
    0x100222460 <+1640>: str    q0, [sp, #0x560]
    0x100222464 <+1644>: ldr    q0, [sp, #0x560]
    0x100222468 <+1648>: str    q0, [sp]
    0x10022246c <+1652>: ldr    x8, [sp, #0x2b0]
    0x100222470 <+1656>: ldr    q1, [x8]
    0x100222474 <+1660>: ldr    q2, [sp, #0x40]
    0x100222478 <+1664>: ldr    q0, [sp]
    0x10022247c <+1668>: str    q2, [sp, #0xa20]
    0x100222480 <+1672>: str    q0, [sp, #0xa10]
    0x100222484 <+1676>: ldr    q0, [sp, #0xa20]
    0x100222488 <+1680>: ldr    q2, [sp, #0xa10]
    0x10022248c <+1684>: and.16b v0, v0, v2
    0x100222490 <+1688>: str    q0, [sp, #0xa00]
    0x100222494 <+1692>: ldr    q0, [sp, #0xa00]
    0x100222498 <+1696>: str    q0, [sp, #0x490]
    0x10022249c <+1700>: ldr    q0, [sp, #0x490]
    0x1002224a0 <+1704>: str    q0, [sp, #0x480]
    0x1002224a4 <+1708>: ldr    q0, [sp, #0x480]
    0x1002224a8 <+1712>: str    q1, [sp, #0x3e0]
    0x1002224ac <+1716>: str    q0, [sp, #0x3d0]
    0x1002224b0 <+1720>: ldr    q0, [sp, #0x3e0]
    0x1002224b4 <+1724>: ldr    q1, [sp, #0x3d0]
    0x1002224b8 <+1728>: orr.16b v0, v0, v1
    0x1002224bc <+1732>: str    q0, [sp, #0x3c0]
    0x1002224c0 <+1736>: ldr    q0, [sp, #0x3c0]
    0x1002224c4 <+1740>: ldr    x8, [sp, #0x2b0]
    0x1002224c8 <+1744>: str    q0, [x8]
    0x1002224cc <+1748>: ldr    x8, [sp, #0x2b8]
    0x1002224d0 <+1752>: ldr    q0, [sp, #0x280]
    0x1002224d4 <+1756>: ldr    q1, [sp, #0x290]
    0x1002224d8 <+1760>: ldr    q2, [sp, #0x2a0]
    0x1002224dc <+1764>: str    q2, [x8, #0x20]
    0x1002224e0 <+1768>: str    q1, [x8, #0x10]
    0x1002224e4 <+1772>: str    q0, [x8]
    0x1002224e8 <+1776>: add    sp, sp, #0xac0
    0x1002224ec <+1780>: ldp    x28, x27, [sp], #0x10
    0x1002224f0 <+1784>: ret

@alexdowad
Contributor

@youkidearitai If you are no longer interested in this problem, that is fine; this is OSS and no developer is 'forced' to work on something they don't want to work on.

However, if you are still interested in this problem, but are closing the PR because figuring out what is going on seems difficult, I would encourage you to give it more time. Technical challenges which seem overwhelming can often be overcome if one is persistent and tries different approaches until something works. And if you do so, you may find that you learn a lot in the process.

Up to you either way.

@youkidearitai
Contributor Author

@alexdowad Thank you very much. I don't want to give up.

> Up to you either way.

Thanks again. Please give me a little more time.

@youkidearitai reopened this Apr 15, 2023
@alexdowad
Contributor

@youkidearitai I'm glad to hear you are hoping to give this PR more time. I think it will be very valuable.

As mentioned above, I can't directly help you on this, but I can give you ideas of what to try.

I think a good first step would be to compile @cyb70289's benchmarking code on your machine, run it, and see if you get results comparable to theirs or not.

@youkidearitai
Contributor Author

@alexdowad Thanks for the advice! I ran the benchmarks below.

Base code: https://gist.github.com/youkidearitai/7cd8771f6f6e40e21708129707b40204

(master is 5823955)

I think the M1 Mac is fast to begin with; the optimization is particularly effective on the Raspberry Pi SoC (2.4x faster).

M1 macOS

master

MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
time: 914805542
MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
time: 748921167
MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
time: 756241666
MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
time: 747524042
MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
time: 751337333

average is 783765950

neonutf8

MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
time: 620495000
MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
time: 414064041
MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
time: 419568708
MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
time: 427726542
MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
time: 426510834

average is 461673025

Result

master / neonutf8 is 1.69766459714643, so neonutf8 is about 1.7x faster.
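
The averages and the ratio for the M1 runs can be rechecked with a few lines of C. This is only a sketch: the helper names `mean_time` and `speedup` are made up for illustration, and the arrays simply copy the ten M1 timings listed above.

```c
#include <stddef.h>
#include <stdint.h>

/* M1 timings copied from the runs above (units as printed by the benchmark). */
static const uint64_t m1_master[5] = {
    914805542ULL, 748921167ULL, 756241666ULL, 747524042ULL, 751337333ULL
};
static const uint64_t m1_neonutf8[5] = {
    620495000ULL, 414064041ULL, 419568708ULL, 427726542ULL, 426510834ULL
};

/* Integer mean of n timings (truncates any fractional part). */
static uint64_t mean_time(const uint64_t *t, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += t[i];
    }
    return n ? sum / n : 0;
}

/* Speedup of the candidate over the baseline: baseline time / candidate time. */
static double speedup(uint64_t baseline, uint64_t candidate)
{
    return (double)baseline / (double)candidate;
}
```

Calling `speedup(mean_time(m1_master, 5), mean_time(m1_neonutf8, 5))` divides the master average by the neonutf8 average, which is the direction that makes "x times faster" meaningful for the new code.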

Raspbian on Raspberry Pi 4B+

master

tekimen@raspberrypi:~/src/youkidearitai_php-src $ sapi/cli/php long-utf-8-bench.php
time: 4569498667
tekimen@raspberrypi:~/src/youkidearitai_php-src $ sapi/cli/php long-utf-8-bench.php
time: 4575386789
tekimen@raspberrypi:~/src/youkidearitai_php-src $ sapi/cli/php long-utf-8-bench.php
time: 4573909173
tekimen@raspberrypi:~/src/youkidearitai_php-src $ sapi/cli/php long-utf-8-bench.php
time: 4573393812
tekimen@raspberrypi:~/src/youkidearitai_php-src $ sapi/cli/php long-utf-8-bench.php
time: 4577176635

neonutf8

tekimen@raspberrypi:~/src/youkidearitai_php-src $ sapi/cli/php long-utf-8-bench.php
time: 1690266904
tekimen@raspberrypi:~/src/youkidearitai_php-src $ sapi/cli/php long-utf-8-bench.php
time: 1689829703
tekimen@raspberrypi:~/src/youkidearitai_php-src $ sapi/cli/php long-utf-8-bench.php
time: 1697861673
tekimen@raspberrypi:~/src/youkidearitai_php-src $ sapi/cli/php long-utf-8-bench.php
time: 1690421242
tekimen@raspberrypi:~/src/youkidearitai_php-src $ sapi/cli/php long-utf-8-bench.php
time: 1689980379

Result

master / neonutf8 is 2.41915824648452x faster.

@youkidearitai
Contributor Author

I'm reading the Arm Neon Intrinsics Reference (PDF file) to study this further.

@alexdowad
Contributor

@youkidearitai So if I have this right, it looks like your Mac M1 was able to process 1.4GB (14000 byte file repeated 100,000 times) of UTF-8 text in 461ms. Is that right? If so, your computer was able to process 2896MB/sec (1024 * 1024 bytes / MB).
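
The throughput estimate above can be reproduced with a small helper. This is a sketch that assumes the benchmark timings are in nanoseconds; the function name `throughput_mib_per_sec` is hypothetical.

```c
#include <stdint.h>

/* Bytes processed per second, expressed in units of 1024 * 1024 bytes.
 * Assumes elapsed_ns is a duration in nanoseconds. */
static double throughput_mib_per_sec(uint64_t bytes, uint64_t elapsed_ns)
{
    double seconds = (double)elapsed_ns / 1e9;
    return ((double)bytes / (1024.0 * 1024.0)) / seconds;
}
```

With 14000 * 100000 bytes and 461000000 ns, this gives a value just above 2896, in line with the figure quoted above.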

Looks like your computer is a lot faster than @cyb70289's.

@cyb70289 found that his 'range2' NEON code was 2.8 times faster than his 'naive' code. But maybe our scalar validation function might be faster than that 'naive' one.

When I have a bit of time I may try benchmarking to see whether that is true. If so, it would explain the difference between your results and @cyb70289's results.

@cyb70289

Apple M1 is indeed much faster than the machines I used.

@alexdowad
Contributor

> Apple M1 is indeed much faster than the machines I used.

Indeed, thanks for the comment.
We are just wondering if we are using your code correctly, since the performance boost from our existing scalar code seems smaller than expected. It could be that we are doing something wrong and this is reducing performance.

@cyb70289

I'm not sure of your use case. I just want to mention that if the strings to be verified contain mostly ASCII chars, with few multi-byte chars, this library is not a good fit. It may hurt performance.

@youkidearitai
Contributor Author

@cyb70289 Thanks for the comment. I used your code as a reference. Thank you again.

> if the strings to be verified contain mostly ascii chars

I will consider adding logic to detect whether the input is ASCII.

> When I have a bit of time I may try benchmarking to see whether that is true. If so, it would explain the difference between your results and @cyb70289's results.

@alexdowad Okay, I'll try to benchmark.

If every byte in a register (16 bytes) is no greater than 0x7F, the block is assumed to be ASCII.
@youkidearitai force-pushed the neonutf8 branch 3 times, most recently from 40ecd32 to 7101d42 (April 18, 2023 17:36)
@youkidearitai
Contributor Author

The ASCII logic is now included, but it is 2:30 JST here. I will benchmark further after some sleep.

$ sapi/cli/php long-utf-8-bench.php
bool(true)
time: 461009833
MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
bool(true)
time: 334384125
MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
bool(true)
time: 331224667
MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
bool(true)
time: 326196375
MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
bool(true)
time: 320748542
MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php long-utf-8-bench.php
bool(true)
time: 322179833

If all bytes in the SIMD register are no greater than 0x7F, reset the
previous-state struct.
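
The ASCII shortcut described in the commit messages above can be sketched in portable C. This is not the PR's NEON code: it is an illustration that tests the high bit of 16 bytes at once using two 64-bit words, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if all 16 bytes at p are ASCII, i.e. no byte has its high bit set. */
static int block16_is_ascii(const uint8_t *p)
{
    uint64_t lo, hi;
    /* memcpy avoids unaligned-access and strict-aliasing problems. */
    memcpy(&lo, p, 8);
    memcpy(&hi, p + 8, 8);
    return ((lo | hi) & UINT64_C(0x8080808080808080)) == 0;
}

/* Returns how many leading bytes of buf are covered by all-ASCII 16-byte
 * blocks. A full validator would continue from this offset. */
static size_t skip_ascii_blocks(const uint8_t *buf, size_t len)
{
    size_t i = 0;
    while (len - i >= 16 && block16_is_ascii(buf + i)) {
        i += 16;
    }
    return i;
}
```

A block that passes this test cannot start or continue a multi-byte sequence, which is why carried-over state from the previous block can be reset once any pending sequence has been checked.
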
@alexdowad
Contributor

@cyb70289 From the git commit logs in your repository, it looks like you and @easyaspi314 are the authors of the NEON-accelerated range2 UTF-8 validation implementation.

A question, please... in order for more people to benefit from your work, would you be willing to give permission for your code to be incorporated in the PHP codebase and distributed under the PHP license? I don't expect that you will allow this, but if you do, it would be appreciated. (Of course, code comments would be included identifying you as the authors and pointing readers to the original code repository.)

@cyb70289

@alexdowad, I'm glad you find my utf8 library useful. It's okay to use it in PHP under the PHP license.

@youkidearitai
Contributor Author

Investigation and my opinion.

One use case for mb_check_encoding is checking that the "name" field of a contact form is valid UTF-8.
For example, my real name is 濱田侑弥 (濱田 (Hamada) is the family name, 侑弥 (Yuya) is the given name), which is 12 bytes long.
So the input is often only a small number of bytes when validating a name entered in a contact form.
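As a small illustration of the byte count above: each of the four characters in that name lies between U+0800 and U+FFFF, so each takes 3 bytes in UTF-8, giving 12 bytes in total. The sketch below assumes the source file is saved as UTF-8; `utf8_codepoints` is a hypothetical helper name, not something from this PR.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count code points in an already-valid UTF-8 string by skipping
 * continuation bytes (bit pattern 10xxxxxx).
 * Hypothetical helper for illustration only.
 */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* Byte length of the example name: 4 characters x 3 bytes each. */
static size_t example_name_bytes(void)
{
    return strlen("濱田侑弥");
}
```

Here `example_name_bytes()` is below the 16-byte width of a single NEON register, which is the point being made about short inputs.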

The weak point of Lemire's UTF-8 validation is input shorter than 16 bytes, but I think real use cases often involve such short input.
@alexdowad's SSE2 validation handles this with a "dirty hack", but I could not find a similar intrinsic for NEON, so I used memset and memcpy, which are slow.
The range validation falls back to the naive implementation when the input is shorter than 16 bytes, and its performance is not too bad.

Therefore, I want to use the range algorithm.

@cyb70289

Just FYI, simdjson (also from Lemire) implements a UTF-8 validation that is said to be much faster than other libraries.
https://github.com/simdjson/simdjson/blob/master/doc/basics.md#utf-8-validation-alone
I haven't evaluated it, but it might be worth a try.

@alexdowad
Contributor

@cyb70289 Thanks for pointing us to simdjson. This is the actual implementation of UTF-8 validation in that library: https://github.com/simdjson/simdjson/blob/d4ac1b51d0aeb2d4f792136fe7792de709006afa/src/generic/stage1/utf8_lookup4_algorithm.h

It's using Lemire's algorithm, same as simdutf.

If someone is interested in benchmarking it, that might be interesting, but (at the moment) I don't see any reason to suspect that it will be faster than your implementation of Lemire's algorithm or your range/range2 algorithms.

@alexdowad
Contributor

Investigation and my opinion.

Thanks for those good points. Please note that you can easily add an if clause to any SIMD function to make it fall back to the scalar version if the input is small. However, if you prefer using range rather than range2, that is fine.
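A minimal sketch of that fallback, for illustration only. The names `utf8_valid`, `utf8_valid_scalar`, `utf8_valid_simd` and the threshold `SIMD_MIN_LEN` are assumptions made up for this example, and the "SIMD" function is just a stand-in that calls the scalar code so the sketch builds on any platform.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIMD_MIN_LEN 16 /* assumed threshold: one 16-byte NEON register */

/* Scalar validator: rejects overlong forms, surrogates and values above U+10FFFF. */
static bool utf8_valid_scalar(const uint8_t *p, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint8_t b = p[i];
        size_t need;                  /* continuation bytes expected */
        uint8_t lo = 0x80, hi = 0xBF; /* allowed range of the first continuation byte */

        if (b <= 0x7F) {
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            need = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            need = 2;
            if (b == 0xE0) lo = 0xA0; /* no overlong 3-byte forms */
            if (b == 0xED) hi = 0x9F; /* no UTF-16 surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3;
            if (b == 0xF0) lo = 0x90; /* no overlong 4-byte forms */
            if (b == 0xF4) hi = 0x8F; /* nothing above U+10FFFF */
        } else {
            return false;             /* stray continuation byte or invalid lead byte */
        }

        if (len - i <= need) return false; /* sequence is truncated */
        if (p[i + 1] < lo || p[i + 1] > hi) return false;
        for (size_t k = 2; k <= need; k++) {
            if ((p[i + k] & 0xC0) != 0x80) return false;
        }
        i += need + 1;
    }
    return true;
}

/* Stand-in for the real SIMD validator (hypothetical; reuses the scalar code here). */
static bool utf8_valid_simd(const uint8_t *p, size_t len)
{
    return utf8_valid_scalar(p, len);
}

/* Dispatcher: short inputs skip the SIMD setup cost entirely. */
static bool utf8_valid(const uint8_t *p, size_t len)
{
    if (len < SIMD_MIN_LEN) {
        return utf8_valid_scalar(p, len);
    }
    return utf8_valid_simd(p, len);
}
```

With this shape, a 12-byte name never reaches the SIMD path, so the short-input weakness of a SIMD algorithm stops mattering for that use case.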

@cyb70289 has kindly given permission for his code to be distributed under the PHP license. Both the range and range2 code includes contributions from @easyaspi314, so it would be nice to hear from him/her as well.

In the meantime, @youkidearitai, I would suggest you try importing the range/range2 NEON code (whichever you choose) and start testing.

There is another important issue which needs to be addressed here, but first let's just confirm that everything works fine when range/range2 is imported into mbstring.

@easyaspi314

Both the range and range2 code includes contributions from @easyaspi314, so it would be nice to hear from him/her as well.

I honestly forgot I wrote this code lol 😅

I give my permission to use it.

@youkidearitai
Contributor Author

@cyb70289 @easyaspi314 Thank you for approving the use of the algorithm.
I am implementing the range algorithm and testing it.

@youkidearitai
Contributor Author

Fixed: I tried compiling on a Raspberry Pi 1, which uses the utf8_naive function, and found a missing semicolon.

@youkidearitai
Contributor Author

M1 mac benchmark

neonutf8 branch

MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php utf8-bench.php
bool(true)
time: 6167

master branch

MacBook-Air:youkidearitai-php-src tekimen$ sapi/cli/php utf8-bench.php
bool(true)
time: 48542

48542 / 6167 = 7.8712502026x faster.

Raspberry Pi 4B+ benchmark

neonutf8 branch

tekimen@raspberrypi:~/src/youkidearitai_php-src $ sapi/cli/php utf8_bench.php
bool(true)
time: 58592

master branch

tekimen@raspberrypi:~/src/youkidearitai_php-src $ sapi/cli/php utf8_bench.php
bool(true)
time: 102887

102887 / 58592 = 1.7559905789x faster

@easyaspi314

easyaspi314 commented Apr 29, 2023

On second look at this code (and the original), there is a major problem: there is no short-circuit.

If there is an error at the beginning of a very long string, it would still go through the entire string, forcing a full O(n) check.

Perhaps the vmaxv_u8 check could be in a bigger loop that loops for every 64 or 128 bytes? Checking every 16 bytes is bad, as NEON benefits a lot from loading a bunch of vectors at once thanks to ldp and the ridiculous number of registers, and that is only possible if the compiler can determine that the loop will run a multiple of n times

@youkidearitai
Contributor Author

@easyaspi314 Thank you very much!

Perhaps the vmaxv_u8 check could be in a bigger loop that loops for every 64 or 128 bytes? Checking every 16 bytes is bad, as NEON benefits a lot from loading a bunch of vectors at once thanks to ldp and the ridiculous number of registers, and that is only possible if the compiler can determine that the loop will run a multiple of n times

I read https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/UMAXV--Unsigned-Maximum-across-Vector-?lang=en. Indeed, it seems that this cannot be decided until the code is compiled.

For example, could we use vaddlvq_u8 instead of vmaxv_u8, like below?

if (vaddlvq_u8(vmaxq_u8(error1, error2)) != 0) {
    return false; /* invalid UTF-8 */
}

prev_input = input;
/* omit, next 16 bytes check */

@easyaspi314

easyaspi314 commented Apr 29, 2023

Well, come to think of it, if we are going to be checking it frequently in the main loop, it could be more optimal.

The timing for the reducing instructions is pretty bad so instead of vmaxvq we should use the pairwise vpmaxq and then extract the low 64 bits. This has the most optimal timing.
(However, I could probably micro-optimize it further 😜)

/* Merge the error vectors */
uint8x16_t error = vorrq_u8(error1, error2);
/*
 * Take the max of each adjacent element, selecting the errors (0xFF) into
 * the low 8 elements of the vector. The upper bits are ignored.
 */
uint8x16_t error_paired = vpmaxq_u8(error, error);
/* Extract the raw bit pattern of the low 8 elements. */
uint64_t error_raw = vgetq_lane_u64(vreinterpretq_u64_u8(error_paired), 0);
/* If any bits are nonzero, there is an error. */
if (error_raw != 0) {
    return false;
}

This avoids the pipeline-stalling umaxv.

orr     v0.16b, v0.16b, v1.16b
umaxp   v0.16b, v0.16b, v0.16b
fmov    x0, d0
cbnz    x0, .Lfalse
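For anyone who wants to try the reduction in isolation, here is a self-contained sketch wrapped in a function. The function name `any_error` and the array-based interface are assumptions for this example. The non-AArch64 branch is an addition that is not part of the code above; it exists only so the sketch also builds and behaves the same on other platforms.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Returns true if any byte of either 16-byte error vector is nonzero. */
static bool any_error(const uint8_t error1[16], const uint8_t error2[16])
{
#if defined(__aarch64__)
    /* Merge the two error vectors. */
    uint8x16_t error = vorrq_u8(vld1q_u8(error1), vld1q_u8(error2));
    /* Pairwise max folds all 16 bytes into the low 8 bytes. */
    uint8x16_t paired = vpmaxq_u8(error, error);
    /* Read the low 64 bits as one integer and test it. */
    return vgetq_lane_u64(vreinterpretq_u64_u8(paired), 0) != 0;
#else
    /* Portable fallback (assumption): OR the four 64-bit halves together. */
    uint64_t a[2], b[2];
    memcpy(a, error1, sizeof a);
    memcpy(b, error2, sizeof b);
    return ((a[0] | b[0]) | (a[1] | b[1])) != 0;
#endif
}
```

The pairwise max is enough here because only "zero or nonzero" matters, not which lane held the error.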

Edit: also len should also be size_t or things will explode on strings > 2 GiB.

Use vpmaxq instead of vmaxvq and extract the low 64 bits, which has better
timings and avoids the pipeline-stalling umaxv.
@youkidearitai
Contributor Author

@easyaspi314 Thank you very much for the advice. I pushed your code.

Edit: also len should also be size_t or things will explode on strings > 2 GiB.

Thanks again. I had missed that, and I have fixed it 😂

@easyaspi314

easyaspi314 commented Apr 29, 2023

I did some on-device benchmarking and checking every 64 bytes is about 15% faster on my Tensor G1 (Cortex-X1) with clang-16 -O3. The Cortex-A53 also performs slightly better, but that seems to be solely due to branch overhead - tbl seemingly cannot be dual issued so the ILP of unrolling has no benefit.

It also fixes an endianness bug because some people like to see things burn.

I will make a PR for the original repo later today if you want to take it from there but this is the gist:

#define PROCESS_NEON(num_bytes) \
   do { \
       /* Avoid a dependency on other iterations */ \
       uint8x16_t error1 = vdupq_n_u8(0); \
       uint8x16_t error2 = vdupq_n_u8(0); \
       size_t num_iters = num_bytes / sizeof(uint8x16_t); \
       /* Parse a block of data, marking any errors in error1 and error2 */ \
       for (size_t i = 0; i < num_iters; i++) { \
           (parsing code) \
       } \
       /* Check the error flags */ \
       (Test error flags) \
   } while (0)

/* How much data to process before checking the error flag. */
size_t block_size = 4 * sizeof(uint8x16_t); /* 64 bytes */

/* Process 64 bytes at a time */
while (len >= block_size) {
    PROCESS_NEON(block_size);
}
/* Process the remaining data */
if (len >= sizeof(uint8x16_t)) {
    PROCESS_NEON(len);
}

/* Check if in the middle of a sequence */
if (len) {
    const int8_t *token = (const int8_t *)(data - 3);
    size_t lookahead = 0;
    if (token[2] > (int8_t)0xBF) {
        lookahead = 1;
    } else if (token[1] > (int8_t)0xBF) {
        lookahead = 2;
    } else if (token[0] > (int8_t)0xBF) {
        lookahead = 3;
    }

    data -= lookahead;
    len += lookahead;
}
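The control flow of that gist (accumulate errors across a block, test the flag once per block, stop at the first bad block) can be sketched in portable C. To keep it short, the per-byte check below is a simplified stand-in that only detects bytes which can never appear in UTF-8 (0xC0, 0xC1, 0xF5 to 0xFF); it is not the range algorithm. The names `byte_error`, `has_forbidden_byte` and the `flag_tests` counter are assumptions for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 64 /* bytes processed between error-flag checks */

/* Stand-in check: 1 for bytes that never occur in UTF-8, else 0. */
static uint8_t byte_error(uint8_t b)
{
    return (uint8_t)((b == 0xC0) | (b == 0xC1) | (b >= 0xF5));
}

/*
 * Scan `data` block by block. Errors are OR-ed together without branching
 * inside a block, and the flag is tested once per block, so a bad byte near
 * the start ends the scan after the first block.
 * `flag_tests` (optional) receives the number of flag tests performed.
 */
static bool has_forbidden_byte(const uint8_t *data, size_t len, size_t *flag_tests)
{
    size_t tests = 0;
    bool found = false;

    while (len > 0 && !found) {
        size_t n = len < BLOCK_SIZE ? len : BLOCK_SIZE;
        uint8_t error = 0;

        for (size_t i = 0; i < n; i++) {
            error |= byte_error(data[i]);
        }
        tests++;
        found = (error != 0);
        data += n;
        len -= n;
    }
    if (flag_tests != NULL) {
        *flag_tests = tests;
    }
    return found;
}
```

For a 1024-byte buffer with a bad byte at offset 3, the flag is tested once instead of the 16 times a full scan needs, which is the short-circuit behaviour discussed above.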

@easyaspi314

easyaspi314 commented Apr 29, 2023

...

utf8_naive: 2664.8 MiB/s
utf8_range (64 byte blocks): 5560.1 MiB/s
utf8_range (16 byte blocks): 4749.9 MiB/s
simdjson::validate_utf8: 37249.7 MiB/s

Something tells me that is a better option... 😅

Although simdjson is apache 2.0 and written in C++ so that might be a problem.

It also doesn't seem to short circuit but I don't think it needs to at that speed.

@alexdowad
Contributor

...

utf8_naive: 2664.8 MiB/s
utf8_range (64 byte blocks): 5560.1 MiB/s
utf8_range (16 byte blocks): 4749.9 MiB/s
simdjson::validate_utf8: 37249.7 MiB/s

Something tells me that is a better option... 😅

Although simdjson is apache 2.0 and written in C++ so that might be a problem.

It also doesn't seem to short circuit but I don't think it needs to

Wow!!

I guess we need to look more carefully at simdjson and figure out what their secret is.

I gave it a cursory look-over, but it appeared to just be an implementation of the same Lemire algorithm. Not sure what I missed.

@youkidearitai
Contributor Author

youkidearitai commented Apr 29, 2023

Wow...!
I want to know why simdjson is so fast. I'll investigate, although C++ is hard to read...

@youkidearitai
Contributor Author

Memo: I ran simdjson under GDB. It seems to process multiple chunks at a time; could that be the reason it is fast?

3126            } else SIMDJSON_IF_CONSTEXPR (simd8x64<uint8_t>::NUM_CHUNKS == 2) {
3127              this->check_utf8_bytes(input.chunks[0], this->prev_input_block);
3128              this->check_utf8_bytes(input.chunks[1], input.chunks[0]);
3129            } else SIMDJSON_IF_CONSTEXPR (simd8x64<uint8_t>::NUM_CHUNKS == 4) {
3130              this->check_utf8_bytes(input.chunks[0], this->prev_input_block);
3131              this->check_utf8_bytes(input.chunks[1], input.chunks[0]);
3132              this->check_utf8_bytes(input.chunks[2], input.chunks[1]);
3133              this->check_utf8_bytes(input.chunks[3], input.chunks[2]);
3134            }
3135            this->prev_incomplete = is_incomplete(input.chunks[simd8x64<uint8_t>::NUM_CHUNKS-1]);
(gdb) p input
$3 = (const simdjson::arm64::(anonymous namespace)::simd::simd8x64<unsigned char> &) @0x7ffffff140: {static NUM_CHUNKS = <optimized out>,
  chunks = {
    {<simdjson::arm64::(anonymous namespace)::simd::base_u8<unsigned char, simdjson::arm64::(anonymous namespace)::simd::simd8<bool> >> = {
        value = {109, 97, 202, 179, 107, 202, 138, 115, 32, 107, 117, 203, 144, 110, 93, 32},
        static SIZE = <optimized out>}, <No data fields>},
    {<simdjson::arm64::(anonymous namespace)::simd::base_u8<unsigned char, simdjson::arm64::(anonymous namespace)::simd::simd8<bool> >> = {
        value = {60, 104, 116, 116, 112, 58, 47, 47, 119, 119, 119, 46, 99, 108, 46, 99}, static SIZE = <optimized out>}, <No data fields>},
    {<simdjson::arm64::(anonymous namespace)::simd::base_u8<unsigned char, simdjson::arm64::(anonymous namespace)::simd::simd8<bool> >> = {
        value = {97, 109, 46, 97, 99, 46, 117, 107, 47, 126, 109, 103, 107, 50, 53, 47}, static SIZE = <optimized out>}, <No data fields>},
    {<simdjson::arm64::(anonymous namespace)::simd::base_u8<unsigned char, simdjson::arm64::(anonymous namespace)::simd::simd8<bool> >> = {
        value = {62, 32, 226, 128, 148, 32, 50, 48, 48, 50, 45, 48, 55, 45, 50, 53}, static SIZE = <optimized out>}, <No data fields>}}}
(gdb)

Almost 1.48 times faster with this improvement on a Raspberry Pi 4B+,
but this may be the limit of this approach.
@youkidearitai
Contributor Author

I ran the https://github.com/simdutf/simdutf benchmark on a Raspberry Pi 4B+ together with the range2 algorithm. simdutf is fast, and I don't yet understand why. Give me some time to understand it.

The zero buffer case is probably exercising the ASCII check (range2 includes an ASCII check, which speeds it up from about 1 GB/s to about 3 GB/s).

tekimen@raspberrypi:~/src/is_utf8/benchmarks $ ./bench
random UTF-8
string size = 40096
basic_validate_utf8   0.148980 GB/s
range2                1.021823 GB/s
simdutf               1.355062 GB/s
is_utf8               1.352731 GB/s

random UTF-8
string size = 100000
basic_validate_utf8   0.149221 GB/s
range2                1.031310 GB/s
simdutf               1.373346 GB/s
is_utf8               1.372556 GB/s

random UTF-8
string size = 50000
basic_validate_utf8   0.149688 GB/s
range2                1.027117 GB/s
simdutf               1.360068 GB/s
is_utf8               1.359960 GB/s

zero buffer
string size = 40096
basic_validate_utf8   0.821548 GB/s
simdutf               8.694788 GB/s
range2                3.238876 GB/s
is_utf8               8.596844 GB/s

zero buffer
string size = 100000
basic_validate_utf8   0.841332 GB/s
simdutf               9.061406 GB/s
range2                3.325850 GB/s
is_utf8               9.017293 GB/s

zero buffer
string size = 50000
basic_validate_utf8   0.812100 GB/s
simdutf               8.889932 GB/s
range2                3.281101 GB/s
is_utf8               8.825350 GB/s

@easyaspi314

is_utf8, simdutf, and simdjson all use the same code for UTF-8 validation. Just with a different namespace.

Also yes there is a check to determine if a block is entirely ASCII which lets the code fly by all the twiddling and stuff. That is the reason it gets 9 GB/s (or 37 GB/s in my case) on an ASCII only file.
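A portable sketch of such an all-ASCII block test, for illustration. A block is ASCII exactly when no byte has its high bit set, so OR-ing the bytes together and masking with 0x80 in every byte position answers the question in a couple of operations. The names `block16_is_ascii` and `skip_ascii_blocks` are assumptions for this example; this is not the simdutf or simdjson code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if all 16 bytes at p are ASCII, i.e. no byte has its high bit set. */
static bool block16_is_ascii(const uint8_t *p)
{
    uint64_t w[2];
    memcpy(w, p, sizeof w); /* avoids unaligned or aliasing-unsafe loads */
    return ((w[0] | w[1]) & UINT64_C(0x8080808080808080)) == 0;
}

/* Number of leading bytes covered by whole 16-byte all-ASCII blocks. */
static size_t skip_ascii_blocks(const uint8_t *p, size_t len)
{
    size_t i = 0;
    while (len - i >= 16 && block16_is_ascii(p + i)) {
        i += 16;
    }
    return i;
}
```

A real validator would also have to confirm that the previous block did not end in the middle of a multi-byte sequence before skipping an ASCII block; that bookkeeping is left out of this sketch.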

@youkidearitai
Contributor Author

@easyaspi314 Thanks for the advice.

I tried to improve performance, and it got a little faster. I made the ASCII check more efficient.

I ran the simdutf benchmark again on a Raspberry Pi 4B+. If we want more speed, I need some ideas 🙇

tekimen@raspberrypi:~/src/is_utf8/benchmarks $ ./bench
random UTF-8
string size = 40096
basic_validate_utf8   0.148469 GB/s
range2                0.877376 GB/s
simdutf               1.339073 GB/s
is_utf8               1.340687 GB/s

random UTF-8
string size = 100000
basic_validate_utf8   0.148266 GB/s
range2                0.967565 GB/s
simdutf               1.366771 GB/s
is_utf8               1.367743 GB/s

random UTF-8
string size = 50000
basic_validate_utf8   0.148565 GB/s
range2                0.967007 GB/s
simdutf               1.351044 GB/s
is_utf8               1.355122 GB/s

zero buffer
string size = 40096
basic_validate_utf8   0.597182 GB/s
simdutf               8.341439 GB/s
range2                7.210790 GB/s
is_utf8               8.304694 GB/s

zero buffer
string size = 100000
basic_validate_utf8   0.597918 GB/s
simdutf               8.858230 GB/s
range2                7.608631 GB/s
is_utf8               8.843376 GB/s

zero buffer
string size = 50000
basic_validate_utf8   0.597321 GB/s
simdutf               8.610818 GB/s
range2                7.494379 GB/s
is_utf8               8.606493 GB/s
