Description
When Firefox switched from Rust 1.24.0 to Rust 1.25.0, the win32 performance of encoding_rs's UTF-8 validation function dropped 12.5% on ASCII input. encoding_rs's UTF-8 validation function is a fork of the Rust standard library's validation function. The standard library accelerates ASCII with an ALU trick that autovectorizes on x86_64 but not on i686 and works only in the aligned case; the fork replaces that trick with explicit SIMD code that handles both the aligned and unaligned cases.
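For readers unfamiliar with the ALU trick mentioned above, here is a minimal sketch of the idea: test a whole machine word for any byte with its high bit set. The names `NONASCII_MASK`, `contains_nonascii` and `ascii_prefix_len_alu` are illustrative, and the sketch copies bytes into a word rather than doing the aligned pointer reads the standard library relies on, so it is not the actual library code.

```rust
/// A word with the high bit of every byte set (0x8080...80).
const NONASCII_MASK: usize = usize::MAX / 255 * 0x80;

/// True if any byte of `word` has its high bit set, i.e. is non-ASCII.
fn contains_nonascii(word: usize) -> bool {
    word & NONASCII_MASK != 0
}

/// Illustrative helper: length of the leading all-ASCII run of `bytes`.
fn ascii_prefix_len_alu(bytes: &[u8]) -> usize {
    const WORD: usize = core::mem::size_of::<usize>();
    let mut offset = 0;
    // Fast path: check one machine word per iteration.
    while offset + WORD <= bytes.len() {
        let mut buf = [0u8; WORD];
        buf.copy_from_slice(&bytes[offset..offset + WORD]);
        if contains_nonascii(usize::from_ne_bytes(buf)) {
            break;
        }
        offset += WORD;
    }
    // Slow path: locate the exact byte, or finish the tail.
    while offset < bytes.len() && bytes[offset] < 0x80 {
        offset += 1;
    }
    offset
}
```

The mask test inspects `size_of::<usize>()` bytes with a single AND, which is why it is an ALU-only technique; whether the compiler turns such a loop into SIMD depends on the target.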
When the input is all ASCII, the function should stay in either the aligned-case or the unaligned-case inner loop, which loads 16 bytes using movdqa or movdqu, respectively, performs pmovmskb on the xmm register, and compares the result to zero, jumping back to the start of the loop if it is zero.
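To make the expected loop shape concrete, here is a rough sketch of the unaligned variant written with `core::arch` intrinsics: `_mm_loadu_si128` corresponds to movdqu and `_mm_movemask_epi8` to pmovmskb. This is not the encoding_rs source; the function names `ascii_prefix_len` and `scalar_ascii_prefix_len` are made up for illustration, and a scalar fallback is included so the sketch also builds on targets without SSE2.

```rust
#[cfg(all(target_arch = "x86", target_feature = "sse2"))]
use core::arch::x86::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};

/// Illustrative helper: length of the leading all-ASCII run of `bytes`.
#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "sse2"))]
fn ascii_prefix_len(bytes: &[u8]) -> usize {
    let mut offset = 0usize;
    while offset + 16 <= bytes.len() {
        // SAFETY: the loop condition guarantees 16 readable bytes at
        // `offset`, the unaligned load has no alignment requirement, and
        // the cfg above ensures SSE2 is enabled at compile time.
        let mask = unsafe {
            let chunk = _mm_loadu_si128(bytes.as_ptr().add(offset) as *const __m128i);
            // Collect the high bit of each of the 16 bytes (pmovmskb).
            _mm_movemask_epi8(chunk)
        };
        if mask != 0 {
            // The lowest set bit marks the first non-ASCII byte.
            return offset + mask.trailing_zeros() as usize;
        }
        offset += 16;
    }
    // Fewer than 16 bytes remain: finish byte by byte.
    offset + scalar_ascii_prefix_len(&bytes[offset..])
}

/// Fallback for targets without SSE2.
#[cfg(not(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "sse2")))]
fn ascii_prefix_len(bytes: &[u8]) -> usize {
    scalar_ascii_prefix_len(bytes)
}

/// Byte-at-a-time scan for the first byte with the high bit set.
fn scalar_ascii_prefix_len(bytes: &[u8]) -> usize {
    bytes.iter().position(|&b| b >= 0x80).unwrap_or(bytes.len())
}
```

For all-ASCII input the hot path is just load, movemask, test, add, compare and branch, which is the instruction sequence the Rust 1.24.0 listings show. An aligned variant would use `_mm_load_si128` (movdqa) once the pointer is known to be 16-byte aligned.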
When compiled for i686 Linux with opt level 2 (which Firefox uses) using Rust 1.24.0, the result is exactly as expected.
Unaligned:
.LBB12_3:
movdqu (%edx,%eax), %xmm0
pmovmskb %xmm0, %ebp
testl %ebp, %ebp
jne .LBB12_9
addl $16, %eax
cmpl %ebx, %eax
jbe .LBB12_3
jmp .LBB12_5
.p2align 4, 0x90
Aligned:
.LBB12_7:
movdqa (%edx,%eax), %xmm0
pmovmskb %xmm0, %ebp
testl %ebp, %ebp
jne .LBB12_9
addl $16, %eax
cmpl %ebx, %eax
jbe .LBB12_7
.p2align 4, 0x90
(Windows wouldn't let me see the asm due to LLVM deeming the IR invalid with --emit asm.)
When compiled with Rust 1.25.0, the result is more complicated:
- There are two instances of movdqa and two instances of movdqu, suggesting that the first trip through the loop has been unrolled to be a separate copy from the loop proper.
- In the actual loop, ALU instructions have been moved around, including placing one between the SSE2 instructions.
Both of these transformations look like plausible optimizations, but considering the performance result from Firefox CI, it seems these transformations made performance worse.
.LBB16_1:
movl %edx, %ebp
leal (%ecx,%edi), %ebx
movl $0, %esi
subl %edi, %ebp
cmpl $16, %ebp
jb .LBB16_22
leal -16(%ebp), %eax
testb $15, %bl
movl %eax, 20(%esp)
je .LBB16_9
movdqu (%ebx), %xmm0
movl %edx, 12(%esp)
xorl %eax, %eax
pmovmskb %xmm0, %edx
testl %edx, %edx
jne .LBB16_7
movl 24(%esp), %eax
xorl %esi, %esi
leal (%eax,%edi), %ecx
.p2align 4, 0x90
.LBB16_5:
leal 16(%esi), %eax
cmpl 20(%esp), %eax
ja .LBB16_20
movdqu (%ecx,%esi), %xmm0
movl %eax, %esi
pmovmskb %xmm0, %edx
testl %edx, %edx
je .LBB16_5
.LBB16_7:
testl %edx, %edx
je .LBB16_12
bsfl %edx, %esi
jmp .LBB16_13
.LBB16_9:
movdqa (%ebx), %xmm0
xorl %ecx, %ecx
pmovmskb %xmm0, %eax
testl %eax, %eax
je .LBB16_15
testl %eax, %eax
je .LBB16_19
.LBB16_11:
bsfl %eax, %esi
addl %ecx, %esi
jmp .LBB16_14
.LBB16_12:
movl $32, %esi
.LBB16_13:
movl 12(%esp), %edx
addl %eax, %esi
.LBB16_14:
movb (%ebx,%esi), %al
jmp .LBB16_24
.LBB16_15:
movl $16, %esi
.p2align 4, 0x90
.LBB16_16:
cmpl 20(%esp), %esi
ja .LBB16_22
movdqa (%ebx,%esi), %xmm0
addl $16, %esi
pmovmskb %xmm0, %eax
testl %eax, %eax
je .LBB16_16
addl $-16, %esi
movl %esi, %ecx
testl %eax, %eax
jne .LBB16_11
The asm was obtained by compiling encoding_rs (Firefox uses 0.7.2) using RUSTC_BOOTSTRAP=1 RUSTFLAGS='-C opt-level=2 --emit asm' cargo build --target i686-unknown-linux-gnu --release --features simd-accel and searching for utf8_valid_up_to in the .s file.