Rust 1.25.0 regressed the performance of encoding_rs's UTF-8 validation on i686 #49873

Closed
@hsivonen

When Firefox switched from Rust 1.24.0 to Rust 1.25.0, the win32 performance of encoding_rs's UTF-8 validation function dropped by 12.5% on ASCII input. That function is a fork of the Rust standard library's validation function. The fork replaces the standard library's ASCII-acceleration ALU trick, which autovectorizes on x86_64 but not on i686 and works only in the aligned case, with explicit SIMD code that handles both the aligned and unaligned cases.
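
For readers unfamiliar with the ALU trick mentioned above, here is a minimal sketch of the idea: test a whole machine word for non-ASCII bytes by masking the high bit of every byte. This is not the standard library's code. The names `ascii_prefix_len_alu`, `load_word` and `NONASCII_MASK` are made up for illustration, and the sketch copies bytes into a word, so unlike the version described above it does not depend on pointer alignment.

```rust
// Size of a machine word in bytes (4 on i686, 8 on x86_64).
const WORD: usize = core::mem::size_of::<usize>();

// A word with 0x80 in every byte: any set bit under this mask means
// the corresponding byte is non-ASCII.
const NONASCII_MASK: usize = usize::from_ne_bytes([0x80; WORD]);

// Hypothetical helper: copy WORD bytes starting at `at` into a usize.
fn load_word(bytes: &[u8], at: usize) -> usize {
    let mut buf = [0u8; WORD];
    buf.copy_from_slice(&bytes[at..at + WORD]);
    usize::from_ne_bytes(buf)
}

// Hypothetical function: length of the leading all-ASCII prefix.
pub fn ascii_prefix_len_alu(bytes: &[u8]) -> usize {
    let mut offset = 0;
    // Check two words per iteration using plain integer (ALU) operations.
    while offset + 2 * WORD <= bytes.len() {
        let a = load_word(bytes, offset);
        let b = load_word(bytes, offset + WORD);
        if (a | b) & NONASCII_MASK != 0 {
            break;
        }
        offset += 2 * WORD;
    }
    // Finish byte by byte; this also locates the exact non-ASCII byte
    // inside a block that failed the word-wide check.
    offset + bytes[offset..].iter().take_while(|b| b.is_ascii()).count()
}
```

Whether a compiler turns such a loop into SIMD instructions depends on the target and optimizer version, which is why the fork spells the SIMD out explicitly.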

When the input is all ASCII, the function should stay in either the aligned-case or the unaligned-case inner loop. That loop loads 16 bytes using movdqa or movdqu, respectively, performs pmovmskb on the xmm register, compares the result to zero, and jumps back to the start of the loop if it is zero.
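
The loop shape described above can be sketched with the stable `core::arch` intrinsics. This is an illustrative approximation, not encoding_rs's actual source: the function name `ascii_prefix_len_sse2` is hypothetical, only the unaligned-load variant is shown, and the fallback for non-x86 targets is there only so the sketch compiles everywhere.

```rust
#[cfg(target_arch = "x86")]
use core::arch::x86::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};

// Hypothetical function: length of the leading all-ASCII prefix.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn ascii_prefix_len_sse2(bytes: &[u8]) -> usize {
    let mut offset = 0;
    if is_x86_feature_detected!("sse2") {
        while offset + 16 <= bytes.len() {
            // SAFETY: the loop condition guarantees 16 readable bytes at
            // `offset`, `_mm_loadu_si128` has no alignment requirement,
            // and SSE2 support was checked above.
            let mask = unsafe {
                let chunk = _mm_loadu_si128(bytes.as_ptr().add(offset) as *const __m128i);
                // Collects the high bit of each of the 16 bytes (pmovmskb).
                _mm_movemask_epi8(chunk)
            };
            if mask != 0 {
                // The lowest set bit marks the first non-ASCII byte.
                return offset + mask.trailing_zeros() as usize;
            }
            offset += 16;
        }
    }
    // Scalar handling for the last fewer-than-16 bytes.
    offset + bytes[offset..].iter().take_while(|b| b.is_ascii()).count()
}

// Scalar fallback so the sketch also builds on non-x86 targets.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn ascii_prefix_len_sse2(bytes: &[u8]) -> usize {
    bytes.iter().take_while(|b| b.is_ascii()).count()
}
```

For all-ASCII input, the hot path is the load, the mask extraction, the zero test and the offset increment, which corresponds to the short loops in the 1.24.0 listings.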

When compiled for i686 Linux with opt level 2 (which Firefox uses) using Rust 1.24.0, the result is exactly as expected.

Unaligned:

.LBB12_3:
	movdqu	(%edx,%eax), %xmm0
	pmovmskb	%xmm0, %ebp
	testl	%ebp, %ebp
	jne	.LBB12_9
	addl	$16, %eax
	cmpl	%ebx, %eax
	jbe	.LBB12_3
	jmp	.LBB12_5
	.p2align	4, 0x90

Aligned:

.LBB12_7:
	movdqa	(%edx,%eax), %xmm0
	pmovmskb	%xmm0, %ebp
	testl	%ebp, %ebp
	jne	.LBB12_9
	addl	$16, %eax
	cmpl	%ebx, %eax
	jbe	.LBB12_7
	.p2align	4, 0x90

(Windows wouldn't let me see the asm due to LLVM deeming the IR invalid with --emit asm.)

When compiled with Rust 1.25.0, the result is more complicated:

  1. There are two instances of movdqa and two instances of movdqu, which suggests that the first iteration of each loop has been peeled off into a separate copy ahead of the loop proper.
  2. In the actual loop, ALU instructions have been moved around, and one of them now sits between the two SSE2 instructions.

Both transformations look like plausible optimizations, but the performance results from Firefox CI suggest that they made performance worse.

.LBB16_1:
	movl	%edx, %ebp
	leal	(%ecx,%edi), %ebx
	movl	$0, %esi
	subl	%edi, %ebp
	cmpl	$16, %ebp
	jb	.LBB16_22
	leal	-16(%ebp), %eax
	testb	$15, %bl
	movl	%eax, 20(%esp)
	je	.LBB16_9
	movdqu	(%ebx), %xmm0
	movl	%edx, 12(%esp)
	xorl	%eax, %eax
	pmovmskb	%xmm0, %edx
	testl	%edx, %edx
	jne	.LBB16_7
	movl	24(%esp), %eax
	xorl	%esi, %esi
	leal	(%eax,%edi), %ecx
	.p2align	4, 0x90
.LBB16_5:
	leal	16(%esi), %eax
	cmpl	20(%esp), %eax
	ja	.LBB16_20
	movdqu	(%ecx,%esi), %xmm0
	movl	%eax, %esi
	pmovmskb	%xmm0, %edx
	testl	%edx, %edx
	je	.LBB16_5
.LBB16_7:
	testl	%edx, %edx
	je	.LBB16_12
	bsfl	%edx, %esi
	jmp	.LBB16_13
.LBB16_9:
	movdqa	(%ebx), %xmm0
	xorl	%ecx, %ecx
	pmovmskb	%xmm0, %eax
	testl	%eax, %eax
	je	.LBB16_15
	testl	%eax, %eax
	je	.LBB16_19
.LBB16_11:
	bsfl	%eax, %esi
	addl	%ecx, %esi
	jmp	.LBB16_14
.LBB16_12:
	movl	$32, %esi
.LBB16_13:
	movl	12(%esp), %edx
	addl	%eax, %esi
.LBB16_14:
	movb	(%ebx,%esi), %al
	jmp	.LBB16_24
.LBB16_15:
	movl	$16, %esi
	.p2align	4, 0x90
.LBB16_16:
	cmpl	20(%esp), %esi
	ja	.LBB16_22
	movdqa	(%ebx,%esi), %xmm0
	addl	$16, %esi
	pmovmskb	%xmm0, %eax
	testl	%eax, %eax
	je	.LBB16_16
	addl	$-16, %esi
	movl	%esi, %ecx
	testl	%eax, %eax
	jne	.LBB16_11

The asm was obtained by compiling encoding_rs (Firefox uses 0.7.2) with the following command and then searching for utf8_valid_up_to in the .s file:

RUSTC_BOOTSTRAP=1 RUSTFLAGS='-C opt-level=2 --emit asm' cargo build --target i686-unknown-linux-gnu --release --features simd-accel

    Labels

    A-LLVM (Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.)
    I-slow (Issue: Problems and improvements with respect to performance of generated code.)
    P-medium (Medium priority)
    WG-llvm (Working group: LLVM backend code generation)
    regression-from-stable-to-stable (Performance or correctness regression from one stable version to another.)