Skip to content

s390x: extracting an element at a non-const index from a SIMD vector generates bad code #137372

Closed
@folkertdev

Description

@folkertdev

I'm trying to add an implementation of vec_extract for the s390x-unknown-linux-gnu target in stdarch:

https://godbolt.org/z/e65Mvf5vM

#include <vecintrin.h>

unsigned int foo(vector unsigned int a, signed int b) { 
    return vec_extract(a, b);
}

Turns into this LLVM

define dso_local zeroext i32 @foo(<4 x i32> noundef %a, i32 noundef signext %b) local_unnamed_addr {
entry:
  %and.i = and i32 %b, 3
  %vecext.i = extractelement <4 x i32> %a, i32 %and.i
  ret i32 %vecext.i
}

And generates the following assembly

foo:
        nilf    %r2, 3              # bitwise and with 0b11 to prevent UB due to index out of bounds
        vlgvf   %r0, %v24, 0(%r2)   # put the element at index %r2 from the vector %24 into %r0
        llgfr   %r2, %r0            # zero-extend that value, put it into %r2
        br      %r14                # ret

in particular, the vlgvf (f for fullword, there are variations for other widths) is the relevant instruction here. It extracts the value at the given index.

Contrary to most other targets, the index argument to a vec_extract does not need to be const. The std::intrinsics::simd::simd_extract function does need its index argument to be const, and therefore can't straightforwardly be used to implement vec_extract.

Attempt 1

I tried simple field extraction:

https://godbolt.org/z/sbhYj316x

#![feature(core_intrinsics, repr_simd, s390x_target_feature)]

#[repr(simd)]
struct vector_unsigned_int([u32; 4]);

#[no_mangle]
unsafe extern "C" fn extract_element_bad(v: vector_unsigned_int, idx: u32) -> u32 {
    v.0[idx as usize % 4]
}

(extern "C" is used so that the vector is passed by-value, but we see the same assembly when the vector is created within the function)

Indexing into the underlying array in this way may soon be banned, though there is an alternative approach that does the same thing. Unfortunately, this version does not optimize well:

define noundef zeroext i32 @extract_element_bad(<4 x i32> %0, i32 noundef zeroext %idx) unnamed_addr personality ptr @rust_eh_personality {
start:
  %v = alloca [16 x i8], align 8
  store <4 x i32> %0, ptr %v, align 8
  %1 = and i32 %idx, 3
  %_3 = zext nneg i32 %1 to i64
  %2 = getelementptr inbounds nuw i32, ptr %v, i64 %_3
  %_0 = load i32, ptr %2, align 4
  ret i32 %_0
}

The portable-simd implementation of Index appears to be doing the same thing, and generates the same code https://godbolt.org/z/eecM6qdbW. That totally makes sense for most targets, because a pointer load is the best you can do.

Attempt 2

I did find that this version does optimize well

#[no_mangle]
unsafe extern "C" fn extract_element_good(v: vector_unsigned_int, idx: u32) -> u32 {
    match idx % 4 {
        0 => simd_extract(v, 0),
        1 => simd_extract(v, 1),
        2 => simd_extract(v, 2),
        3 => simd_extract(v, 3),
        _ => unreachable!(),
    }
}
define noundef zeroext i32 @extract_element_good(<4 x i32> %v, i32 noundef zeroext %idx) unnamed_addr personality ptr @rust_eh_personality {
start:
  %_3 = and i32 %idx, 3
  %switch.idx.cast = zext nneg i32 %_3 to i64
  %0 = extractelement <4 x i32> %v, i64 %switch.idx.cast
  ret i32 %0
}

but that is unwieldy kind of unwieldy, and while I can make it work for stdarch it won't work for portable_simd.

Solutions

I think there should be a way of indexing into a vector that emits an extractelement rather than a getelementptr. Semantically that seems more accurate (and might optimize better in some cases?), even though on most targets the generated assembly will be the same.

Some things I'm not sure about

  • how big a deal this is for s390x (cc @uweigand do you know how important generating vlgvf is?)
  • maybe adding the extra intrinsic does still run into the complexity that Ban projecting into repr(simd) types compiler-team#838 wants to fix
  • maybe extracting with a non-const value is a terrible idea on some/most targets and unimplemented by design

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-optimizationCategory: An issue highlighting optimization opportunities or PRs implementing suchI-slowIssue: Problems and improvements with respect to performance of generated code.O-SystemZTarget: SystemZ processors (s390x)T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions