s390x: extracting an element at a non-`const` index from a SIMD vector generates bad code

I'm trying to add an implementation of `vec_extract` for the `s390x-unknown-linux-gnu` target in stdarch:

https://godbolt.org/z/e65Mvf5vM

```c
#include <vecintrin.h>

unsigned int foo(vector unsigned int a, signed int b) { 
    return vec_extract(a, b);
}
```

Turns into this LLVM

```llvm
define dso_local zeroext i32 @foo(<4 x i32> noundef %a, i32 noundef signext %b) local_unnamed_addr {
entry:
  %and.i = and i32 %b, 3
  %vecext.i = extractelement <4 x i32> %a, i32 %and.i
  ret i32 %vecext.i
}
```

And generates the following assembly

```asm
foo:
        nilf    %r2, 3              # bitwise and with 0b11 to prevent UB due to index out of bounds
        vlgvf   %r0, %v24, 0(%r2)   # put the element at index %r2 from the vector %24 into %r0
        llgfr   %r2, %r0            # zero-extend that value, put it into %r2
        br      %r14                # ret
```

in particular, the `vlgvf` (`f` for fullword, there are variations for other widths) is the relevant instruction here. It extracts the value at the given index. 

Contrary to most other targets, the index argument to a `vec_extract` **does not** need to be `const`. The `std::intrinsics::simd::simd_extract` function **does** need its index argument to be const, and therefore can't straightforwardly be used to implement `vec_extract`.

## Attempt 1

I tried simple field extraction:

https://godbolt.org/z/sbhYj316x

```rust
#![feature(core_intrinsics, repr_simd, s390x_target_feature)]

#[repr(simd)]
struct vector_unsigned_int([u32; 4]);

#[no_mangle]
unsafe extern "C" fn extract_element_bad(v: vector_unsigned_int, idx: u32) -> u32 {
    v.0[idx as usize % 4]
}
```
(`extern "C"` is used so that the vector is passed by-value, but we see the same assembly when the vector is created within the function)

Indexing into the underlying array in this way [may soon be banned](https://github.com/rust-lang/compiler-team/issues/838), though there is [an alternative approach](https://github.com/rust-lang/rust/blob/e72df78268bafbfe28366aa0e382713f55a491e7/library/portable-simd/crates/core_simd/src/vector.rs#L186-L194) that does the same thing. Unfortunately, this version does not optimize well:

```llvm
define noundef zeroext i32 @extract_element_bad(<4 x i32> %0, i32 noundef zeroext %idx) unnamed_addr personality ptr @rust_eh_personality {
start:
  %v = alloca [16 x i8], align 8
  store <4 x i32> %0, ptr %v, align 8
  %1 = and i32 %idx, 3
  %_3 = zext nneg i32 %1 to i64
  %2 = getelementptr inbounds nuw i32, ptr %v, i64 %_3
  %_0 = load i32, ptr %2, align 4
  ret i32 %_0
}
```

The portable-simd implementation of `Index` appears to be doing [the same thing](https://doc.rust-lang.org/src/core/portable-simd/crates/core_simd/src/ops.rs.html#21), and generates the same code https://godbolt.org/z/eecM6qdbW. That totally makes sense for most targets, because a pointer load is the best you can do.

## Attempt 2

I did find that this version does optimize well

```rust
#[no_mangle]
unsafe extern "C" fn extract_element_good(v: vector_unsigned_int, idx: u32) -> u32 {
    match idx % 4 {
        0 => simd_extract(v, 0),
        1 => simd_extract(v, 1),
        2 => simd_extract(v, 2),
        3 => simd_extract(v, 3),
        _ => unreachable!(),
    }
}
```

```llvm
define noundef zeroext i32 @extract_element_good(<4 x i32> %v, i32 noundef zeroext %idx) unnamed_addr personality ptr @rust_eh_personality {
start:
  %_3 = and i32 %idx, 3
  %switch.idx.cast = zext nneg i32 %_3 to i64
  %0 = extractelement <4 x i32> %v, i64 %switch.idx.cast
  ret i32 %0
}
```

but that is unwieldy kind of unwieldy, and while I can make it work for `stdarch` it won't work for `portable_simd`.

## Solutions

I think there should be a way of indexing into a vector that emits an `extractelement` rather than a `getelementptr`. Semantically that seems more accurate (and might optimize better in some cases?), even though on most targets the generated assembly will be the same.

Some things I'm not sure about

- how big a deal this is for s390x (cc @uweigand do you know how important generating `vlgvf` is?)
- maybe adding the extra intrinsic does still run into the complexity that https://github.com/rust-lang/compiler-team/issues/838 wants to fix
- maybe extracting with a non-`const` value is a terrible idea on some/most targets and unimplemented by design

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

s390x: extracting an element at a non-`const` index from a SIMD vector generates bad code #137372

Attempt 1

Attempt 2

Solutions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

s390x: extracting an element at a non-const index from a SIMD vector generates bad code #137372

Description

Attempt 1

Attempt 2

Solutions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

s390x: extracting an element at a non-`const` index from a SIMD vector generates bad code #137372