Description
I'm trying to add an implementation of vec_extract
for the s390x-unknown-linux-gnu
target in stdarch:
https://godbolt.org/z/e65Mvf5vM
#include <vecintrin.h>
unsigned int foo(vector unsigned int a, signed int b) {
return vec_extract(a, b);
}
Turns into this LLVM
define dso_local zeroext i32 @foo(<4 x i32> noundef %a, i32 noundef signext %b) local_unnamed_addr {
entry:
%and.i = and i32 %b, 3
%vecext.i = extractelement <4 x i32> %a, i32 %and.i
ret i32 %vecext.i
}
And generates the following assembly
foo:
nilf %r2, 3 # bitwise and with 0b11 to prevent UB due to index out of bounds
vlgvf %r0, %v24, 0(%r2) # put the element at index %r2 from the vector %24 into %r0
llgfr %r2, %r0 # zero-extend that value, put it into %r2
br %r14 # ret
in particular, the vlgvf
(f
for fullword, there are variations for other widths) is the relevant instruction here. It extracts the value at the given index.
Contrary to most other targets, the index argument to a vec_extract
does not need to be const
. The std::intrinsics::simd::simd_extract
function does need its index argument to be const, and therefore can't straightforwardly be used to implement vec_extract
.
Attempt 1
I tried simple field extraction:
https://godbolt.org/z/sbhYj316x
#![feature(core_intrinsics, repr_simd, s390x_target_feature)]
#[repr(simd)]
struct vector_unsigned_int([u32; 4]);
#[no_mangle]
unsafe extern "C" fn extract_element_bad(v: vector_unsigned_int, idx: u32) -> u32 {
v.0[idx as usize % 4]
}
(extern "C"
is used so that the vector is passed by-value, but we see the same assembly when the vector is created within the function)
Indexing into the underlying array in this way may soon be banned, though there is an alternative approach that does the same thing. Unfortunately, this version does not optimize well:
define noundef zeroext i32 @extract_element_bad(<4 x i32> %0, i32 noundef zeroext %idx) unnamed_addr personality ptr @rust_eh_personality {
start:
%v = alloca [16 x i8], align 8
store <4 x i32> %0, ptr %v, align 8
%1 = and i32 %idx, 3
%_3 = zext nneg i32 %1 to i64
%2 = getelementptr inbounds nuw i32, ptr %v, i64 %_3
%_0 = load i32, ptr %2, align 4
ret i32 %_0
}
The portable-simd implementation of Index
appears to be doing the same thing, and generates the same code https://godbolt.org/z/eecM6qdbW. That totally makes sense for most targets, because a pointer load is the best you can do.
Attempt 2
I did find that this version does optimize well
#[no_mangle]
unsafe extern "C" fn extract_element_good(v: vector_unsigned_int, idx: u32) -> u32 {
match idx % 4 {
0 => simd_extract(v, 0),
1 => simd_extract(v, 1),
2 => simd_extract(v, 2),
3 => simd_extract(v, 3),
_ => unreachable!(),
}
}
define noundef zeroext i32 @extract_element_good(<4 x i32> %v, i32 noundef zeroext %idx) unnamed_addr personality ptr @rust_eh_personality {
start:
%_3 = and i32 %idx, 3
%switch.idx.cast = zext nneg i32 %_3 to i64
%0 = extractelement <4 x i32> %v, i64 %switch.idx.cast
ret i32 %0
}
but that is unwieldy kind of unwieldy, and while I can make it work for stdarch
it won't work for portable_simd
.
Solutions
I think there should be a way of indexing into a vector that emits an extractelement
rather than a getelementptr
. Semantically that seems more accurate (and might optimize better in some cases?), even though on most targets the generated assembly will be the same.
Some things I'm not sure about
- how big a deal this is for s390x (cc @uweigand do you know how important generating
vlgvf
is?) - maybe adding the extra intrinsic does still run into the complexity that Ban projecting into
repr(simd)
types compiler-team#838 wants to fix - maybe extracting with a non-
const
value is a terrible idea on some/most targets and unimplemented by design