Description
This was found while building rustc 1.33.0 on powerpc64-unknown-linux-musl. Test run-pass/simd/simd-intrinsic-generic-select fails at the third simd_select_bitmask test:
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `u32x8(8, 1, 10, 3, 12, 5, 14, 7)`,
right: `u32x8(0, 9, 2, 11, 4, 13, 6, 15)`', src/test/run-pass/simd/simd-intrinsic-generic-select.rs:159:9
The LLVM IR generated for this test case on x86_64-unknown-linux-musl and powerpc64-unknown-linux-musl (respectively) is identical:
%1036 = load <8 x i32>, <8 x i32>* %a83, align 32
%1037 = load <8 x i32>, <8 x i32>* %b84, align 32
%1038 = select <8 x i1> bitcast (<1 x i8> <i8 -16> to <8 x i1>), <8 x i32> %1036, <8 x i32> %1037
store <8 x i32> %1038, <8 x i32>* %r101, align 32
%1036 = load <8 x i32>, <8 x i32>* %a83, align 32
%1037 = load <8 x i32>, <8 x i32>* %b84, align 32
%1038 = select <8 x i1> bitcast (<1 x i8> <i8 -16> to <8 x i1>), <8 x i32> %1036, <8 x i32> %1037
store <8 x i32> %1038, <8 x i32>* %r101, align 32
The test appears to expect that the bitmask is interpreted as bitwise little-endian (in other words, the LSB selects the first element from the vectors). However, the implementation uses a bitcast to a vector of i1. On big-endian architectures such as powerpc64, the LSB becomes the last element of this i1 vector, not the first.
Unfortunately, all of the upstream test cases are symmetrical, so "choosing the wrong vector" and "reading the bitmask backwards" are indistinguishable. I added an additional test case:
let r: u32x8 = simd_select_bitmask(0b11110101u8, a, b);
let e = u32x8(0, 9, 2, 11, 4, 5, 6, 7);
assert_eq!(r, e);
This passes on x86_64-unknown-linux-musl, but fails on powerpc64-unknown-linux-musl with:
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `u32x8(0, 1, 2, 3, 12, 5, 14, 7)`,
right: `u32x8(0, 9, 2, 11, 4, 5, 6, 7)`', src/test/run-pass/simd/simd-intrinsic-generic-select.rs:50:9
The two "unlike" elements were chosen from the wrong place in the vector.
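To make the two interpretations concrete, here is a minimal plain-Rust model of them. It is only a sketch: the function names select_lsb_first and select_msb_first are hypothetical, plain arrays stand in for u32x8, and nothing here touches the real intrinsic or LLVM's lowering.

```rust
/// Hypothetical model: bit `i` of the mask, counting from the LSB,
/// controls lane `i`. This is what the test expects.
fn select_lsb_first(mask: u8, a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    let mut r = [0u32; 8];
    for i in 0..8 {
        r[i] = if (mask >> i) & 1 == 1 { a[i] } else { b[i] };
    }
    r
}

/// Hypothetical model: the MSB of the mask controls lane 0.
/// This matches the results observed on powerpc64.
fn select_msb_first(mask: u8, a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    let mut r = [0u32; 8];
    for i in 0..8 {
        r[i] = if (mask >> (7 - i)) & 1 == 1 { a[i] } else { b[i] };
    }
    r
}
```

With a = [0, 1, ..., 7] and b = [8, 9, ..., 15], the mask 0b11110101 gives [0, 9, 2, 11, 4, 5, 6, 7] under the LSB-first model (the expected value) and [0, 1, 2, 3, 12, 5, 14, 7] under the MSB-first model (the value reported on powerpc64).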
Since the vectors are 256 bits, and the POWER VSX registers are only 128 bits wide, LLVM must split each vector across two registers. The following powerpc64 assembly is from the last test case (0b11110000u8), and clearly shows that LLVM (not the hardware) picks the first half of a and the second half of b:
0x000000000000416c <+36>: li r3,0
36 unsafe {
37 let a = u32x8(0, 1, 2, 3, 4, 5, 6, 7);
0x0000000000004170 <+40>: stw r3,288(r1)
0x0000000000004174 <+44>: li r3,1
0x0000000000004178 <+48>: stw r3,292(r1)
0x000000000000417c <+52>: li r3,2
0x0000000000004180 <+56>: stw r3,296(r1)
0x0000000000004184 <+60>: li r3,3
0x0000000000004188 <+64>: stw r3,300(r1)
0x000000000000418c <+68>: li r3,4
0x0000000000004190 <+72>: stw r3,304(r1)
0x0000000000004194 <+76>: li r3,5
0x0000000000004198 <+80>: stw r3,308(r1)
0x000000000000419c <+84>: li r3,6
0x00000000000041a0 <+88>: stw r3,312(r1)
0x00000000000041a4 <+92>: li r3,7
0x00000000000041a8 <+96>: stw r3,316(r1)
0x00000000000041ac <+100>: li r3,8
38 let b = u32x8(8, 9, 10, 11, 12, 13, 14, 15);
0x00000000000041b0 <+104>: stw r3,320(r1)
0x00000000000041b4 <+108>: li r3,9
0x00000000000041b8 <+112>: stw r3,324(r1)
0x00000000000041bc <+116>: li r3,10
0x00000000000041c0 <+120>: stw r3,328(r1)
0x00000000000041c4 <+124>: li r3,11
0x00000000000041c8 <+128>: stw r3,332(r1)
0x00000000000041cc <+132>: addi r3,r1,336
0x00000000000041d0 <+136>: li r4,12
0x00000000000041d4 <+140>: stw r4,336(r1)
0x00000000000041d8 <+144>: li r4,13
0x00000000000041dc <+148>: stw r4,340(r1)
0x00000000000041e0 <+152>: li r4,14
0x00000000000041e4 <+156>: stw r4,344(r1)
0x00000000000041e8 <+160>: li r4,15
0x00000000000041ec <+164>: stw r4,348(r1)
...
60 let r: u32x8 = simd_select_bitmask(0b11110000u8, a, b);
0x00000000000048cc <+1924>: addi r3,r1,288
0x00000000000048d0 <+1928>: lvx v2,0,r3
0x00000000000048d4 <+1932>: addi r3,r1,336
0x00000000000048d8 <+1936>: lvx v3,0,r3
0x00000000000048dc <+1940>: addi r3,r1,1488
0x00000000000048e0 <+1944>: stvx v3,0,r3
0x00000000000048e4 <+1948>: addi r3,r1,1472
0x00000000000048e8 <+1952>: stvx v2,0,r3
So is this a bug in the test, because it should be ensuring that the bitmask is in native vector/endian order? Or in the implementation of simd_select_bitmask, because it should always take a little-endian bitmask and reverse the bits as necessary?
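For illustration, the second option could look roughly like the sketch below. The helper to_lane_order is hypothetical and is not how rustc implements the intrinsic. It only shows the bit reversal that would be needed before the bitcast on a big-endian target, assuming an 8-lane mask held in a u8.

```rust
/// Hypothetical helper: takes a mask in which the LSB selects lane 0 and
/// returns the value to bitcast to <8 x i1>, so that lane 0 is still
/// controlled by the original LSB on either endianness.
fn to_lane_order(mask: u8, target_is_big_endian: bool) -> u8 {
    if target_is_big_endian {
        // On big-endian targets the bitcast puts the LSB in the last lane,
        // so the bit order is flipped first.
        mask.reverse_bits()
    } else {
        mask
    }
}
```

In real code the flag would come from the target being compiled for, for example cfg!(target_endian = "big"); it is a parameter here so that both cases can be exercised on any host. For the extra test case above, 0b11110101 would become 0b10101111 on a big-endian target.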