s390x: __builtin_reduce_and does not optimize well #129434

Closed
@folkertdev

Given this C code:

https://godbolt.org/z/WvfG8TTxf

#include <vecintrin.h>
#include <stdbool.h>

bool vectors_equal_builtin(vector int a, vector int b) {
    return vec_all_eq(a, b);
}

typedef int vec4i __attribute__((vector_size(16)));

bool vectors_equal_manual(vec4i a, vec4i b) {
    return __builtin_reduce_and(a == b);
}

The manual implementation does not optimize to the same code as the builtin one:

vectors_equal_builtin:
        vceqfs  %v0, %v24, %v26
        lghi    %r2, 0
        locghie %r2, 1
        br      %r14

vectors_equal_manual:
        aghi    %r15, -168
        vceqf   %v0, %v24, %v26
        vno     %v0, %v0, %v0
        vlgvf   %r1, %v0, 0
        vlgvf   %r0, %v0, 1
        sll     %r1, 3
        rosbg   %r1, %r0, 61, 61, 2
        vlgvf   %r0, %v0, 2
        rosbg   %r1, %r0, 62, 62, 1
        vlgvf   %r0, %v0, 3
        rosbg   %r1, %r0, 63, 63, 0
        tmll    %r1, 15
        lghi    %r2, 0
        locghie %r2, 1
        aghi    %r15, 168
        br      %r14

The LLVM IR for the manual version:

define dso_local noundef zeroext i1 @vectors_equal_manual(<4 x i32> noundef %a, <4 x i32> noundef %b) local_unnamed_addr {
entry:
  %0 = icmp ne <4 x i32> %a, %b
  %1 = bitcast <4 x i1> %0 to i4
  %2 = icmp eq i4 %1, 0
  ret i1 %2
}
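
For readers less familiar with the IR, here is a rough scalar model in C of what it computes: one "differs" bit per lane, then a test that the 4-bit mask is zero. This is only a sketch. The name `vectors_equal_mask_model` is hypothetical, and the lane-to-bit order is arbitrary here since only the zero test matters.

```c
#include <stdbool.h>

typedef int vec4i __attribute__((vector_size(16)));

/* Hypothetical scalar model of the IR above; not what any compiler emits. */
bool vectors_equal_mask_model(vec4i a, vec4i b) {
    unsigned mask = 0;

    /* icmp ne <4 x i32> followed by bitcast to i4:
       collect one "this lane differs" bit per lane. */
    for (int i = 0; i < 4; i++) {
        mask |= (unsigned)(a[i] != b[i]) << i;
    }

    /* icmp eq i4 %1, 0: the vectors are equal only if no lane differs. */
    return mask == 0;
}
```

The s390x output for `vectors_equal_manual` has much the same shape: it extracts each lane with `vlgvf`, merges the bits with `rosbg`, and tests the result with `tmll`.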

There are many variations on vec_all_eq (see https://www.ibm.com/docs/en/zos/2.4.0?topic=functions-any-predicates), and it would be neat if those all optimized. It might be possible to simplify clang's vecintrin.h too.
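
To make "variations" concrete, here is a sketch of how a few of the all/any predicates could be written in the same manual style. The `*_sketch` names are hypothetical and only loosely correspond to `vec_all_eq`, `vec_any_eq`, `vec_all_gt` and `vec_any_ne`. The lane-wise fallback is an assumption added so the snippet also builds with compilers that lack `__builtin_reduce_and` / `__builtin_reduce_or`.

```c
#include <stdbool.h>

typedef int vec4i __attribute__((vector_size(16)));

/* Older compilers may not provide __has_builtin at all. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* AND of all lanes; comparison results are -1 (true) or 0 (false) per lane. */
static inline int reduce_and_sketch(vec4i m) {
#if __has_builtin(__builtin_reduce_and)
    return __builtin_reduce_and(m);
#else
    return m[0] & m[1] & m[2] & m[3];   /* assumed fallback */
#endif
}

/* OR of all lanes. */
static inline int reduce_or_sketch(vec4i m) {
#if __has_builtin(__builtin_reduce_or)
    return __builtin_reduce_or(m);
#else
    return m[0] | m[1] | m[2] | m[3];   /* assumed fallback */
#endif
}

/* "all" predicates reduce with AND, "any" predicates reduce with OR. */
bool all_eq_sketch(vec4i a, vec4i b) { return reduce_and_sketch(a == b) != 0; }
bool any_eq_sketch(vec4i a, vec4i b) { return reduce_or_sketch(a == b) != 0; }
bool all_gt_sketch(vec4i a, vec4i b) { return reduce_and_sketch(a > b) != 0; }
bool any_ne_sketch(vec4i a, vec4i b) { return reduce_or_sketch(a != b) != 0; }
```

Whether each of these lowers to a single compare-and-set-condition-code instruction on s390x is exactly the open question here; the sketch only shows the source-level pattern.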

This came up while implementing vec_all_eq in the Rust standard library, where fewer custom intrinsics are better in every way.

cc @uweigand (posted here so it can be linked to)
