[RISCV] Vectorize phi for loop carried @llvm.vp.reduce.* #131974

Merged: 3 commits merged into llvm:main on Mar 31, 2025

Conversation

@NexMing (Contributor) commented Mar 19, 2025

LLVM vector predication reduction intrinsics return a scalar result, but on RISC-V, vector reduction instructions write the result to the first element of a vector register. So when a reduction in a loop uses a scalar phi, we end up with unnecessary scalar moves:

loop:
    vmv.s.x v8, zero
    vredsum.vs v8, v10, v8
    vmv.x.s a0, v8

This mainly affects vector predication reductions. This patch teaches RISCVCodeGenPrepare to vectorize any scalar phi that feeds into a vector predication reduction, converting:

vector.body:
%red.phi = phi i32 [ ..., %entry ], [ %red, %vector.body ]
%red = tail call i32 @llvm.vp.reduce.add.nxv4i32(i32 %red.phi, <vscale x 4 x i32> %wide.load, <vscale x 4 x i1> splat (i1 true), i32 %evl)

to

vector.body:
%red.phi = phi <vscale x 2 x i32> [ ..., %entry ], [ %acc.vec, %vector.body]
%phi.scalar = extractelement <vscale x 2 x i32> %red.phi, i64 0
%acc = tail call i32 @llvm.vp.reduce.add.nxv4i32(i32 %phi.scalar, <vscale x 4 x i32> %wide.load, <vscale x 4 x i1> splat (i1 true), i32 %evl)
%acc.vec = insertelement <vscale x 2 x i32> poison, i32 %acc, i64 0

This eliminates the scalar -> vector -> scalar crossing during instruction selection.
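For comparison, the vp_reduce_add check lines added below show the effect: the accumulator now stays in v8 across iterations, with the scalar moves hoisted out of the loop. Condensed here, with the vsetvli and address arithmetic elided:

```asm
# entry: seed the accumulator in v8 once, before the loop
    vmv.s.x v8, zero
.LBB1_1:                        # vector.body
    vle32.v v10, (a4)           # load the next chunk
    vredsum.vs v8, v10, v8      # accumulate directly into v8
    bnez a2, .LBB1_1
# for.cond.cleanup: one scalar extract after the loop
    vmv.x.s a0, v8
```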

@llvmbot (Member) commented Mar 19, 2025

@llvm/pr-subscribers-backend-risc-v

Author: MingYan (NexMing)

Changes

This patch is the vector predication version of commit 15b0fab


Patch is 52.25 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/131974.diff

3 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp (+2-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll (+456)
  • (modified) llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare.ll (+484)
diff --git a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
index 5be5345cca73a..39877fb511ec3 100644
--- a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
+++ b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
@@ -137,7 +137,8 @@ bool RISCVCodeGenPrepare::visitIntrinsicInst(IntrinsicInst &I) {
   if (expandVPStrideLoad(I))
     return true;
 
-  if (I.getIntrinsicID() != Intrinsic::vector_reduce_fadd)
+  if (I.getIntrinsicID() != Intrinsic::vector_reduce_fadd &&
+      !isa<VPReductionIntrinsic>(&I))
     return false;
 
   auto *PHI = dyn_cast<PHINode>(I.getOperand(0));
diff --git a/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll
index 3bbdd1a257fdb..4e5f6e0f65489 100644
--- a/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll
@@ -42,3 +42,459 @@ vector.body:
 exit:
   ret float %acc
 }
+
+define i32 @vp_reduce_add(ptr %a) {
+; CHECK-LABEL: vp_reduce_add:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a1, 0
+; CHECK-NEXT:    vsetvli a2, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.s.x v8, zero
+; CHECK-NEXT:    li a2, 1024
+; CHECK-NEXT:  .LBB1_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vsetvli a3, a2, e32, m2, ta, ma
+; CHECK-NEXT:    slli a4, a1, 2
+; CHECK-NEXT:    add a4, a0, a4
+; CHECK-NEXT:    vle32.v v10, (a4)
+; CHECK-NEXT:    sub a2, a2, a3
+; CHECK-NEXT:    vredsum.vs v8, v10, v8
+; CHECK-NEXT:    add a1, a1, a3
+; CHECK-NEXT:    bnez a2, .LBB1_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    vmv.x.s a0, v8
+; CHECK-NEXT:    ret
+entry:
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %trip.count = phi i64 [ 1024, %entry ], [ %remaining.trip.count, %vector.body ]
+  %scalar.ind = phi i64 [ 0, %entry ], [ %next.ind, %vector.body ]
+  %red.phi = phi i32 [ 0, %entry ], [ %red, %vector.body ]
+  %evl = tail call i32 @llvm.experimental.get.vector.length.i64(i64 %trip.count, i32 4, i1 true)
+  %evl2 = zext i32 %evl to i64
+  %arrayidx6 = getelementptr inbounds i32, ptr %a, i64 %scalar.ind
+  %wide.load = tail call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %arrayidx6, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %red = tail call i32 @llvm.vp.reduce.add.nxv4i32(i32 %red.phi, <vscale x 4 x i32> %wide.load, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %remaining.trip.count = sub nuw i64 %trip.count, %evl2
+  %next.ind = add i64 %scalar.ind, %evl2
+  %m = icmp eq i64 %remaining.trip.count, 0
+  br i1 %m, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret i32 %red
+}
+
+define i32 @vp_reduce_and(ptr %a) {
+; CHECK-LABEL: vp_reduce_and:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a1, 0
+; CHECK-NEXT:    lui a2, 524288
+; CHECK-NEXT:    vsetvli a3, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.s.x v8, a2
+; CHECK-NEXT:    li a2, 1024
+; CHECK-NEXT:  .LBB2_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vsetvli a3, a2, e32, m2, ta, ma
+; CHECK-NEXT:    slli a4, a1, 2
+; CHECK-NEXT:    add a4, a0, a4
+; CHECK-NEXT:    vle32.v v10, (a4)
+; CHECK-NEXT:    sub a2, a2, a3
+; CHECK-NEXT:    vredand.vs v8, v10, v8
+; CHECK-NEXT:    add a1, a1, a3
+; CHECK-NEXT:    bnez a2, .LBB2_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    vmv.x.s a0, v8
+; CHECK-NEXT:    ret
+entry:
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %trip.count = phi i64 [ 1024, %entry ], [ %remaining.trip.count, %vector.body ]
+  %scalar.ind = phi i64 [ 0, %entry ], [ %next.ind, %vector.body ]
+  %red.phi = phi i32 [ -2147483648, %entry ], [ %red, %vector.body ]
+  %evl = tail call i32 @llvm.experimental.get.vector.length.i64(i64 %trip.count, i32 4, i1 true)
+  %evl2 = zext i32 %evl to i64
+  %arrayidx6 = getelementptr inbounds i32, ptr %a, i64 %scalar.ind
+  %wide.load = tail call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %arrayidx6, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %red = tail call i32 @llvm.vp.reduce.and.nxv4i32(i32 %red.phi, <vscale x 4 x i32> %wide.load, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %remaining.trip.count = sub nuw i64 %trip.count, %evl2
+  %next.ind = add i64 %scalar.ind, %evl2
+  %m = icmp eq i64 %remaining.trip.count, 0
+  br i1 %m, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret i32 %red
+}
+
+define i32 @vp_reduce_or(ptr %a) {
+; CHECK-LABEL: vp_reduce_or:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a1, 0
+; CHECK-NEXT:    vsetvli a2, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.s.x v8, zero
+; CHECK-NEXT:    li a2, 1024
+; CHECK-NEXT:  .LBB3_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vsetvli a3, a2, e32, m2, ta, ma
+; CHECK-NEXT:    slli a4, a1, 2
+; CHECK-NEXT:    add a4, a0, a4
+; CHECK-NEXT:    vle32.v v10, (a4)
+; CHECK-NEXT:    sub a2, a2, a3
+; CHECK-NEXT:    vredor.vs v8, v10, v8
+; CHECK-NEXT:    add a1, a1, a3
+; CHECK-NEXT:    bnez a2, .LBB3_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    vmv.x.s a0, v8
+; CHECK-NEXT:    ret
+entry:
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %trip.count = phi i64 [ 1024, %entry ], [ %remaining.trip.count, %vector.body ]
+  %scalar.ind = phi i64 [ 0, %entry ], [ %next.ind, %vector.body ]
+  %red.phi = phi i32 [ 0, %entry ], [ %red, %vector.body ]
+  %evl = tail call i32 @llvm.experimental.get.vector.length.i64(i64 %trip.count, i32 4, i1 true)
+  %evl2 = zext i32 %evl to i64
+  %arrayidx6 = getelementptr inbounds i32, ptr %a, i64 %scalar.ind
+  %wide.load = tail call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %arrayidx6, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %red = tail call i32 @llvm.vp.reduce.or.nxv4i32(i32 %red.phi, <vscale x 4 x i32> %wide.load, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %remaining.trip.count = sub nuw i64 %trip.count, %evl2
+  %next.ind = add i64 %scalar.ind, %evl2
+  %m = icmp eq i64 %remaining.trip.count, 0
+  br i1 %m, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret i32 %red
+}
+
+define i32 @vp_reduce_xor(ptr %a) {
+; CHECK-LABEL: vp_reduce_xor:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a1, 0
+; CHECK-NEXT:    vsetvli a2, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.s.x v8, zero
+; CHECK-NEXT:    li a2, 1024
+; CHECK-NEXT:  .LBB4_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vsetvli a3, a2, e32, m2, ta, ma
+; CHECK-NEXT:    slli a4, a1, 2
+; CHECK-NEXT:    add a4, a0, a4
+; CHECK-NEXT:    vle32.v v10, (a4)
+; CHECK-NEXT:    sub a2, a2, a3
+; CHECK-NEXT:    vredxor.vs v8, v10, v8
+; CHECK-NEXT:    add a1, a1, a3
+; CHECK-NEXT:    bnez a2, .LBB4_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    vmv.x.s a0, v8
+; CHECK-NEXT:    ret
+entry:
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %trip.count = phi i64 [ 1024, %entry ], [ %remaining.trip.count, %vector.body ]
+  %scalar.ind = phi i64 [ 0, %entry ], [ %next.ind, %vector.body ]
+  %red.phi = phi i32 [ 0, %entry ], [ %red, %vector.body ]
+  %evl = tail call i32 @llvm.experimental.get.vector.length.i64(i64 %trip.count, i32 4, i1 true)
+  %evl2 = zext i32 %evl to i64
+  %arrayidx6 = getelementptr inbounds i32, ptr %a, i64 %scalar.ind
+  %wide.load = tail call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %arrayidx6, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %red = tail call i32 @llvm.vp.reduce.xor.nxv4i32(i32 %red.phi, <vscale x 4 x i32> %wide.load, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %remaining.trip.count = sub nuw i64 %trip.count, %evl2
+  %next.ind = add i64 %scalar.ind, %evl2
+  %m = icmp eq i64 %remaining.trip.count, 0
+  br i1 %m, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret i32 %red
+}
+
+define i32 @vp_reduce_smax(ptr %a) {
+; CHECK-LABEL: vp_reduce_smax:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a1, 0
+; CHECK-NEXT:    lui a2, 524288
+; CHECK-NEXT:    vsetvli a3, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.s.x v8, a2
+; CHECK-NEXT:    li a2, 1024
+; CHECK-NEXT:  .LBB5_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vsetvli a3, a2, e32, m2, ta, ma
+; CHECK-NEXT:    slli a4, a1, 2
+; CHECK-NEXT:    add a4, a0, a4
+; CHECK-NEXT:    vle32.v v10, (a4)
+; CHECK-NEXT:    sub a2, a2, a3
+; CHECK-NEXT:    vredmax.vs v8, v10, v8
+; CHECK-NEXT:    add a1, a1, a3
+; CHECK-NEXT:    bnez a2, .LBB5_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    vmv.x.s a0, v8
+; CHECK-NEXT:    ret
+entry:
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %trip.count = phi i64 [ 1024, %entry ], [ %remaining.trip.count, %vector.body ]
+  %scalar.ind = phi i64 [ 0, %entry ], [ %next.ind, %vector.body ]
+  %red.phi = phi i32 [ -2147483648, %entry ], [ %red, %vector.body ]
+  %evl = tail call i32 @llvm.experimental.get.vector.length.i64(i64 %trip.count, i32 4, i1 true)
+  %evl2 = zext i32 %evl to i64
+  %arrayidx6 = getelementptr inbounds i32, ptr %a, i64 %scalar.ind
+  %wide.load = tail call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %arrayidx6, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %red = tail call i32 @llvm.vp.reduce.smax.nxv4i32(i32 %red.phi, <vscale x 4 x i32> %wide.load, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %remaining.trip.count = sub nuw i64 %trip.count, %evl2
+  %next.ind = add i64 %scalar.ind, %evl2
+  %m = icmp eq i64 %remaining.trip.count, 0
+  br i1 %m, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret i32 %red
+}
+
+define i32 @vp_reduce_smin(ptr %a) {
+; CHECK-LABEL: vp_reduce_smin:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a1, 0
+; CHECK-NEXT:    lui a2, 524288
+; CHECK-NEXT:    addi a2, a2, -1
+; CHECK-NEXT:    vsetvli a3, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.s.x v8, a2
+; CHECK-NEXT:    li a2, 1024
+; CHECK-NEXT:  .LBB6_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vsetvli a3, a2, e32, m2, ta, ma
+; CHECK-NEXT:    slli a4, a1, 2
+; CHECK-NEXT:    add a4, a0, a4
+; CHECK-NEXT:    vle32.v v10, (a4)
+; CHECK-NEXT:    sub a2, a2, a3
+; CHECK-NEXT:    vredmin.vs v8, v10, v8
+; CHECK-NEXT:    add a1, a1, a3
+; CHECK-NEXT:    bnez a2, .LBB6_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    vmv.x.s a0, v8
+; CHECK-NEXT:    ret
+entry:
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %trip.count = phi i64 [ 1024, %entry ], [ %remaining.trip.count, %vector.body ]
+  %scalar.ind = phi i64 [ 0, %entry ], [ %next.ind, %vector.body ]
+  %red.phi = phi i32 [ 2147483647, %entry ], [ %red, %vector.body ]
+  %evl = tail call i32 @llvm.experimental.get.vector.length.i64(i64 %trip.count, i32 4, i1 true)
+  %evl2 = zext i32 %evl to i64
+  %arrayidx6 = getelementptr inbounds i32, ptr %a, i64 %scalar.ind
+  %wide.load = tail call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %arrayidx6, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %red = tail call i32 @llvm.vp.reduce.smin.nxv4i32(i32 %red.phi, <vscale x 4 x i32> %wide.load, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %remaining.trip.count = sub nuw i64 %trip.count, %evl2
+  %next.ind = add i64 %scalar.ind, %evl2
+  %m = icmp eq i64 %remaining.trip.count, 0
+  br i1 %m, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret i32 %red
+}
+
+define i32 @vp_reduce_umax(ptr %a) {
+; CHECK-LABEL: vp_reduce_umax:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a1, 0
+; CHECK-NEXT:    vsetvli a2, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.s.x v8, zero
+; CHECK-NEXT:    li a2, 1024
+; CHECK-NEXT:  .LBB7_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vsetvli a3, a2, e32, m2, ta, ma
+; CHECK-NEXT:    slli a4, a1, 2
+; CHECK-NEXT:    add a4, a0, a4
+; CHECK-NEXT:    vle32.v v10, (a4)
+; CHECK-NEXT:    sub a2, a2, a3
+; CHECK-NEXT:    vredmaxu.vs v8, v10, v8
+; CHECK-NEXT:    add a1, a1, a3
+; CHECK-NEXT:    bnez a2, .LBB7_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    vmv.x.s a0, v8
+; CHECK-NEXT:    ret
+entry:
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %trip.count = phi i64 [ 1024, %entry ], [ %remaining.trip.count, %vector.body ]
+  %scalar.ind = phi i64 [ 0, %entry ], [ %next.ind, %vector.body ]
+  %red.phi = phi i32 [ 0, %entry ], [ %red, %vector.body ]
+  %evl = tail call i32 @llvm.experimental.get.vector.length.i64(i64 %trip.count, i32 4, i1 true)
+  %evl2 = zext i32 %evl to i64
+  %arrayidx6 = getelementptr inbounds i32, ptr %a, i64 %scalar.ind
+  %wide.load = tail call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %arrayidx6, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %red = tail call i32 @llvm.vp.reduce.umax.nxv4i32(i32 %red.phi, <vscale x 4 x i32> %wide.load, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %remaining.trip.count = sub nuw i64 %trip.count, %evl2
+  %next.ind = add i64 %scalar.ind, %evl2
+  %m = icmp eq i64 %remaining.trip.count, 0
+  br i1 %m, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret i32 %red
+}
+
+define i32 @vp_reduce_umin(ptr %a) {
+; CHECK-LABEL: vp_reduce_umin:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a1, 0
+; CHECK-NEXT:    lui a2, 524288
+; CHECK-NEXT:    vsetvli a3, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.s.x v8, a2
+; CHECK-NEXT:    li a2, 1024
+; CHECK-NEXT:  .LBB8_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vsetvli a3, a2, e32, m2, ta, ma
+; CHECK-NEXT:    slli a4, a1, 2
+; CHECK-NEXT:    add a4, a0, a4
+; CHECK-NEXT:    vle32.v v10, (a4)
+; CHECK-NEXT:    sub a2, a2, a3
+; CHECK-NEXT:    vredminu.vs v8, v10, v8
+; CHECK-NEXT:    add a1, a1, a3
+; CHECK-NEXT:    bnez a2, .LBB8_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    vmv.x.s a0, v8
+; CHECK-NEXT:    ret
+entry:
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %trip.count = phi i64 [ 1024, %entry ], [ %remaining.trip.count, %vector.body ]
+  %scalar.ind = phi i64 [ 0, %entry ], [ %next.ind, %vector.body ]
+  %red.phi = phi i32 [ -2147483648, %entry ], [ %red, %vector.body ]
+  %evl = tail call i32 @llvm.experimental.get.vector.length.i64(i64 %trip.count, i32 4, i1 true)
+  %evl2 = zext i32 %evl to i64
+  %arrayidx6 = getelementptr inbounds i32, ptr %a, i64 %scalar.ind
+  %wide.load = tail call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %arrayidx6, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %red = tail call i32 @llvm.vp.reduce.umin.nxv4i32(i32 %red.phi, <vscale x 4 x i32> %wide.load, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %remaining.trip.count = sub nuw i64 %trip.count, %evl2
+  %next.ind = add i64 %scalar.ind, %evl2
+  %m = icmp eq i64 %remaining.trip.count, 0
+  br i1 %m, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret i32 %red
+}
+
+define float @vp_reduce_fadd(ptr %a) {
+; CHECK-LABEL: vp_reduce_fadd:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a1, 0
+; CHECK-NEXT:    vsetvli a2, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.s.x v8, zero
+; CHECK-NEXT:    li a2, 1024
+; CHECK-NEXT:  .LBB9_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vsetvli a3, a2, e32, m2, ta, ma
+; CHECK-NEXT:    slli a4, a1, 2
+; CHECK-NEXT:    add a4, a0, a4
+; CHECK-NEXT:    vle32.v v10, (a4)
+; CHECK-NEXT:    sub a2, a2, a3
+; CHECK-NEXT:    vfredosum.vs v8, v10, v8
+; CHECK-NEXT:    add a1, a1, a3
+; CHECK-NEXT:    bnez a2, .LBB9_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:    ret
+entry:
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %trip.count = phi i64 [ 1024, %entry ], [ %remaining.trip.count, %vector.body ]
+  %scalar.ind = phi i64 [ 0, %entry ], [ %next.ind, %vector.body ]
+  %red.phi = phi float [ 0.000000e+00, %entry ], [ %red, %vector.body ]
+  %evl = tail call i32 @llvm.experimental.get.vector.length.i64(i64 %trip.count, i32 4, i1 true)
+  %evl2 = zext i32 %evl to i64
+  %arrayidx6 = getelementptr inbounds float, ptr %a, i64 %scalar.ind
+  %wide.load = tail call <vscale x 4 x float> @llvm.vp.load.nxv4f32.p0(ptr %arrayidx6, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %red = tail call float @llvm.vp.reduce.fadd.nxv4f32(float %red.phi, <vscale x 4 x float> %wide.load, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %remaining.trip.count = sub nuw i64 %trip.count, %evl2
+  %next.ind = add i64 %scalar.ind, %evl2
+  %m = icmp eq i64 %remaining.trip.count, 0
+  br i1 %m, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret float %red
+}
+
+define float @vp_reduce_fmax(ptr %a) {
+; CHECK-LABEL: vp_reduce_fmax:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a1, 0
+; CHECK-NEXT:    vsetvli a2, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.s.x v8, zero
+; CHECK-NEXT:    li a2, 1024
+; CHECK-NEXT:  .LBB10_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vsetvli a3, a2, e32, m2, ta, ma
+; CHECK-NEXT:    slli a4, a1, 2
+; CHECK-NEXT:    add a4, a0, a4
+; CHECK-NEXT:    vle32.v v10, (a4)
+; CHECK-NEXT:    sub a2, a2, a3
+; CHECK-NEXT:    vfredmax.vs v8, v10, v8
+; CHECK-NEXT:    add a1, a1, a3
+; CHECK-NEXT:    bnez a2, .LBB10_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:    ret
+entry:
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %trip.count = phi i64 [ 1024, %entry ], [ %remaining.trip.count, %vector.body ]
+  %scalar.ind = phi i64 [ 0, %entry ], [ %next.ind, %vector.body ]
+  %red.phi = phi float [ 0.000000e+00, %entry ], [ %red, %vector.body ]
+  %evl = tail call i32 @llvm.experimental.get.vector.length.i64(i64 %trip.count, i32 4, i1 true)
+  %evl2 = zext i32 %evl to i64
+  %arrayidx6 = getelementptr inbounds float, ptr %a, i64 %scalar.ind
+  %wide.load = tail call <vscale x 4 x float> @llvm.vp.load.nxv4f32.p0(ptr %arrayidx6, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %red = tail call float @llvm.vp.reduce.fmax.nxv4f32(float %red.phi, <vscale x 4 x float> %wide.load, <vscale x 4 x i1> splat (i1 true), i32 %evl)
+  %remaining.trip.count = sub nuw i64 %trip.count, %evl2
+  %next.ind = add i64 %scalar.ind, %evl2
+  %m = icmp eq i64 %remaining.trip.count, 0
+  br i1 %m, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:                                 ; preds = %vector.body
+  ret float %red
+}
+
+define float @vp_reduce_fmin(ptr %a) {
+; CHECK-LABEL: vp_reduce_fmin:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a1, 0
+; CHECK-NEXT:    vsetvli a2, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.s.x v8, zero
+; CHECK-NEXT:    li a2, 1024
+; CHECK-NEXT:  .LBB11_1: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vsetvli a3, a2, e32, m2, ta, ma
+; CHECK-NEXT:    slli a4, a1, 2
+; CHECK-NEXT:    add a4, a0, a4
+; CHECK-NEXT:    vle32.v v10, (a4)
+; CHECK-NEXT:    sub a2, a2, a3
+; CHECK-NEXT:    vfredmin.vs v8, v10, v8
+; CHECK-NEXT:    add a1, a1, a3
+; CHECK-NEXT:    bnez a2, .LBB11_1
+; CHECK-NEXT:  # %bb.2: # %for.cond.cleanup
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:    ret
+entry:
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %trip.count = phi i64 [ 1024, %entry ], [ %r...
[truncated]

@topperc (Collaborator) commented Mar 19, 2025

The original patch only supported ordered fadd. Why does this need to support more than just vp.reduce.fadd?

@topperc (Collaborator) commented Mar 19, 2025

> The original patch only supported ordered fadd. Why does this need to support more than just vp.reduce.fadd?

Oh I guess it's because of the scalar input that the regular reduce intrinsics don't have?

@NexMing (Contributor, Author) commented Mar 19, 2025

> > The original patch only supported ordered fadd. Why does this need to support more than just vp.reduce.fadd?
>
> Oh I guess it's because of the scalar input that the regular reduce intrinsics don't have?

That's right. I think this optimization will not change the order of operations.
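For context, this is the signature difference being referred to; these are standard LangRef-style declarations, shown only to illustrate which intrinsics carry a scalar start operand:

```llvm
; The plain integer reductions take no scalar start value:
declare i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32>)

; Ordered fadd does, which is what the original patch (15b0fab) handled:
declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)

; Every llvm.vp.reduce.* intrinsic takes one (start value, vector, mask, evl),
; so any of them can feed a loop-carried scalar phi:
declare i32 @llvm.vp.reduce.add.nxv4i32(i32, <vscale x 4 x i32>, <vscale x 4 x i1>, i32)
```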

yanming added 2 commits March 21, 2025 15:56
        This patch is the vector predication version of commit 15b0fab
@lukel97 (Contributor) left a review comment

LGTM, thanks

Can you update the comment above visitIntrinsicInst to say something like

// This affects ordered fadd reductions and VP reductions that have a scalar start value.
// This tries to vectorize any scalar phis that feed into these reductions:

@NexMing NexMing requested a review from lukel97 March 26, 2025 05:22
@NexMing (Contributor, Author) commented Mar 31, 2025

Could someone help merge this PR?

@lukel97 lukel97 merged commit 5c65a32 into llvm:main Mar 31, 2025
11 checks passed
@lukel97 (Contributor) commented Mar 31, 2025

> Could someone help merge this PR?

I'm just noticing it looks like you're a member of the LLVM organization, does that mean you already have commit access?

@NexMing (Contributor, Author) commented Mar 31, 2025

> > Could someone help merge this PR?
>
> I'm just noticing it looks like you're a member of the LLVM organization, does that mean you already have commit access?

I applied for permissions back when LLVM was using Phabricator, but they may have been revoked due to long-term inactivity. Do I need to reapply for the permissions now?

@NexMing NexMing deleted the vp-reduce-phi branch March 31, 2025 08:48
@llvm-ci (Collaborator) commented Mar 31, 2025

LLVM Buildbot has detected a new failure on builder clang-aarch64-sve-vla running on linaro-g3-04 while building llvm at step 7 "ninja check 1".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/17/builds/6891

Here is the relevant piece of the build log for reference:
Step 7 (ninja check 1) failure: stage 1 checked (failure)
...
llvm-lit: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/llvm/llvm/utils/lit/lit/llvm/config.py:520: note: using ld64.lld: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/stage1/bin/ld64.lld
llvm-lit: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/llvm/llvm/utils/lit/lit/llvm/config.py:520: note: using wasm-ld: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/stage1/bin/wasm-ld
llvm-lit: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/llvm/llvm/utils/lit/lit/discovery.py:276: warning: input '/home/tcwg-buildbot/worker/clang-aarch64-sve-vla/stage1/runtimes/runtimes-bins/compiler-rt/test/interception/Unit' contained no tests
llvm-lit: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/llvm/llvm/utils/lit/lit/discovery.py:276: warning: input '/home/tcwg-buildbot/worker/clang-aarch64-sve-vla/stage1/runtimes/runtimes-bins/compiler-rt/test/sanitizer_common/Unit' contained no tests
llvm-lit: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/llvm/llvm/utils/lit/lit/llvm/config.py:520: note: using ld.lld: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/stage1/bin/ld.lld
llvm-lit: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/llvm/llvm/utils/lit/lit/llvm/config.py:520: note: using lld-link: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/stage1/bin/lld-link
llvm-lit: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/llvm/llvm/utils/lit/lit/llvm/config.py:520: note: using ld64.lld: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/stage1/bin/ld64.lld
llvm-lit: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/llvm/llvm/utils/lit/lit/llvm/config.py:520: note: using wasm-ld: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla/stage1/bin/wasm-ld
-- Testing: 97613 tests, 64 workers --
UNRESOLVED: Flang :: Driver/slp-vectorize.ll (1 of 97613)
******************** TEST 'Flang :: Driver/slp-vectorize.ll' FAILED ********************
Test has no 'RUN:' line
********************
PASS: LLVM :: CodeGen/RISCV/rvv/expandload.ll (2 of 97613)
PASS: libFuzzer-aarch64-default-Linux :: swap-cmp.test (3 of 97613)
PASS: UBSan-AddressSanitizer-aarch64 :: TestCases/ImplicitConversion/signed-integer-truncation-ignorelist.c (4 of 97613)
PASS: Clang :: CodeGen/X86/avx2-builtins.c (5 of 97613)
PASS: LLVM :: CodeGen/AMDGPU/llvm.amdgcn.permlane.ll (6 of 97613)
PASS: UBSan-AddressSanitizer-aarch64 :: TestCases/ImplicitConversion/unsigned-integer-truncation-ignorelist.c (7 of 97613)
PASS: Clang :: OpenMP/target_simd_codegen_registration.cpp (8 of 97613)
PASS: Clang :: OpenMP/target_teams_distribute_codegen_registration.cpp (9 of 97613)
PASS: Clang :: Headers/arm-neon-header.c (10 of 97613)
PASS: Clang :: OpenMP/target_parallel_for_codegen_registration.cpp (11 of 97613)
PASS: UBSan-ThreadSanitizer-aarch64 :: TestCases/ImplicitConversion/signed-integer-truncation-ignorelist.c (12 of 97613)
PASS: Clang :: Driver/x86-target-features.c (13 of 97613)
PASS: Clang :: CodeGen/X86/rot-intrinsics.c (14 of 97613)
PASS: Clang :: OpenMP/target_teams_distribute_simd_codegen_registration.cpp (15 of 97613)
PASS: libFuzzer-aarch64-default-Linux :: value-profile-cmp.test (16 of 97613)
PASS: Clang :: OpenMP/target_parallel_for_simd_codegen_registration.cpp (17 of 97613)
PASS: Clang :: Driver/clang_f_opts.c (18 of 97613)
PASS: UBSan-ThreadSanitizer-aarch64 :: TestCases/ImplicitConversion/unsigned-integer-truncation-ignorelist.c (19 of 97613)
PASS: Clang :: CodeGen/X86/sse2-builtins.c (20 of 97613)
PASS: libFuzzer-aarch64-default-Linux :: large.test (21 of 97613)
PASS: Clang :: Analysis/runtime-regression.c (22 of 97613)
PASS: libFuzzer-aarch64-default-Linux :: msan.test (23 of 97613)
PASS: Clang :: CodeGen/X86/avx-builtins.c (24 of 97613)
PASS: Clang :: Driver/linux-ld.c (25 of 97613)
PASS: ThreadSanitizer-aarch64 :: signal_thread.cpp (26 of 97613)
PASS: SanitizerCommon-lsan-aarch64-Linux :: Linux/signal_segv_handler.cpp (27 of 97613)
PASS: LLVM :: CodeGen/ARM/build-attributes.ll (28 of 97613)
PASS: HWAddressSanitizer-aarch64 :: TestCases/Linux/create-thread-stress.cpp (29 of 97613)
PASS: SanitizerCommon-hwasan-aarch64-Linux :: Linux/signal_segv_handler.cpp (30 of 97613)
PASS: ThreadSanitizer-aarch64 :: deadlock_detector_stress_test.cpp (31 of 97613)
PASS: SanitizerCommon-tsan-aarch64-Linux :: Linux/signal_segv_handler.cpp (32 of 97613)
PASS: LLVM :: CodeGen/AMDGPU/sched-group-barrier-pipeline-solver.mir (33 of 97613)
PASS: Clang :: OpenMP/target_teams_distribute_parallel_for_simd_codegen_registration.cpp (34 of 97613)
PASS: SanitizerCommon-msan-aarch64-Linux :: Linux/signal_segv_handler.cpp (35 of 97613)
PASS: SanitizerCommon-asan-aarch64-Linux :: Linux/signal_segv_handler.cpp (36 of 97613)
PASS: Clang :: Preprocessor/predefined-arch-macros.c (37 of 97613)

@lukel97 (Contributor) commented Mar 31, 2025

> > > Could someone help merge this PR?
> >
> > I'm just noticing it looks like you're a member of the LLVM organization, does that mean you already have commit access?
>
> I applied for permissions back when LLVM was using Phabricator, but they may have been revoked due to long-term inactivity. Do I need to reapply for the permissions now?

Oh yes I see you've been moved to the triagers team. I think according to the RFC you can create a new GitHub issue to restore commit access?

@tstellar I couldn't find anything in https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access about regaining commit access as described in the RFC; should that be documented somewhere?

SchrodingerZhu pushed a commit to SchrodingerZhu/llvm-project that referenced this pull request Mar 31, 2025
Co-authored-by: yanming <[email protected]>