[AMDGPU] Use wider loop lowering type for LowerMemIntrinsics #112332

Merged (3 commits) on Oct 28, 2024

Changes from all commits

38 changes: 29 additions & 9 deletions llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -75,6 +75,13 @@ static cl::opt<size_t> InlineMaxBB(
cl::desc("Maximum number of BBs allowed in a function after inlining"
" (compile time constraint)"));

// This default unroll factor is based on microbenchmarks on gfx1030.
static cl::opt<unsigned> MemcpyLoopUnroll(
"amdgpu-memcpy-loop-unroll",
cl::desc("Unroll factor (affecting 4x32-bit operations) to use for memory "
"operations when lowering memcpy as a loop"),
cl::init(16), cl::Hidden);

static bool dependsOnLocalPhi(const Loop *L, const Value *Cond,
unsigned Depth = 0) {
const Instruction *I = dyn_cast<Instruction>(Cond);
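
As a rough illustration of what the new knob controls (a sketch, not part of the patch): the unroll factor scales the base <4 x i32> access used for the lowered copy loop, so the default of 16 yields a <64 x i32> loop type, i.e. 256 bytes per loop iteration; a value of 0 falls back to the previous <4 x i32> behavior, and the widening only applies when the copy length is a constant (see the second hunk below).

```cpp
#include <cstdio>

// Illustrative arithmetic only, not LLVM code: bytes moved per iteration of
// the lowered copy loop for a given -amdgpu-memcpy-loop-unroll value.
int main() {
  unsigned MemcpyLoopUnroll = 16; // default chosen in this patch
  unsigned I32EltsInVector = 4;   // base 16-byte <4 x i32> access
  unsigned EltsPerIter = MemcpyLoopUnroll * I32EltsInVector; // 64 x i32
  unsigned BytesPerIter = EltsPerIter * 4;                   // 256 bytes
  std::printf("<%u x i32> loop type, %u bytes per iteration\n", EltsPerIter,
              BytesPerIter);
  return 0;
}
```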
@@ -409,13 +416,8 @@ int64_t GCNTTIImpl::getMaxMemIntrinsicInlineSizeThreshold() const {
return 1024;
}

// FIXME: Really we would like to issue multiple 128-bit loads and stores per
// iteration. Should we report a larger size and let it legalize?
//
// FIXME: Should we use narrower types for local/region, or account for when
// unaligned access is legal?
//
// FIXME: This could use fine tuning and microbenchmarks.
Type *GCNTTIImpl::getMemcpyLoopLoweringType(
LLVMContext &Context, Value *Length, unsigned SrcAddrSpace,
unsigned DestAddrSpace, Align SrcAlign, Align DestAlign,
@@ -442,17 +444,29 @@ Type *GCNTTIImpl::getMemcpyLoopLoweringType(
return FixedVectorType::get(Type::getInt32Ty(Context), 2);
}

// Global memory works best with 16-byte accesses. Private memory will also
// hit this, although they'll be decomposed.
return FixedVectorType::get(Type::getInt32Ty(Context), 4);
// Global memory works best with 16-byte accesses.
// If the operation has a fixed known length that is large enough, it is
// worthwhile to return an even wider type and let legalization lower it into
// multiple accesses, effectively unrolling the memcpy loop. Private memory
// also hits this, although accesses may be decomposed.
//
// Don't unroll if Length is not a constant, since unrolling leads to worse
// performance for length values that are smaller or slightly larger than the
// total size of the type returned here. Mitigating that would require a more
// complex lowering for variable-length memcpy and memmove.
unsigned I32EltsInVector = 4;
if (MemcpyLoopUnroll > 0 && isa<ConstantInt>(Length))
return FixedVectorType::get(Type::getInt32Ty(Context),
MemcpyLoopUnroll * I32EltsInVector);

return FixedVectorType::get(Type::getInt32Ty(Context), I32EltsInVector);
}
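
For a concrete feel of the effect, here is a sketch (hypothetical numbers, not taken from the patch) of how a known-size memcpy splits into wide loop iterations plus a residual that getMemcpyLoopResidualLoweringType then breaks down, assuming the default 256-byte loop type returned above.

```cpp
#include <cstdint>
#include <cstdio>

// Sketch only: split of a hypothetical 1000-byte constant-length memcpy into
// iterations of the 256-byte (<64 x i32>) loop type plus a residual.
int main() {
  const uint64_t Length = 1000;          // hypothetical known copy length
  const uint64_t LoopBytes = 16 * 4 * 4; // unroll * elts * 4 bytes = 256
  uint64_t Iterations = Length / LoopBytes; // 3 full loop iterations
  uint64_t Residual = Length % LoopBytes;   // 232 bytes left for the residual
  std::printf("%llu loop iterations, %llu residual bytes\n",
              (unsigned long long)Iterations, (unsigned long long)Residual);
  return 0;
}
```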

void GCNTTIImpl::getMemcpyLoopResidualLoweringType(
SmallVectorImpl<Type *> &OpsOut, LLVMContext &Context,
unsigned RemainingBytes, unsigned SrcAddrSpace, unsigned DestAddrSpace,
Align SrcAlign, Align DestAlign,
std::optional<uint32_t> AtomicCpySize) const {
assert(RemainingBytes < 16);

if (AtomicCpySize)
BaseT::getMemcpyLoopResidualLoweringType(
@@ -462,6 +476,12 @@ void GCNTTIImpl::getMemcpyLoopResidualLoweringType(
Align MinAlign = std::min(SrcAlign, DestAlign);

if (MinAlign != Align(2)) {
Type *I32x4Ty = FixedVectorType::get(Type::getInt32Ty(Context), 4);
while (RemainingBytes >= 16) {
OpsOut.push_back(I32x4Ty);
RemainingBytes -= 16;
}

Type *I64Ty = Type::getInt64Ty(Context);
while (RemainingBytes >= 8) {
OpsOut.push_back(I64Ty);
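
The residual path now also emits 16-byte chunks before falling back to narrower types. Below is a sketch of that greedy decomposition for the well-aligned case (minimum alignment other than 2); the i32 step and anything narrower live in the unchanged tail of the function and are an assumption here, and the 232-byte residual continues the hypothetical example above.

```cpp
#include <cstdio>

// Sketch only, not the LLVM implementation: greedy breakdown of the residual
// bytes into <4 x i32>, i64, and (assumed) i32 accesses.
int main() {
  unsigned RemainingBytes = 232; // residual of the hypothetical 1000-byte copy
  const unsigned ChunkBytes[] = {16, 8, 4}; // <4 x i32>, i64, i32
  for (unsigned Size : ChunkBytes) {
    unsigned Count = RemainingBytes / Size;
    RemainingBytes -= Count * Size;
    if (Count)
      std::printf("%u x %u-byte access(es)\n", Count, Size);
  }
  std::printf("%u byte(s) left for narrower types\n", RemainingBytes);
  return 0;
}
```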