[MLIR][XeGPU] Allow some nd ops to have argument shapes mismatch for … #120566


Merged

Conversation

kurapov-peter
Contributor

…the distributed IR case.

This patch allows nd_load and nd_store to preserve the tensor descriptor shape during distribution to SIMT. The validation now expects the distributed instruction to retain the sg_map attribute and uses it to verify the consistency.

Some background:
We've been discussing the appropriate way of distributing xegpu operations and there are multiple possible approaches. I'm listing the three main ones below, commenting on their properties.

  1. Distribute both the tensor descriptor type and the vector argument during the distribution transformation. This seems like a natural way of applying the transformation. The resulting IR describes what part of the computation and memory a lane "owns", and plays by the rules of the infrastructure. This is also similar to what happens to non-nd xegpu ops, so we get the benefit of a consistent representation. The downside of the approach is that when we lower the resulting operation to an llvm intrinsic, the information about the original non-distributed tensor type must be restored (the intrinsic is cooperative and expects the complete size of the loaded/stored tile). This can technically be done (e.g., using sg_map), but somewhat breaks the promise of XeGPU to represent HW abstractions 1:1 (arguably, imo). Allowing distributed tensor descriptors also means users would be able to create and use them inappropriately, so some UB cases are inevitably introduced.
  2. Distribute only the vector type, and preserve the original tensor descriptor type. This approach addresses the concerns above by adding some flexibility to nd ops validation (this patch). By preserving the descriptor, any valid IR can produce some reasonable lowering. This approach comes at the price of a vague IR-to-HW-constructs mapping: now the IR no longer represents something a single logical thread in SIMT owns and acts upon, so the abstraction layering is still broken. This can potentially have some unexpected implications on other transformations and analyses that we don't see today (e.g., nd_load and nd_store are ops that should have memory side effects, which are not implemented at the moment; it is unclear to me what implications this violation of "ownership" may have).
  3. Perform the distribution during lowering to a lower-level dialect. Instead of choosing between the two unappealing options, we may resolve the layering properly by converting the tensor descriptor from the XeGPU dialect to an appropriate lower-level construct that deals with scalars instead of vectors. This way we can keep all the promises of XeGPU intact and avoid the somewhat dubious type issues. The downside, of course, is that we are putting what is essentially a transformation into a conversion, which is also not ideal.

My personal opinion is that we should take the first approach and modify it so that the creation of a tensor descriptor survives the distribution but the instructions use views into the descriptor; then the IR describes "what a logical thread does" and there are no type mismatches (introduce some xegpu view-into-tensor-descriptor op, so that it would work for both the ND and scattered cases in the distribution logic). That said, I'm OK with putting this patch in for experiments with the second approach as it doesn't break anything.
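To make the second option concrete, here is a minimal sketch of a distributed nd load under this patch (mirroring the lit test added here, with cache hints omitted): the tensor descriptor keeps its full 8x16 shape, while the per-lane result vector shrinks to 8x1 because wi_layout = [1, 16] splits the 16-wide dimension across 16 lanes.

// Descriptor keeps the cooperative 8x16 shape; sg_map records the distribution.
%tdesc = xegpu.create_nd_tdesc %src[0, 0] : memref<24x32xf32>
  -> !xegpu.tensor_desc<8x16xf32, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
// Per-lane result: 8x1, validated against the descriptor shape scaled by wi_layout.
%val = xegpu.load_nd %tdesc
  : !xegpu.tensor_desc<8x16xf32, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>> -> vector<8x1xf32>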

@Jianhui-Li, @charithaintc, @chencha3, @adam-smnk, @rengolin

@llvmbot
Member

llvmbot commented Dec 19, 2024

@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-gpu

Author: Petr Kurapov (kurapov-peter)

Full diff: https://github.com/llvm/llvm-project/pull/120566.diff

4 Files Affected:

  • (modified) mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td (+1-2)
  • (modified) mlir/lib/Dialect/XeGPU/IR/XeGPUOps.cpp (+53-20)
  • (modified) mlir/test/Dialect/XeGPU/XeGPUOps.mlir (+24)
  • (modified) mlir/test/Dialect/XeGPU/invalid.mlir (+6-6)
diff --git a/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td b/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
index 5910aa3f7f2dae..f3ffbd0f5a027d 100644
--- a/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
+++ b/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
@@ -327,8 +327,7 @@ def XeGPU_LoadNdOp : XeGPU_Op<"load_nd", [AllElementTypesMatch<["value", "Tensor
   let hasVerifier = 1;
 }
 
-def XeGPU_StoreNdOp : XeGPU_Op<"store_nd", [AllShapesMatch<["value", "TensorDesc"]>,
-                                       AllElementTypesMatch<["value", "TensorDesc"]>]> {
+def XeGPU_StoreNdOp : XeGPU_Op<"store_nd", [AllElementTypesMatch<["value", "TensorDesc"]>]> {
   let summary = "stores a n-D block register region back to memory, currently only supports 2D";
 
   let description = [{
diff --git a/mlir/lib/Dialect/XeGPU/IR/XeGPUOps.cpp b/mlir/lib/Dialect/XeGPU/IR/XeGPUOps.cpp
index 9d3c4366a7bd50..721cba70520758 100644
--- a/mlir/lib/Dialect/XeGPU/IR/XeGPUOps.cpp
+++ b/mlir/lib/Dialect/XeGPU/IR/XeGPUOps.cpp
@@ -73,6 +73,29 @@ static bool isWriteHintOrNone(const CachePolicyAttr &attr) {
          kind == CachePolicy::WRITE_BACK || kind == CachePolicy::WRITE_THROUGH;
 }
 
+// Validations for nd instruction arguments is successful if any of these are
+// true:
+// - tensor descriptor and the output vector shapes exactly match.
+// - tensor descriptor has a sg_map attribute and the distributed vector shape
+//   matches the tensor descriptor shape when scaled using sg_map factors on
+//   each dimension.
+static bool isArgShapesValid(ArrayRef<int64_t> descShape,
+                             ArrayRef<int64_t> valShape, SGMapAttr sgMap) {
+  if (descShape == valShape)
+    return true;
+
+  if (!sgMap)
+    return false;
+
+  for (const auto &[factor, dim, expected] :
+       llvm::zip_equal(sgMap.getWiLayout(), valShape, descShape)) {
+    if (factor * dim != expected)
+      return false;
+  }
+
+  return true;
+}
+
 //===----------------------------------------------------------------------===//
 // XeGPU_CreateNdDescOp
 //===----------------------------------------------------------------------===//
@@ -210,13 +233,13 @@ LogicalResult PrefetchNdOp::verify() {
     return emitOpError("Expects a non-scattered TensorDesc.\n");
 
   if (!isReadHintOrNone(getL1HintAttr()))
-    return emitOpError("invlid l1_hint: ") << getL1HintAttr();
+    return emitOpError("invalid l1_hint: ") << getL1HintAttr();
 
   if (!isReadHintOrNone(getL2HintAttr()))
-    return emitOpError("invlid l2_hint: ") << getL2HintAttr();
+    return emitOpError("invalid l2_hint: ") << getL2HintAttr();
 
   if (!isReadHintOrNone(getL3HintAttr()))
-    return emitOpError("invlid l3_hint: ") << getL3HintAttr();
+    return emitOpError("invalid l3_hint: ") << getL3HintAttr();
 
   return success();
 }
@@ -238,13 +261,13 @@ LogicalResult LoadNdOp::verify() {
     return emitOpError("Invalid result, it should be a VectorType.\n");
 
   if (!isReadHintOrNone(getL1HintAttr()))
-    return emitOpError("invlid l1_hint: ") << getL1HintAttr();
+    return emitOpError("invalid l1_hint: ") << getL1HintAttr();
 
   if (!isReadHintOrNone(getL2HintAttr()))
-    return emitOpError("invlid l2_hint: ") << getL2HintAttr();
+    return emitOpError("invalid l2_hint: ") << getL2HintAttr();
 
   if (!isReadHintOrNone(getL3HintAttr()))
-    return emitOpError("invlid l3_hint: ") << getL3HintAttr();
+    return emitOpError("invalid l3_hint: ") << getL3HintAttr();
 
   auto array_len = tdescTy.getArrayLength();
   auto tdescShape = getShapeOf(tdescTy);
@@ -280,8 +303,9 @@ LogicalResult LoadNdOp::verify() {
     auto it = tdescShape.begin();
     tdescShape.insert(it, array_len);
   }
+  auto sgMap = tdescTy.getSGMapAttr();
 
-  if (tdescShape != valueShape)
+  if (!isArgShapesValid(tdescShape, valueShape, sgMap))
     return emitOpError() << "Result shape doesn't match TensorDesc shape."
                          << "The expected shape is " << makeString(tdescShape)
                          << ". But the given shape is "
@@ -303,17 +327,26 @@ LogicalResult StoreNdOp::verify() {
     return emitOpError("Expects a non-scattered TensorDesc.\n");
 
   if (!valTy)
-    return emitOpError("Exepcting a VectorType result.\n");
+    return emitOpError("Expecting a VectorType result.\n");
 
   if (!isWriteHintOrNone(getL1HintAttr()))
-    return emitOpError("invlid l1_hint: ") << getL1HintAttr();
+    return emitOpError("invalid l1_hint: ") << getL1HintAttr();
 
   if (!isWriteHintOrNone(getL2HintAttr()))
-    return emitOpError("invlid l2_hint: ") << getL2HintAttr();
+    return emitOpError("invalid l2_hint: ") << getL2HintAttr();
 
   if (!isWriteHintOrNone(getL3HintAttr()))
-    return emitOpError("invlid l3_hint: ") << getL3HintAttr();
+    return emitOpError("invalid l3_hint: ") << getL3HintAttr();
+
+  auto tdescShape = getShapeOf(dstTy);
+  auto valueShape = getShapeOf(valTy);
+  auto sgMap = dstTy.getSGMapAttr();
 
+  if (!isArgShapesValid(tdescShape, valueShape, sgMap))
+    return emitOpError() << "Result shape doesn't match TensorDesc shape."
+                         << "The expected shape is " << makeString(tdescShape)
+                         << ". But the given shape is "
+                         << makeString(valueShape) << ".\n";
   return success();
 }
 
@@ -423,13 +456,13 @@ LogicalResult PrefetchOp::verify() {
     return emitOpError("Expects a scattered TensorDesc.\n");
 
   if (!isReadHintOrNone(getL1HintAttr()))
-    return emitOpError("invlid l1_hint: ") << getL1HintAttr();
+    return emitOpError("invalid l1_hint: ") << getL1HintAttr();
 
   if (!isReadHintOrNone(getL2HintAttr()))
-    return emitOpError("invlid l2_hint: ") << getL2HintAttr();
+    return emitOpError("invalid l2_hint: ") << getL2HintAttr();
 
   if (!isReadHintOrNone(getL3HintAttr()))
-    return emitOpError("invlid l3_hint: ") << getL3HintAttr();
+    return emitOpError("invalid l3_hint: ") << getL3HintAttr();
 
   return success();
 }
@@ -446,13 +479,13 @@ LogicalResult LoadGatherOp::verify() {
     return emitOpError("Expects a scattered TensorDesc.\n");
 
   if (!isReadHintOrNone(getL1HintAttr()))
-    return emitOpError("invlid l1_hint: ") << getL1HintAttr();
+    return emitOpError("invalid l1_hint: ") << getL1HintAttr();
 
   if (!isReadHintOrNone(getL2HintAttr()))
-    return emitOpError("invlid l2_hint: ") << getL2HintAttr();
+    return emitOpError("invalid l2_hint: ") << getL2HintAttr();
 
   if (!isReadHintOrNone(getL3HintAttr()))
-    return emitOpError("invlid l3_hint: ") << getL3HintAttr();
+    return emitOpError("invalid l3_hint: ") << getL3HintAttr();
 
   auto tdescElemTy = tdescTy.getElementType();
   auto valueElemTy = getElementType();
@@ -490,13 +523,13 @@ LogicalResult StoreScatterOp::verify() {
     return emitOpError("Expects a scattered TensorDesc.\n");
 
   if (!isWriteHintOrNone(getL1HintAttr()))
-    return emitOpError("invlid l1_hint: ") << getL1HintAttr();
+    return emitOpError("invalid l1_hint: ") << getL1HintAttr();
 
   if (!isWriteHintOrNone(getL2HintAttr()))
-    return emitOpError("invlid l2_hint: ") << getL2HintAttr();
+    return emitOpError("invalid l2_hint: ") << getL2HintAttr();
 
   if (!isWriteHintOrNone(getL3HintAttr()))
-    return emitOpError("invlid l3_hint: ") << getL3HintAttr();
+    return emitOpError("invalid l3_hint: ") << getL3HintAttr();
 
   auto maskTy = getMaskType();
   auto valueTy = getValueType();
diff --git a/mlir/test/Dialect/XeGPU/XeGPUOps.mlir b/mlir/test/Dialect/XeGPU/XeGPUOps.mlir
index a4587faa3345cb..d7174a489888a4 100644
--- a/mlir/test/Dialect/XeGPU/XeGPUOps.mlir
+++ b/mlir/test/Dialect/XeGPU/XeGPUOps.mlir
@@ -86,6 +86,17 @@ gpu.func @test_load_nd_vc_2(%src: memref<8x16xf16>) {
   gpu.return
 }
 
+// load_nd args may have different shapes, validated against sg_map
+// CHECK: func @test_load_nd_vc_3(%[[arg0:.*]]: memref<24x32xf32>) {
+gpu.func @test_load_nd_vc_3(%src: memref<24x32xf32>) {
+  // CHECK: %[[R0:.*]] = xegpu.create_nd_tdesc %arg0[0, 0] : memref<24x32xf32> -> !xegpu.tensor_desc<8x16xf32, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
+  %1 = xegpu.create_nd_tdesc %src[0, 0] : memref<24x32xf32> ->
+    !xegpu.tensor_desc<8x16xf32, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
+  // CHECK: %[[R1:.*]] = xegpu.load_nd %[[R0]] <{l1_hint = #xegpu.cache_hint<cached>, l2_hint = #xegpu.cache_hint<uncached>}> : !xegpu.tensor_desc<8x16xf32, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>> -> vector<8x1xf32>
+  %2 = xegpu.load_nd %1 <{l1_hint = #xegpu.cache_hint<cached>, l2_hint = #xegpu.cache_hint<uncached>}> : !xegpu.tensor_desc<8x16xf32, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>> -> vector<8x1xf32>
+  gpu.return
+}
+
 // CHECK: func @test_store_nd_vc(%[[arg0:.*]]: memref<24x32xf16>) {
 gpu.func @test_store_nd_vc(%dst: memref<24x32xf16>) {
   // CHECK: %[[C:.*]] = arith.constant dense<1.000000e+00> : vector<24x32xf16>
@@ -108,6 +119,19 @@ gpu.func @test_store_nd_vc_2(%dst: memref<24x32xf16>) {
   gpu.return
 }
 
+// store_nd args may have different shapes, validated against sg_map
+// CHECK: func @test_store_nd_vc_3(%[[arg0:.*]]: memref<24x32xf16>) {
+gpu.func @test_store_nd_vc_3(%src: memref<24x32xf16>) {
+   // CHECK: %[[C:.*]] = arith.constant dense<1.000000e+00> : vector<24x2xf16>
+  %1 = arith.constant dense<1.0>: vector<24x2xf16>
+  // CHECK: %[[R0:.*]] = xegpu.create_nd_tdesc %arg0[0, 0] : memref<24x32xf16> -> !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
+  %2 = xegpu.create_nd_tdesc %src[0, 0] : memref<24x32xf16> ->
+    !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
+  // CHECK: xegpu.store_nd %[[C]], %[[R0]] <{l1_hint = #xegpu.cache_hint<write_back>, l2_hint = #xegpu.cache_hint<uncached>}> : vector<24x2xf16>, !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
+  xegpu.store_nd %1, %2 <{l1_hint = #xegpu.cache_hint<write_back>, l2_hint = #xegpu.cache_hint<uncached>}>: vector<24x2xf16>, !xegpu.tensor_desc<24x32xf16, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
+  gpu.return
+}
+
 // CHECK: gpu.func @test_create_update_nd_tdesc_vc(%[[arg0:.*]]: memref<24x32xf32>) {
 gpu.func @test_create_update_nd_tdesc_vc(%src: memref<24x32xf32>) {
   // CHECK: %[[REG:.*]] = xegpu.create_nd_tdesc %arg0[0, 0] : memref<24x32xf32> -> !xegpu.tensor_desc<8x16xf32>
diff --git a/mlir/test/Dialect/XeGPU/invalid.mlir b/mlir/test/Dialect/XeGPU/invalid.mlir
index f8a0d95bd70a27..155131ba9e6d50 100644
--- a/mlir/test/Dialect/XeGPU/invalid.mlir
+++ b/mlir/test/Dialect/XeGPU/invalid.mlir
@@ -32,7 +32,7 @@ func.func @test_create_nd_tdesc_vc_4(%src: memref<2x24x32xf32, 3>) {
 // -----
 func.func @test_prefetch_nd_vc_1(%src: memref<24x32xf16>) {
   %1 = xegpu.create_nd_tdesc %src[0, 0] : memref<24x32xf16> -> !xegpu.tensor_desc<8x16xf16>
-  // expected-error@+1 {{invlid l1_hint: #xegpu.cache_hint<write_back>}}
+  // expected-error@+1 {{invalid l1_hint: #xegpu.cache_hint<write_back>}}
   xegpu.prefetch_nd %1 <{l1_hint = #xegpu.cache_hint<write_back>}>: !xegpu.tensor_desc<8x16xf16>
   return
 }
@@ -51,7 +51,7 @@ func.func @test_prefetch_nd_vc_2(%src: memref<24xf16>) {
 // -----
 func.func @test_load_nd_vc_1(%src: memref<8x16xf16>) {
   %1 = xegpu.create_nd_tdesc %src[0, 0] : memref<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
-  // expected-error@+1 {{invlid l1_hint: #xegpu.cache_hint<write_back>}}
+  // expected-error@+1 {{invalid l1_hint: #xegpu.cache_hint<write_back>}}
   %2 = xegpu.load_nd %1 <{l1_hint = #xegpu.cache_hint<write_back>}>
       : !xegpu.tensor_desc<8x16xf16> -> vector<4x16x2xf16>
   return
@@ -81,7 +81,7 @@ func.func @test_load_nd_vc_3(%src: memref<8x16xf16>) {
 func.func @test_store_nd_vc_1(%dst: memref<24x32xf16>) {
   %1 = arith.constant dense<1.0>: vector<24x32xf16>
   %2 = xegpu.create_nd_tdesc %dst[0, 0] : memref<24x32xf16> -> !xegpu.tensor_desc<24x32xf16>
-  // expected-error@+1 {{invlid l1_hint: #xegpu.cache_hint<streaming>}}
+  // expected-error@+1 {{invalid l1_hint: #xegpu.cache_hint<streaming>}}
   xegpu.store_nd %1, %2 <{l1_hint = #xegpu.cache_hint<streaming>}>: vector<24x32xf16>, !xegpu.tensor_desc<24x32xf16>
   return
 }
@@ -147,7 +147,7 @@ func.func @test_prefetch_vc_2(%src: ui64) {
   %0 = arith.constant dense<[0, 8, 16, 24]> : vector<4xindex>
   %1 = xegpu.create_tdesc %src, %0 : ui64, vector<4xindex>
           -> !xegpu.tensor_desc<4x2xf32, #xegpu.scatter_tdesc_attr<chunk_size = 2>>
-  // expected-error@+1 {{invlid l1_hint: #xegpu.cache_hint<write_back>}}
+  // expected-error@+1 {{invalid l1_hint: #xegpu.cache_hint<write_back>}}
   xegpu.prefetch %1 <{l1_hint = #xegpu.cache_hint<write_back>}>: !xegpu.tensor_desc<4x2xf32, #xegpu.scatter_tdesc_attr<chunk_size = 2>>
   return
 }
@@ -168,7 +168,7 @@ func.func @test_load_gather_vc_2(%src: ui64) {
   %0 = arith.constant dense<1>: vector<4xi1>
   %1 = xegpu.create_tdesc %src, %cst : ui64, vector<4xindex>
         -> !xegpu.tensor_desc<4x2xf32, #xegpu.scatter_tdesc_attr<chunk_size = 2>>
-  // expected-error@+1 {{invlid l1_hint: #xegpu.cache_hint<write_back>}}
+  // expected-error@+1 {{invalid l1_hint: #xegpu.cache_hint<write_back>}}
   %2 = xegpu.load %1, %0 <{l1_hint = #xegpu.cache_hint<write_back>}>
         : !xegpu.tensor_desc<4x2xf32, #xegpu.scatter_tdesc_attr<chunk_size = 2>>, vector<4xi1>
           -> vector<4x2xf32>
@@ -193,7 +193,7 @@ func.func @test_store_scatter_vc_2(%src: ui64) {
   %1 = arith.constant dense<2.9>: vector<4x2xf32>
   %2 = xegpu.create_tdesc %src, %cst : ui64, vector<4xindex>
               -> !xegpu.tensor_desc<4x2xf32, #xegpu.scatter_tdesc_attr<chunk_size = 2>>
-  // expected-error@+1 {{invlid l1_hint: #xegpu.cache_hint<streaming>}}
+  // expected-error@+1 {{invalid l1_hint: #xegpu.cache_hint<streaming>}}
   xegpu.store %1, %2, %0 <{l1_hint = #xegpu.cache_hint<streaming>}> : vector<4x2xf32>,
           !xegpu.tensor_desc<4x2xf32, #xegpu.scatter_tdesc_attr<chunk_size = 2>>, vector<4xi1>
   return

@adam-smnk
Contributor

Just first impressions.
I feel that option 2 is quite unintuitive due to this mix of abstractions. Memory descriptors operate at the workgroup level while individual operations implicitly map to threads. At the same time, it is equally undesirable to lose information, which seems to happen in option 1.
Option 3 is an alternative design choice but it's just moving complexity around.

Ideally, all sg_maps should be consumed at the time of distribution to have a clearer separation of abstraction layers, where you can run dedicated transformations aiming at either SIMD or SIMT.

Currently I'm mostly leaning toward the 4th idea. Subview of a descriptor could work or maybe the load operation itself could contain some optional offsets, e.g. mapping it to thread ID.
Ideally, it would be in a format that allows one to easily identify whether the IR is pre- or post-distribution.
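To sketch what that 4th idea might look like (purely hypothetical: the xegpu.tdesc_subview op, its syntax, and the way the lane offset is derived are invented for illustration and do not exist in the dialect today):

// Cooperative descriptor created once, uniformly, as today.
%tdesc = xegpu.create_nd_tdesc %src[0, 0] : memref<24x32xf32> -> !xegpu.tensor_desc<8x16xf32>
// Hypothetical per-lane view; the offset would be implied by the lane id rather than computed explicitly.
%view = xegpu.tdesc_subview %tdesc : !xegpu.tensor_desc<8x16xf32> -> !xegpu.tensor_desc<8x1xf32>
// The load then has matching shapes again, with no sg_map needed after distribution.
%val = xegpu.load_nd %view : !xegpu.tensor_desc<8x1xf32> -> vector<8x1xf32>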

@kurapov-peter
Contributor Author

I feel that option 2 is quite unintuitive due to this mix of abstractions.

Yup, I think so too.

Ideally, all sg_maps should be consumed at the time of distribution

Yes, that could be a good indication for the pass itself (e.g., as a return condition) to do the distribution.

Currently I'm mostly leaning toward the 4th idea. Subview of a descriptor could work or maybe the load operation itself could contain some optional offsets, e.g. mapping it to thread ID.

Me too, that looks cleanest to me, so I'm suggesting we go there. For an offset, I'm not sure what it would represent in a non-distributed case. You could use a vector of offsets that describes all the offsets for each individual lane, but I think the definition of the ops and sg_map wanted to avoid such an explicit representation. Anyway, there are clearly multiple ways of doing that.

Ideally, it would be in a format that allows one to easily identify whether the IR is pre- or post-distribution.

I think that's not necessary; the transformation in a sense just slices through the dataflow. You could theoretically do that multiple times (this is where it starts resembling tiling, which you could argue it kind of is).

@Jianhui-Li

Based on the nature of the tensor descriptor, I don't think we should distribute it.

The tensor descriptor is a uniform value which all SIMD lanes share. There is only one copy of tensor descriptor being created for all SIMD lanes, and the creation doesn't involve lane id. The 2d block load is a collective operation in which each lane takes the uniform tensor descriptor and loads/stores its own data fragment. The block shape inside the tensor descriptor can't be viewed exactly like a memref shape, which distributes naturally, with each individual thread computing its own address according to its lane id. Instead, the computation of the tensor descriptor involves no lane id, and all lanes should compute the same value, so that at the assembly level the tensor descriptor creation is done by only one thread.

If we distribute it, we need to reverse it back by merging the shape back to the original block size and getting rid of the lane id from the computation. The distribution is also non-trivial, which makes the reversal process complex: the data fragment may have strides along two dimensions, so each thread may generate multiple addresses. I don't see it as worth doing, since I don't see any optimization we want or miss on this kind of distributed form.

This approach comes at the price of a vague IR-to-HW-constructs mapping: now the IR no longer represents something a single logical thread in SIMT owns and acts upon, so the abstraction layering is still broken.

To me, SIMT doesn't mean each thread must know exactly how to compute the address using its own lane id. This is actually what the HW ISA tries to avoid, since per-lane addresses use more registers than one uniform tensor descriptor.

This can potentially have some unexpected implications on other transformations and analyses that we don't see today (e.g., nd_load and nd_store are ops that should have memory side effects, which are not implemented at the moment; it is unclear to me what implications this violation of "ownership" may have).

I don't quite understand the "memory side effects" and "violation of ownership" points enough to debate them. Maybe an example can help here.

In the worst case, most XeGPU-level optimization is target dependent, so we will have to take care of this inconsistent shape issue, which is target specific. I don't expect many target-independent optimizations for which we would want XeGPU to be distributed in perfect SIMT flavor, and I only see problems if we go that way, as stated above. If you have an example, you may point it out.

@kurapov-peter
Contributor Author

Are you suggesting we apply the second approach, ignore that this is very confusing, and just hope that there would be no problems with type mismatching in the future?

I also don't see any contradiction with the proposal (the 4th approach) by the way, which is preferable imo.

The tensor descriptor is a uniform value which all SIMD lanes share. There is only one copy of tensor descriptor being created for all SIMD lanes, and the creation doesn't involve lane id.

Yes, I know. The instruction does not live in a vacuum though. Since we are composing this with vector distribution, the resulting vector of a load would be consumed by some vector IR that describes what a logical thread does. It will have to consume some portion of the data that "belongs" to that thread. That portion comes from a memref as described by the tensor descriptor. So there is already information about how to retrieve the specific data element(s) for a particular lane id, although it is now implicit. I don't see that as an argument against having a view into the descriptor, since the view would make exactly that implicit assignment explicit. The information also wouldn't need to be restored down the pipeline, since you retain the original create_tdesc op (which would now conceptually be performed by all lanes without any distinction or offset).

If we distribute it, we need to reverse it back by merging the shape back to the original block size and getting rid of the lane id from the computation.

I don't think this is necessary; you just need to restore parameters such as the size of the descriptor, which can be done using the sg_map. The assignment of data chunks to lane ids is implicit. Anyway, this is neither what I'm proposing nor is it related to the patch.

I don't quite understand the "memory side effects" and "violation of ownership" points enough to debate them. Maybe an example can help here.

I have no examples. All I'm stating is that I'm not aware of the consequences of messing with the types like that.

@Jianhui-Li

I would not rush to option 4 either. If I understand correctly, the subview op needs to compute the offsets from the lane id, and then create the subview with sizes, offsets, and strides. I like that the approach keeps the original tensor descriptor as a whole, but it requires adding a subview op which doesn't look like other XeGPU ops (which map to concrete hardware operations). I don't know what benefit it can bring at this point other than the IR appearing less "confusing", which is a debatable point.

To me, whether the IR is "confusing" actually depends on how the IR is lowered or optimized. My view is actually the reverse. The type mismatch doesn't bother me that much. But if the IR doesn't model the hardware behavior, say it introduces per-lane offset/size computation which we don't need during the lowering, it causes a different type of confusion that bothers me more. The passes on XeGPU are mostly target-specific, so people like to match the IR with the hardware behavior: each lane takes the whole shape and reads back its own data fragment implicitly, instead of computing its own offsets/sizes. If a transformation/optimization needs to know the data fragment distribution, it can refer to sg_map, which was designed to explicitly describe the data distribution.

I suggest we first go with option 2. When it becomes clear that we really need a subview op, we can revisit it.

@Jianhui-Li

Jianhui-Li commented Dec 20, 2024

Actually, I found this code example, as I believe this is a common issue for collective subgroup operations. nvgpu.ldmatrix chooses to live with the mismatch between the memref shape and the result vector:

%mat = nvgpu.ldmatrix %shm[%fragRow, %fragCol] {numTiles = 4 : i32, transpose = false} : memref<128x32xf16, 3> -> vector<4x2xf16>

@adam-smnk
Contributor

The tensor descriptor is a uniform value which all SIMD lanes share. There is only one copy of tensor descriptor being created for all SIMD lanes, and the creation doesn't involve lane id.

Good to know; that adds more context to the proposed solution, as I initially thought this distribution split could be more arbitrary.
It would be helpful to add these details to the dialect documentation as XeGPU operations start covering both cooperative and distributed abstractions.

After a closer look and going through the detailed explanations (thanks @Jianhui-Li), the main competing options 2 and 4 revolve around making the distribution implicit (-ish, as it is still captured by sg_map) vs. an explicit new no-op subview. The mismatch is fine, as option 2's distributed load_nd has a built-in subview that follows strict semantics captured by the descriptor's map.

The instruction does not live in a vacuum though. Since we are composing this with vector distribution, the resulting vector of a load would be consumed by some vector IR that describes what a logical thread does. It will have to consume some portion of the data that "belongs" to that thread.

That's been my concern; however, so far I couldn't come up with a case that would break without explicit mapping, primarily because I imagine most complex transformations would still occur at the SIMD/cooperative level before SIMT.
If consumers expect a different mapping compared to the one described by the load, then distribution might fail, or an extra step that changes the layout through shared memory might be necessary anyway.

I suggest we first go with option 2. When it becomes clear that we really need a subview op, we can revisit it.

+1 on taking option 2 as the first step, as it is simpler: no extra ops, and the changes don't break any existing code.

@charithaintc
Contributor

I would not rush to option 4 either. If I understand correctly, the subview op needs to compute the offsets from the lane id, and then create the subview with sizes, offsets, and strides. I like that the approach keeps the original tensor descriptor as a whole, but it requires adding a subview op which doesn't look like other XeGPU ops (which map to concrete hardware operations). I don't know what benefit it can bring at this point other than the IR appearing less "confusing", which is a debatable point.

To me, whether the IR is "confusing" actually depends on how the IR is lowered or optimized. My view is actually the reverse. The type mismatch doesn't bother me that much. But if the IR doesn't model the hardware behavior, say it introduces per-lane offset/size computation which we don't need during the lowering, it causes a different type of confusion that bothers me more. The passes on XeGPU are mostly target-specific, so people like to match the IR with the hardware behavior: each lane takes the whole shape and reads back its own data fragment implicitly, instead of computing its own offsets/sizes. If a transformation/optimization needs to know the data fragment distribution, it can refer to sg_map, which was designed to explicitly describe the data distribution.

I suggest we first go with option 2. When it becomes clear that we really need a subview op, we can revisit it.

Agreed. +1 for option 2. XeGPU is primarily designed to faithfully represent HW details in block Load/Store. So in my view, approach 1 violates this philosophy for no clearly apparent benefit.

I don't think this is necessary; you just need to restore parameters such as the size of the descriptor, which can be done using the sg_map. The assignment of data chunks to lane ids is implicit. Anyway, this is neither what I'm proposing nor is it related to the patch.

This argument is not clear to me. Are you saying that sg_map must be present even after the SIMT distribution? That seems like an unnecessary demand because after SIMT we don't really need sg_map. Also what if some user wants to directly generate XeGPU SIMT? I may have misunderstood, but this needs some clarification.

@kurapov-peter force-pushed the xegpu-nd-ops-validation-relaxation branch from 49ea091 to 103db33 on January 14, 2025 at 11:48
@kurapov-peter
Contributor Author

Seems like there's consensus on going with option 2 now. We can merge this then, unless there are additional comments.

This argument is not clear to me. Are you saying that sg_map must be present even after the SIMT distribution? That seems like an unnecessary demand because after SIMT we don't really need sg_map.

By the way, this validation relies on sg_map to check the shapes. There is currently no other source of information to validate the shapes.

@adam-smnk
Contributor

That seems like an unnecessary demand because after SIMT we don't really need sg_map. Also what if some user wants to directly generate XeGPU SIMT?

Seems to me that sg_map is essentially required which, IMO, is fine on its own. We just need to clearly document and/or enforce it somehow such that users can easily figure out what to do.

@Jianhui-Li

That seems like an unnecessary demand because after SIMT we don't really need sg_map. Also what if some user wants to directly generate XeGPU SIMT?

Seems to me that sg_map is essentially required which, IMO, is fine on its own. We just need to clearly document and/or enforce it somehow such that users can easily figure out what to do.

I am OK with keeping sg_map required as well. It simplifies the IR validation, and the user (as a compiler component) should know the SIMT distribution rule exactly, so it won't be an extra burden to attach the sg_map attribute to XeGPU ops.

// each dimension.
static bool isArgShapesValid(ArrayRef<int64_t> descShape,
ArrayRef<int64_t> valShape, SGMapAttr sgMap) {
if (descShape == valShape)
Contributor

what if descShape == valShape and sgMap is valid? does it mean the sgMap will be discarded?

Contributor Author

Sorry, I didn't get the question. If the shapes are equal, that means the nd load/store is valid and not distributed (if there's an sg_map, only the sg_map's validity is checked, by SGMapAttr::verify). If the shapes don't match, then we have either a distributed or an invalid case. For the distributed case we check that the ranks are the same, the sg_map is present, and the scaled values for each dimension match.

Contributor

NVM, it seems a dumb question. It seems to me that there are currently three stages: 1) pure SIMD code: sgMap == null and descShape == valShape; 2) a valid sgMap is attached to guide the lowering, but the code is not rewritten yet, so descShape == valShape and sgMap != null, but the sgMap is not effective yet; 3) the code is rewritten, so descShape != valShape and sgMap != null, and the sgMap is now effective.

Contributor Author

Yup, those are all valid states
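
For reference, a sketch of the three stages in IR terms, assuming an 8x16xf32 descriptor with wi_layout = [1, 16] and wi_data = [1, 1] (the %td0/%td1/%td2 values stand in for descriptors created elsewhere):

// 1) Pure SIMD: no sg_map, shapes match.
%a = xegpu.load_nd %td0 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
// 2) sg_map attached, code not rewritten yet: shapes still match, sg_map not effective yet.
%b = xegpu.load_nd %td1 : !xegpu.tensor_desc<8x16xf32, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>> -> vector<8x16xf32>
// 3) After distribution: shapes differ; the verifier checks wi_layout * value shape == descriptor shape (1*8 == 8, 16*1 == 16).
%c = xegpu.load_nd %td2 : !xegpu.tensor_desc<8x16xf32, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>> -> vector<8x1xf32>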

@@ -210,13 +236,13 @@ LogicalResult PrefetchNdOp::verify() {
return emitOpError("Expects a non-scattered TensorDesc.\n");

if (!isReadHintOrNone(getL1HintAttr()))
return emitOpError("invlid l1_hint: ") << getL1HintAttr();
return emitOpError("invalid l1_hint: ") << getL1HintAttr();

Contributor

Thanks for fixing these typos.

@chencha3
Contributor

It looks to me like the current implementation is following option 2, am I right?
I agree with Adam that the sgMap should be gone in a clean design. And I also agree with Jianhui that it is not convenient to distribute the TensorDesc at this stage due to its nature. I like the idea of option 4: it keeps the original TensorDesc info and also gets rid of the sgMap after lowering. But at the same time, I feel it also has a drawback. Can users create arbitrary subviews, or a set of them? Based on what I know so far, I feel the answer is no. It seems there is only one view available, which is implicitly defined by the intrinsic. That means the sgMap is also fixed, and there is only one valid mapping. Can we drop the sgMap in the future and derive it from kernel attributes? We may use the presence of the VectorComputeINTEL attribute to indicate that the kernel is for SIMD, which will trigger the SIMD logic; otherwise it is a SIMT kernel and triggers the SIMT logic. Is it safe?

@kurapov-peter
Contributor Author

It looks to me like the current implementation is following option 2, am I right?

Yes, these validation changes unblock option 2.

Can we drop the sgMap in the future and derive it from kernel attributes? We may use the presence of the VectorComputeINTEL attribute to indicate that the kernel is for SIMD, which will trigger the SIMD logic; otherwise it is a SIMT kernel and triggers the SIMT logic. Is it safe?

Deriving from an attribute likely means the map would be essentially the same for all operations. I suspect this might be too restrictive, in presence of packing for example. Other than that, I don't see any technical problems with getting this information not from the type attribute but from somewhere else like a kernel attribute (which would probably be the same sg map but attached to a kernel?).

@chencha3
Contributor

chencha3 commented Jan 15, 2025

Deriving from an attribute likely means the map would be essentially the same for all operations. I suspect this might be too restrictive, in presence of packing for example. Other than that, I don't see any technical problems with getting this information not from the type attribute but from somewhere else like a kernel attribute (which would probably be the same sg map but attached to a kernel?).

It is not necessary to require all operations to have the same mapping rule, since they have different semantics. But it does require that each op has exactly one mapping rule.

@@ -86,6 +86,17 @@ gpu.func @test_load_nd_vc_2(%src: memref<8x16xf16>) {
gpu.return
}

// load_nd args may have different shapes, validated against sg_map
// CHECK: func @test_load_nd_vc_3(%[[arg0:.*]]: memref<24x32xf32>) {
gpu.func @test_load_nd_vc_3(%src: memref<24x32xf32>) {


@chencha3 do you remember what the 'vc' suffix stands for? I feel like for SIMT-ish examples it is not relevant :)

@kurapov-peter merged commit fa6f88a into llvm:main on January 22, 2025
8 checks passed
@kurapov-peter deleted the xegpu-nd-ops-validation-relaxation branch on January 22, 2025 at 17:03