
[mlir][linalg] Simplify vectorization test output using -canonicalize -cse #138265


Description

@banach-space

The Linalg vectorization tests are currently quite complex and hard to navigate (see full list with links below). One area I’d like to improve is simplifying the expected test output by updating the mlir-opt invocation to include:

  • -canonicalize -cse.
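
In practice this means extending each test's RUN line. As a minimal sketch, assuming a test that already drives vectorization via the transform interpreter (the exact invocation varies per file):

  // RUN: mlir-opt %s -transform-interpreter -canonicalize -cse -split-input-file | FileCheck %s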

Why add -cse?

CSE alone is a huge win. It eliminates redundant constants like:

%c0 = arith.constant 0 : index
%c0_1 = arith.constant 0 : index
%c0_2 = arith.constant 0 : index

Without CSE, test updates often require matching multiple SSA values that all represent the same constant, which adds noise and maintenance overhead.

Why add -canonicalize?

Adding -canonicalize helps simplify tensor.dim, affine.apply, and other commonly duplicated constructs.
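
For example (a hedged sketch, assuming the upstream tensor.dim folding patterns), canonicalization rewrites a tensor.dim of a tensor.empty into the corresponding size operand, which CSE alone cannot do:

  %4 = tensor.empty(%0, %1) : tensor<?x?xf32>
  %dim = tensor.dim %4, %c0 : tensor<?x?xf32>  // folds to %0 after -canonicalize

This is what eliminates the remaining tensor.dim ops in the final output below.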

Current output from the vectorizer

  func.func @test_masked_vectorize_dynamic_pad(%arg0: tensor<?x?xf32>, %arg1: index, %arg2: index) -> tensor<?x?xf32> {
    %cst = arith.constant 4.243000e+01 : f32
    %c0 = arith.constant 0 : index
    %c0_0 = arith.constant 0 : index
    %dim = tensor.dim %arg0, %c0_0 : tensor<?x?xf32>
    %0 = affine.apply #map()[%arg1, %dim]
    %c1 = arith.constant 1 : index
    %dim_1 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
    %1 = affine.apply #map()[%arg2, %dim_1]
    %c0_2 = arith.constant 0 : index
    %c0_3 = arith.constant 0 : index
    %dim_4 = tensor.dim %arg0, %c0_3 : tensor<?x?xf32>
    %c1_5 = arith.constant 1 : index
    %dim_6 = tensor.dim %arg0, %c1_5 : tensor<?x?xf32>
    %2 = vector.create_mask %dim_4, %dim_6 : vector<2x4xi1>
    %3 = vector.mask %2 { vector.transfer_read %arg0[%c0_2, %c0_2], %cst {in_bounds = [true, true]} : tensor<?x?xf32>, vector<2x4xf32> } : vector<2x4xi1> -> vector<2x4xf32>
    %4 = tensor.empty(%0, %1) : tensor<?x?xf32>
    %c0_7 = arith.constant 0 : index
    %c0_8 = arith.constant 0 : index
    %dim_9 = tensor.dim %4, %c0_8 : tensor<?x?xf32>
    %c1_10 = arith.constant 1 : index
    %dim_11 = tensor.dim %4, %c1_10 : tensor<?x?xf32>
    %5 = vector.create_mask %dim_9, %dim_11 : vector<2x4xi1>
    %6 = vector.mask %5 { vector.transfer_write %3, %4[%c0_7, %c0_7] {in_bounds = [true, true]} : vector<2x4xf32>, tensor<?x?xf32> } : vector<2x4xi1> -> tensor<?x?xf32>
    return %6 : tensor<?x?xf32>
  }

There is a lot of duplication of arith.constant and tensor.dim.

Output from the vectorizer after adding -cse:

  func.func @test_masked_vectorize_dynamic_pad(%arg0: tensor<?x?xf32>, %arg1: index, %arg2: index) -> tensor<?x?xf32> {
    %cst = arith.constant 4.243000e+01 : f32
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %arg0, %c0 : tensor<?x?xf32>
    %0 = affine.apply #map()[%arg1, %dim]
    %c1 = arith.constant 1 : index
    %dim_0 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
    %1 = affine.apply #map()[%arg2, %dim_0]
    %2 = vector.create_mask %dim, %dim_0 : vector<2x4xi1>
    %3 = vector.mask %2 { vector.transfer_read %arg0[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<?x?xf32>, vector<2x4xf32> } : vector<2x4xi1> -> vector<2x4xf32>
    %4 = tensor.empty(%0, %1) : tensor<?x?xf32>
    %dim_1 = tensor.dim %4, %c0 : tensor<?x?xf32>
    %dim_2 = tensor.dim %4, %c1 : tensor<?x?xf32>
    %5 = vector.create_mask %dim_1, %dim_2 : vector<2x4xi1>
    %6 = vector.mask %5 { vector.transfer_write %3, %4[%c0, %c0] {in_bounds = [true, true]} : vector<2x4xf32>, tensor<?x?xf32> } : vector<2x4xi1> -> tensor<?x?xf32>
    return %6 : tensor<?x?xf32>
  }

No duplication of arith.constant, but tensor.dim is still unnecessarily duplicated.

Output from the vectorizer after adding -canonicalize -cse:

  func.func @test_masked_vectorize_dynamic_pad(%arg0: tensor<?x?xf32>, %arg1: index, %arg2: index) -> tensor<?x?xf32> {
    %c1 = arith.constant 1 : index
    %cst = arith.constant 4.243000e+01 : f32
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %arg0, %c0 : tensor<?x?xf32>
    %0 = affine.apply #map()[%arg1, %dim]
    %dim_0 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
    %1 = affine.apply #map()[%arg2, %dim_0]
    %2 = vector.create_mask %dim, %dim_0 : vector<2x4xi1>
    %3 = vector.mask %2 { vector.transfer_read %arg0[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<?x?xf32>, vector<2x4xf32> } : vector<2x4xi1> -> vector<2x4xf32>
    %4 = tensor.empty(%0, %1) : tensor<?x?xf32>
    %5 = vector.create_mask %0, %1 : vector<2x4xi1>
    %6 = vector.mask %5 { vector.transfer_write %3, %4[%c0, %c0] {in_bounds = [true, true]} : vector<2x4xf32>, tensor<?x?xf32> } : vector<2x4xi1> -> tensor<?x?xf32>
    return %6 : tensor<?x?xf32>
  }

No duplication :)
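
As a side effect, the FileCheck expectations become much shorter. A hypothetical sketch of what the CHECK lines for the function above could look like (the actual lines in the patch may differ):

  // CHECK-LABEL: func.func @test_masked_vectorize_dynamic_pad
  // CHECK-DAG:     %[[C0:.*]] = arith.constant 0 : index
  // CHECK-DAG:     %[[C1:.*]] = arith.constant 1 : index
  // CHECK:         %[[DIM0:.*]] = tensor.dim %{{.*}}, %[[C0]]
  // CHECK:         %[[DIM1:.*]] = tensor.dim %{{.*}}, %[[C1]]
  // CHECK:         %[[MASK:.*]] = vector.create_mask %[[DIM0]], %[[DIM1]] : vector<2x4xi1>
  // CHECK:         vector.mask %[[MASK]] { vector.transfer_read
  // CHECK:         vector.mask %{{.*}} { vector.transfer_write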

Pros vs Cons

Pros:

  • Easier to focus on the semantic intent of vectorization output.
  • Reduces test maintenance (less duplication, fewer fragile SSA names).
  • Aligns with FileCheck best practices: test only what is strictly necessary.

Cons:

  • Tests will now depend on CSE and canonicalization, making them indirectly sensitive to unrelated changes.
  • Tests will no longer isolate vectorization alone; they will validate a short pipeline of transformations.

Next steps

While there are trade-offs, I believe this change will be beneficial overall.

My first patch is here:

Assuming there are no strong objections, I’d like to use this issue for discussion and long-term context.
CC @dcaballe @hanhanW - you've reviewed most of my patches in this area. Anyone else I should include?

Thanks!

List of test files
