Commit 38c4c77

[ET-VK] Tuning native layer norm local workgroup size to improve thread occupancy during reduce.
Pull Request resolved: #9984

This diff tunes the local workgroup size of the native layer norm operation in the Vulkan backend of ExecuTorch to improve thread occupancy during the reduce phase.

ghstack-source-id: 277933491
Differential Revision: [D72581293](https://our.internmc.facebook.com/intern/diff/D72581293/)
1 parent dd8fe36 commit 38c4c77

File tree

1 file changed: +12 −1 lines

backends/vulkan/runtime/graph/ops/impl/NativeLayerNorm.cpp

Lines changed: 12 additions & 1 deletion
```diff
@@ -84,7 +84,18 @@ void add_native_layer_norm_node(
   std::vector<int64_t> in_sizes = t_input->sizes();

   utils::uvec3 global_size = t_out->logical_limits();
-  utils::uvec3 local_size = graph.create_local_wg_size(global_size);
+  utils::uvec3 local_size;
+
+  // The shader uses a shared memory scale factor > 1 when the dispatch is
+  // larger than the maximum workgroup size. Setting the workgroup size in
+  // the X axis to the maximum WG size gives the best thread utilization.
+  if (global_size[0] > 64) {
+    local_size = {64, 1, 1};
+  } else {
+    // If the thread count in the X axis is smaller than or equal to the
+    // maximum WG size, let the function pick the best WG size.
+    local_size = graph.create_local_wg_size(global_size);
+  }

   std::string kernel_name("native_layer_norm");
   kernel_name.reserve(kShaderNameReserve);
```
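The selection logic above can be sketched as a standalone function. This is a minimal, hypothetical sketch: `uvec3`, `MAX_WG_SIZE_X`, and `default_local_wg_size` are stand-ins invented here for `utils::uvec3`, the hard-coded 64, and `graph.create_local_wg_size()`; the real ExecuTorch heuristic is more involved.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Stand-in for utils::uvec3.
using uvec3 = std::array<uint32_t, 3>;

// Mirrors the hard-coded 64 in the commit (assumed maximum WG size in X).
constexpr uint32_t MAX_WG_SIZE_X = 64;

// Placeholder for graph.create_local_wg_size(): assumed here to simply
// clamp the X axis to the maximum, purely for illustration.
uvec3 default_local_wg_size(const uvec3& global_size) {
  return {std::min(global_size[0], MAX_WG_SIZE_X), 1, 1};
}

uvec3 pick_local_wg_size(const uvec3& global_size) {
  if (global_size[0] > MAX_WG_SIZE_X) {
    // Large dispatch: fill the X axis so every thread in the workgroup
    // stays busy during the reduce phase.
    return {MAX_WG_SIZE_X, 1, 1};
  }
  // Small dispatch: defer to the default heuristic.
  return default_local_wg_size(global_size);
}
```

The design point of the change: for dispatches wider than the maximum workgroup size, a fixed `{64, 1, 1}` workgroup maximizes occupancy during the reduce, while smaller dispatches still fall back to the generic heuristic.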
