[RFC][flang] Add support for assumed-shape dummy arrays repacking. #127147

vzakhari · 2025-02-13T23:37:45Z

This is a document describing why and how to add support for repacking
of assumed-shape dummy arrrays to provide more efficient data cache.
It proposes adding new FIR operations and outlines the compiler flow
handling these operations.
I would like to hear feedback on all of it, but especially on:

The possibility of detecting safeness of the repacking
in the context of OpenACC/OpenMP. If it is not possible
to do the runtime checks to determine safety, then
there is not need to add the TempCopyIsSafe attributes
to the instruction.
Whether it is possible to preserve the debug information
in cases where fir.pack_array is sunk after [hl]fir.declare,
so that before the fir.pack_array a debugger will refer
to the values in the original array, and after fir.pack_array
it will refer to the copy.

This is a document describing why and how to add support for repacking of assumed-shape dummy arrrays to provide more efficient data cache. It proposes adding new FIR operations and outlines the compiler flow handling these operations. I would like to hear feedback on all of it, but especially on: * The possibility of detecting safeness of the repacking in the context of OpenACC/OpenMP. If it is not possible to do the runtime checks to determine safety, then there is not need to add the `TempCopyIsSafe` attributes to the instruction. * Whether it is possible to preserve the debug information in cases where `fir.pack_array` is sunk after `[hl]fir.declare`, so that before the `fir.pack_array` a debugger will refer to the values in the original array, and after `fir.pack_array` it will refer to the copy.

kiranchandramohan · 2025-02-16T10:38:38Z

Thanks @vzakhari for the detailed document about repacking arrays.

We implemented a loop versioning pass (https://discourse.llvm.org/t/rfc-loop-versioning-for-unit-stride/68605, https://flang.llvm.org/docs/FlangCommandLineReference.html#cmdoption-flang-fversion-loops-for-stride) that versions the loop for better cache locality, prevent scatter/gathers etc. We get around 36% improvement in the spec2017 roms benchmark due to this pass. Since this pass serves a similar usecase as repack arrays, the pass can be switched OFF when 'repack-arrays' is ON.

BTW, Is the performance data that you collected with the loop versioning pass ON? Or is it purely artificial?

There were some comments about issues with versioning/repacking with the asynchronous attribute and the contiguous attribute. It might be good to add these points to the document. https://discourse.llvm.org/t/transformations-to-aid-optimizer-for-subroutines-functions-with-assumed-shape-arguments/66447

I have added AMD engineers working on offloading to have a look for the offloading questions.

…sioning.

vzakhari · 2025-02-18T20:16:02Z

Thanks @vzakhari for the detailed document about repacking arrays.

Thank you for reading, Kiran!

We implemented a loop versioning pass (https://discourse.llvm.org/t/rfc-loop-versioning-for-unit-stride/68605, https://flang.llvm.org/docs/FlangCommandLineReference.html#cmdoption-flang-fversion-loops-for-stride) that versions the loop for better cache locality, prevent scatter/gathers etc. We get around 36% improvement in the spec2017 roms benchmark due to this pass. Since this pass serves a similar usecase as repack arrays, the pass can be switched OFF when 'repack-arrays' is ON.

I think we may get better results with both array repacking and the loop versioning. I added a section about this into the document.

BTW, Is the performance data that you collected with the loop versioning pass ON? Or is it purely artificial?

I collected the performance numbers using gfortran, so they are not artificial :) Flang does not have array repacking right now, but the manual REPACKING version shows the same speed-up as gfortran's array repacking. The times are the same regardless of the loop versioning. I believe in capacita case none of the compilers vectorizes the hot loop, because of unknown data dependencies, so all the speed-up from the array repacking is coming from better data cache utilization.

There were some comments about issues with versioning/repacking with the asynchronous attribute and the contiguous attribute. It might be good to add these points to the document. https://discourse.llvm.org/t/transformations-to-aid-optimizer-for-subroutines-functions-with-assumed-shape-arguments/66447

Thanks for the link! The document covers ASYNCHRONOUS and VOLATILE cases, and it also mentions that the array repacking should be applied to the dummy arguments that are CONTIGUOUS.

I have added AMD engineers working on offloading to have a look for the offloading questions.

Thanks!

tblah

Thanks for all of your work so far on this proposal. I like the design and think it will be a lot easier to extend than the existing loop versioning pass.

You raise a very interesting point with the visibility of side effects to array variables accessed via multiple threads. I agree making this change could change the result of such programs, but I wonder if the ordering of the visibility of different array accesses between threads can be depended on by conformant programs in the absence of memory barriers, ASYNCHRONOUS, or VOLATILE. You are right that we cannot know inside of a function whether threading is enabled, but if the visibility of memory accesses is only guaranteed by barriers, we can inspect and see whether any memory barriers are present inside of this function (doing this might need new dialect interfaces).

Other than shared variables, another possibility in openmp is that array variables are privatized. Then a copy is made of the source variable already. This is another case to be considered in the optimization passes you mentioned.

tblah · 2025-02-19T11:00:37Z

flang/docs/ArrayRepacking.md

+* `innermost` - tells that the repacking has to be done iff the array is not contiguous in the innermost dimension. This also describes what type of continuity can be expected from `%new_var`, i.e. `innermost` means that the resulting array is definitely contiguous in the innermost dimension, but may be non-contiguous in other dimensions (unless additional analysis proves otherwise). For 1-D arrays, `innermost` attribute is not valid.
+* `no_copy` - indicates that, in case a temporary array is created, `%var` to `%new_var` copy is not required (`intent(out)` dummy argument case).
+* `heuristics`
+  * `loop-only` - `fir.pack_array` can be optimized away, if the array is not used in a loop.


When would this not be true?

A case when an array is not used in a loop but may still benefit from repacking is when an assumed-shape dummy array is passed to another subprogram as an assumed-shape dummy argument. It may be profitable to do the repacking early in some cases when HLFIR/FIR inlining is not enabled (for whatever reason), e.g.:

subroutine s1(x) real :: x(:) call s2(x) end subroutine s2(x) real :: x(:) do i=1,100 call s3(x) end end subroutine s3(x) real :: x(:) do i=1,1000 <use array x> end end

Postponing the dynamic repacking to subroutine s3 is suboptimal, because the repacking will be done 100 times. Repacking the array once in subroutine s1 is better.

With HLFIR/FIR inlining and further fir.pack_array optimizations this case should work well even with loop-only heuristic, because the optimization passes should figure out that the array is eventually used in a loop, and fir.pack_array's in subroutines s2 and s3 are redundant.

I would like to keep the flexibility for the users to pick different strategies for repacking depending on their needs and their ability to use inlining and other optimizations.

Sounds good to me. Thank you for the detailed explanation!

tblah · 2025-02-19T11:05:26Z

flang/docs/ArrayRepacking.md

+  * `max-size` - constant integer attribute specifying the maximum byte size of an array that is eligible for repacking.
+  * `max-element-size` - constant integer attribute specifying the maximum byte element-size of an array that is eligible for repacking.
+  * `min-stride` - constant integer attribute specifying the minimum byte stride of the innermost dimension of an array that is eligible for repacking.


Do you anticipate setting these differently for different pack/unpack operations? I would have thought these could be expressed more simply as a pass parameter.

Yes, I would like to provide the flexibility to choose different repacking settings for different compilation modules. Eventually, when cross-module HLFIR/FIR inlining is working, the operations from different modules may have different settings.

tblah · 2025-02-19T11:10:40Z

flang/docs/ArrayRepacking.md

+* If there is no `no_copy`:
+  * `MemRead` effect on unknown value to indicate potential read from the original array.
+    * [TBD] we can relax that by having an additional argument taking `fir.box_addr %var` value, though, this adds some redundancy to the argument list.
+  * `MemWrite` effect on unknown value to indicate potential write into the temporary array.


I don't think we need memory effects to model that this operation modifies the SSA value that it creates.

I understand that it might be creating a descriptor and the write is actually in the allocated memory, but I think that still counts as initializing the descriptor in this case. For example, the initialization of the allocated memory is dead if the ssa value for the descriptor is dead.

I think the alloc effect is needed. The most conservative way to lower a pair of pack_array/unpack_array that use the stack allocation is to introduce stacksave/stackrestore for them. If there is no alloc effect, then two pack_array ops may potentially be reordered by some pass like this:

%1 = fir.pack_array %0 = fir.pack_array fir.unpack_array %1 fir.unpack_array %0

This will result in incorrect stackrestore behavior.

Moreover, see comments below on the passes that may move operations around.

This makes sense. Thank you for explaining.

tblah · 2025-02-19T11:13:02Z

flang/docs/ArrayRepacking.md

+
+Alias analysis:
+
+* For the purpose of alias analysis `fir.pack_array` should be considered a pass-though operation, and its value should be treated as `MayAlias` with the original array.


I think this is too pessimistic. I think this can be modeled as a new allocation.

I agree. Either the new array is a MustAlias (because the memory modifications to the copy were intended to be visible modifications to original array) or it is a NoAlias because we decide to model it as a copy. Because it is possible no copy is created, it seems that a MustAlias is probably the better choice. In which case we should not need the MemRead/MemWrite effects on the pack operation itself (we may still want some other resource though to capture the potential allocation though)

Okay, I think it might be clearer here to just say that fir.pack_array should be considered a pass-through operation for the purpose of alias analysis, i.e. when looking for the source of a pointer the alias analysis should just pass-through from the result box to the source box and proceed as usual. The final MayAlias or MustAlias will depend if they arrive at the same source but with different address arithmetic in the middle. Does it sound correct?

I strongly believe MemRead/MemWrite effects are needed on the new operations.

Consider the following example, where the two subroutines are compiled with and w/o packing and FIR inlining happens:

subroutine s1(x) real :: x(:) // not packed x(1) = 1.0 call s2(x) // inlined end subroutine s2(x) real :: x(:) // packed print *, x(1) end

I will express it in this pseudo-MLIR:

s1: %orig_box = ... %orig_base = fir.box_addr %orig_box fir.store 1.0 to %orig_base %packed_box = fir.pack_array %orig_box %packed_base = fir.box_addr %packed_box %val = fir.load %packed_base print %val fir.unpack_array %packed_box to %orig_box

A pass cannot hoist fir.pack_array above the fir.store (or sink fir.store below fir.pack_array), so fir.pack_array needs MemRead effect on the original array. Note that MemAlloc effect is not enough here: if fir.store accesses a memory before fir.pack_array, then the fact that fir.pack_array allocates some new memory does not prevent the reordering.

I believe it is okay if fir.pack_array is reordered with respect of a fir.load %orig_base - the packing itself does not affect the value of the original array. Only the unpacking does.

Reordering of the array accesses that are located inside the pack/unpack region with respect to fir.pack_array is not possible due to SSA dominance.

So it seems MemAlloc + MemRead (on the original array) is enough for fir.pack_array.

An example for fir.unpack_array:

s1: %orig_box = ... %orig_base = fir.box_addr %orig_box %packed_box = fir.pack_array %orig_box ... fir.unpack_array %packed_box to %orig_box load/store through %orig_base

MemFree on fir.unpack_array is not enough to prevent moving a load/store through %orig_base across it, so fir.unpack_array needs MemFree + MemWrite (on the original array).

MemFree should prevent reordering fir.unpack_array with respect to any load/store located between pack/unpack.

Please let me know if it makes sense. I think it is an important point to settle.

Okay yes. I think I got confused because I took the idea of the aliasing too literally. It doesn't literally possibly alias because it a different array. But I agree that

fir.pack_array allocates, and it reads from the old array (which has to be a generic effect because of the indirection through a box)

fir.unpack_array deallocates, and writes to the original array (generic effect not just the box ssa value). I think it probably also should have MemRead because the read is from the box address not the box ssa value itself.

Thank you for taking the time to write that out in psudo-MLIR. It made it much easier to think about.

tblah · 2025-02-19T11:22:14Z

flang/docs/ArrayRepacking.md

+
+So the most conservative implementation of the predicate generator may be to always produce `false` value. Flang lowering may attach any number of such attributes to `fir.pack_array` depending on the compilation context and options.
+
+[TBD] define `TempCopyIsSafe` attribute interface so that OpenACC/OpenMP dialects can provide their specific attributes, which can be used to generate static/runtime checks for safety of the temporary copy in particular context.


As you pointed out earlier, what do we do about cases where the runtime is not going to be linked (or at least we don't know that it will be linked from the context of compiling this translation unit)? I guess this should just be best effort?

Right, we can only rely on OpenACC/OpenMP runtime support if the libraries are eventually linked. So we can only generate the extra stuff when OpenACC/OpenMP is enabled consistently for all compilation modules.

I tried to explain that in the Correctness requirements section with When array repacking is enabled, Flang should guarantee correct program behavior when OpenACC/OpenMP features are explicitly enabled during the compilation.

tblah · 2025-02-19T11:26:42Z

flang/docs/ArrayRepacking.md

+
+It is unsafe to create temporaries of assumed-shape dummy arrays that have `TARGET` attribute, because they can be accessed during the invocation of the subprograms not using direct reference of the dummy argument. Lowering must never produce the new operations for such dummy arguments. [TBD] a user option might be provided to override the default safe behavior.
+
+The copy creation is also restricted for `ASYNCHRONOUS` and `VOLATILE` arguments. Such dummy arguments might be changed during the execution of their subprogram in an unpredictable manner, so creating a copy for them might be incorrect (Fortran 2023, Note 5 of section 15.5.2.5).


I wonder if openmp/acc programs that are broken by adding array packing/unpacking are really correct without using these attributes. Particularly in the case where the multithreaded element is in a different translation unit (or even subroutine/function).

I think Example #2 below is correct without ASYNCHRONOUS/VOLATILE. So there are cases where repacking breaks it. Let me know if I misunderstood the comment, and you were talking about some other case.

razvanlupusoru

Very well thought out proposal! Thank you :)

razvanlupusoru · 2025-02-19T16:20:56Z

flang/docs/ArrayRepacking.md

+
+Alias analysis:
+
+* For the purpose of alias analysis `fir.pack_array` should be considered a pass-though operation, and its value should be treated as `MayAlias` with the original array.


I agree. Either the new array is a MustAlias (because the memory modifications to the copy were intended to be visible modifications to original array) or it is a NoAlias because we decide to model it as a copy. Because it is possible no copy is created, it seems that a MustAlias is probably the better choice. In which case we should not need the MemRead/MemWrite effects on the pack operation itself (we may still want some other resource though to capture the potential allocation though)

razvanlupusoru · 2025-02-19T16:44:40Z

flang/docs/ArrayRepacking.md

+
+The attribute interface will be used during lowering of `fir.pack_array` to generate the predicate and make the packing conditional and safe. This will allow applying repacking correctly in programs compiled with `-fopenmp`, and get the benefits of repacking in the serial parts of those programs.
+
+Similarly, the OpenACC MLIR dialect may provide such an attribute to check if a device copy has been created for an array that is about to be packed, and prevent the repacking for cases like this:


A couple of thoughts on this:

I agree about having the option to have a programming model guard to prevent packing/repacking.

For OpenACC, I do not want to avoid repacking. I believe for OpenACC I would like the following:

A flag in the array descriptor noting it is a "repacked" view - this way data mapping can effectively deal with such a case when present clause is involved.

API in MappableType interface which allows a dialect to provide information whether we have a re-layout copy.

I wonder what you have in mind for dealing with the present clause case - can you please elaborate?

On the descriptor flag, I guess it is not the only option. For example, if we let OpenACC runtime track the "association" between the original and the copy, it may keep this in the bookkeeping tables. But that depends on how you imagine to use this information (hence the question above). For example, we may define the interface such that it provides two methods to generate this:

bool accPredicate = canRepack_to_acc(base_addr, size); bool otherPredicates = ...; if (accPredicate && otherPredicates) { repacked = _FortranARepack(original); reportRepack_to_acc(repacked, original); }

I am also completely unsure what we can do about the second example below (with repacking_in and repacking_out). I think we cannot prevent repacking neither statically nor dynamically, so it will happen. It seems like the only option to fail gracefully is to report a runtime error at the point where the copy array is deallocated while still being present on the device.

So, as Tom said, it looks like we cannot always guarantee safety of the repacking, and can only try our best to reduce the probability of the error and report as many of them as possible (in the compiler and in the runtime).

vzakhari

Tom, Razvan, thank you a lot for reviewing this! I had to think a lot on my replies, so please excuse me for the delayed response :)

Overall, I still have lots of uncertainty about our ability to make the repacking reliable for OpenACC/OpenMP. So maybe you will agree, if I change the requirement on this point to something more relaxed, like "the compiler/runtime should do its best to minimize problems caused by repacking and report errors as early as possible, but there is no guarantee that repacking works for all programs that involve multithreading/offload".

vzakhari · 2025-02-21T17:21:06Z

flang/docs/ArrayRepacking.md

+* `innermost` - tells that the repacking has to be done iff the array is not contiguous in the innermost dimension. This also describes what type of continuity can be expected from `%new_var`, i.e. `innermost` means that the resulting array is definitely contiguous in the innermost dimension, but may be non-contiguous in other dimensions (unless additional analysis proves otherwise). For 1-D arrays, `innermost` attribute is not valid.
+* `no_copy` - indicates that, in case a temporary array is created, `%var` to `%new_var` copy is not required (`intent(out)` dummy argument case).
+* `heuristics`
+  * `loop-only` - `fir.pack_array` can be optimized away, if the array is not used in a loop.


A case when an array is not used in a loop but may still benefit from repacking is when an assumed-shape dummy array is passed to another subprogram as an assumed-shape dummy argument. It may be profitable to do the repacking early in some cases when HLFIR/FIR inlining is not enabled (for whatever reason), e.g.:

subroutine s1(x) real :: x(:) call s2(x) end subroutine s2(x) real :: x(:) do i=1,100 call s3(x) end end subroutine s3(x) real :: x(:) do i=1,1000 <use array x> end end

Postponing the dynamic repacking to subroutine s3 is suboptimal, because the repacking will be done 100 times. Repacking the array once in subroutine s1 is better.

With HLFIR/FIR inlining and further fir.pack_array optimizations this case should work well even with loop-only heuristic, because the optimization passes should figure out that the array is eventually used in a loop, and fir.pack_array's in subroutines s2 and s3 are redundant.

I would like to keep the flexibility for the users to pick different strategies for repacking depending on their needs and their ability to use inlining and other optimizations.

vzakhari · 2025-02-21T17:22:34Z

flang/docs/ArrayRepacking.md

+  * `max-size` - constant integer attribute specifying the maximum byte size of an array that is eligible for repacking.
+  * `max-element-size` - constant integer attribute specifying the maximum byte element-size of an array that is eligible for repacking.
+  * `min-stride` - constant integer attribute specifying the minimum byte stride of the innermost dimension of an array that is eligible for repacking.


Yes, I would like to provide the flexibility to choose different repacking settings for different compilation modules. Eventually, when cross-module HLFIR/FIR inlining is working, the operations from different modules may have different settings.

vzakhari · 2025-02-21T18:57:26Z

flang/docs/ArrayRepacking.md

+* If there is no `no_copy`:
+  * `MemRead` effect on unknown value to indicate potential read from the original array.
+    * [TBD] we can relax that by having an additional argument taking `fir.box_addr %var` value, though, this adds some redundancy to the argument list.
+  * `MemWrite` effect on unknown value to indicate potential write into the temporary array.


I think the alloc effect is needed. The most conservative way to lower a pair of pack_array/unpack_array that use the stack allocation is to introduce stacksave/stackrestore for them. If there is no alloc effect, then two pack_array ops may potentially be reordered by some pass like this:

%1 = fir.pack_array %0 = fir.pack_array fir.unpack_array %1 fir.unpack_array %0

This will result in incorrect stackrestore behavior.

Moreover, see comments below on the passes that may move operations around.

vzakhari · 2025-02-22T00:21:28Z

flang/docs/ArrayRepacking.md

+
+Alias analysis:
+
+* For the purpose of alias analysis `fir.pack_array` should be considered a pass-though operation, and its value should be treated as `MayAlias` with the original array.


Okay, I think it might be clearer here to just say that fir.pack_array should be considered a pass-through operation for the purpose of alias analysis, i.e. when looking for the source of a pointer the alias analysis should just pass-through from the result box to the source box and proceed as usual. The final MayAlias or MustAlias will depend if they arrive at the same source but with different address arithmetic in the middle. Does it sound correct?

I strongly believe MemRead/MemWrite effects are needed on the new operations.

Consider the following example, where the two subroutines are compiled with and w/o packing and FIR inlining happens:

subroutine s1(x) real :: x(:) // not packed x(1) = 1.0 call s2(x) // inlined end subroutine s2(x) real :: x(:) // packed print *, x(1) end

I will express it in this pseudo-MLIR:

s1: %orig_box = ... %orig_base = fir.box_addr %orig_box fir.store 1.0 to %orig_base %packed_box = fir.pack_array %orig_box %packed_base = fir.box_addr %packed_box %val = fir.load %packed_base print %val fir.unpack_array %packed_box to %orig_box

A pass cannot hoist fir.pack_array above the fir.store (or sink fir.store below fir.pack_array), so fir.pack_array needs MemRead effect on the original array. Note that MemAlloc effect is not enough here: if fir.store accesses a memory before fir.pack_array, then the fact that fir.pack_array allocates some new memory does not prevent the reordering.

I believe it is okay if fir.pack_array is reordered with respect of a fir.load %orig_base - the packing itself does not affect the value of the original array. Only the unpacking does.

Reordering of the array accesses that are located inside the pack/unpack region with respect to fir.pack_array is not possible due to SSA dominance.

So it seems MemAlloc + MemRead (on the original array) is enough for fir.pack_array.

An example for fir.unpack_array:

s1: %orig_box = ... %orig_base = fir.box_addr %orig_box %packed_box = fir.pack_array %orig_box ... fir.unpack_array %packed_box to %orig_box load/store through %orig_base

MemFree on fir.unpack_array is not enough to prevent moving a load/store through %orig_base across it, so fir.unpack_array needs MemFree + MemWrite (on the original array).

MemFree should prevent reordering fir.unpack_array with respect to any load/store located between pack/unpack.

Please let me know if it makes sense. I think it is an important point to settle.

vzakhari · 2025-02-22T01:25:54Z

flang/docs/ArrayRepacking.md

+
+The attribute interface will be used during lowering of `fir.pack_array` to generate the predicate and make the packing conditional and safe. This will allow applying repacking correctly in programs compiled with `-fopenmp`, and get the benefits of repacking in the serial parts of those programs.
+
+Similarly, the OpenACC MLIR dialect may provide such an attribute to check if a device copy has been created for an array that is about to be packed, and prevent the repacking for cases like this:


I wonder what you have in mind for dealing with the present clause case - can you please elaborate?

On the descriptor flag, I guess it is not the only option. For example, if we let OpenACC runtime track the "association" between the original and the copy, it may keep this in the bookkeeping tables. But that depends on how you imagine to use this information (hence the question above). For example, we may define the interface such that it provides two methods to generate this:

bool accPredicate = canRepack_to_acc(base_addr, size); bool otherPredicates = ...; if (accPredicate && otherPredicates) { repacked = _FortranARepack(original); reportRepack_to_acc(repacked, original); }

I am also completely unsure what we can do about the second example below (with repacking_in and repacking_out). I think we cannot prevent repacking neither statically nor dynamically, so it will happen. It seems like the only option to fail gracefully is to report a runtime error at the point where the copy array is deallocated while still being present on the device.

So, as Tom said, it looks like we cannot always guarantee safety of the repacking, and can only try our best to reduce the probability of the error and report as many of them as possible (in the compiler and in the runtime).

vzakhari · 2025-02-22T01:28:19Z

flang/docs/ArrayRepacking.md

+
+So the most conservative implementation of the predicate generator may be to always produce `false` value. Flang lowering may attach any number of such attributes to `fir.pack_array` depending on the compilation context and options.
+
+[TBD] define `TempCopyIsSafe` attribute interface so that OpenACC/OpenMP dialects can provide their specific attributes, which can be used to generate static/runtime checks for safety of the temporary copy in particular context.


Right, we can only rely on OpenACC/OpenMP runtime support if the libraries are eventually linked. So we can only generate the extra stuff when OpenACC/OpenMP is enabled consistently for all compilation modules.

I tried to explain that in the Correctness requirements section with When array repacking is enabled, Flang should guarantee correct program behavior when OpenACC/OpenMP features are explicitly enabled during the compilation.

vzakhari · 2025-02-22T01:29:59Z

flang/docs/ArrayRepacking.md

+
+It is unsafe to create temporaries of assumed-shape dummy arrays that have `TARGET` attribute, because they can be accessed during the invocation of the subprograms not using direct reference of the dummy argument. Lowering must never produce the new operations for such dummy arguments. [TBD] a user option might be provided to override the default safe behavior.
+
+The copy creation is also restricted for `ASYNCHRONOUS` and `VOLATILE` arguments. Such dummy arguments might be changed during the execution of their subprogram in an unpredictable manner, so creating a copy for them might be incorrect (Fortran 2023, Note 5 of section 15.5.2.5).


I think Example #2 below is correct without ASYNCHRONOUS/VOLATILE. So there are cases where repacking breaks it. Let me know if I misunderstood the comment, and you were talking about some other case.

tblah · 2025-02-25T10:39:16Z

Overall, I still have lots of uncertainty about our ability to make the repacking reliable for OpenACC/OpenMP. So maybe you will agree, if I change the requirement on this point to something more relaxed, like "the compiler/runtime should do its best to minimize problems caused by repacking and report errors as early as possible, but there is no guarantee that repacking works for all programs that involve multithreading/offload".

Relaxing the requirement is fine by me, especially while this is being actively worked on.

As I understand the correctness problem in a multithreaded context, your concern is that the packing and unpacking can change the order in which memory effects are observed between threads, and unpacking might overwrite memory writes made by another thread.

However, I think the only criteria that is required for correctness is that none of the following exist between the pack and unpack operations:

Memory barriers
Other operations which carry an implicit memory barrier
Atomic operations

I think that without any of these, the ordering of memory accesses between threads is not at all constrained and so any program depending upon them is exhibiting undefined behavior.

For example, consider something like this

integer :: counter = 0
integer :: temporary
!$omp parallel num_threads(100) shared(counter) private(temporary)
  temporary = counter
  temporary = temporary + 1
  counter = temporary
!$omp end parallel

In the absence of any locking, this will give quite varying results depending on exactly when each thread is scheduled. I think your example is similar to this.

vzakhari · 2025-02-27T00:54:29Z

Tom, I think a valid program may process different parts of the same array in different threads (such as in example #2 in the doc) without any synchronization. The packing/unpacking breaks such a correct program just because it reads/modifies the whole array vs only the part that is supposed to be processed by the thread where pack/unpack is executed. And there is no way for packing/unpacking to know which part of the array needs to be copied.

tblah · 2025-02-27T09:50:27Z

I agree they can access the same array in different threads, what I mean is that without synchronization or atomic access, the results of doing so are not well defined.

But if you would prefer to be more conservative with the pass that is fine by me as well.

tblah · 2025-03-03T17:08:59Z

Having thought about this some more today, I now agree with you Slava. Previously I was thinking about the array accesses as concurrent accesses to the same variable from multiple threads. But I see you're right to think of each array element as a different variable because the indices can be organized (as in your example) to ensure that there are no concurrent accesses to the same memory address from different threads.

Thank you for baring with me as I thought this through.

I still think my previous suggestions are useful criteria to check for, although I now see that this is not sufficient on its own. We must not perform array repacking if there are

Memory barriers
Other operations which carry an implicit memory barrier
Atomic operations

Between the pack and unpack operations. I suppose this implies that the packed/repacked array isn't passed as an argument to any function calls that might contain any of the above operations (although as we can't ensure correctness in general, I would be willing to relax the function calls requirement if that gets in the way of optimizing parituclar application or benchmark kernels).

It is okay if you don't include these extra checks in any initial implementation, but if you agree, maybe this should be included in the design document.

vzakhari · 2025-03-06T19:15:43Z

Razvan, Tom, thank you again for the detailed review!

I added a section where I tried to track the ideas that were proposed. Please let me know if I missed something or can improve something.

I am comfortable with the document, and I will proceed to the implementation. I will be refining the document (where I left TBD notes) along the way.

As defined in llvm#127147.

As defined in #127147.

I am working on `-frepack-array` feature (llvm#127147), which produces non-trivial manipulations with arguments of `fir.declare`. In this case, we end up with CFG computation of the `fir.declare` argument, and AddDebugInfo pass incorrectly mapped two dummy arguments to the same arg index in the debug attributes. This patch makes sure that we assign the arg index only if we can prove that we've traced the block argument to the function's entry block. I believe this problem is not specific to `-frepack-arrays`, e.g. it may appear due to MLIR inlining as well.

…fo. (#131672) I am working on `-frepack-array` feature (#127147), which produces non-trivial manipulations with arguments of `fir.declare`. In this case, we end up with CFG computation of the `fir.declare` argument, and AddDebugInfo pass incorrectly mapped two dummy arguments to the same arg index in the debug attributes. This patch makes sure that we assign the arg index only if we can prove that we've traced the block argument to the function's entry block. I believe this problem is not specific to `-frepack-arrays`, e.g. it may appear due to MLIR inlining as well.

`fir.pack_array` is just a pass-through op for the process of finding the source in FIR alias analysis (as defined in llvm#127147).

`fir.pack_array` is just a pass-through op for the process of finding the source in FIR alias analysis (as defined in #127147).

`fir.pack_array` is just a pass-through op for the process of finding the source in FIR alias analysis (as defined in llvm#127147).

vzakhari requested review from razvanlupusoru, tblah, kiranchandramohan, abidh and jeanPerier February 13, 2025 23:37

llvmbot added the flang Flang issues not falling into any other category label Feb 13, 2025

kiranchandramohan requested review from skatrak, agozillon and TIFitis February 16, 2025 10:36

vzakhari added 2 commits February 18, 2025 11:59

Added section about interactions between array repacking and loop ver…

d6c2643

…sioning.

Added reference to index.

eec4dce

Fixed cross-reference.

1ac2246

tblah reviewed Feb 19, 2025

View reviewed changes

razvanlupusoru reviewed Feb 19, 2025

View reviewed changes

vzakhari commented Feb 22, 2025

View reviewed changes

Minor updates after the review discussions.

98ff5e9

Added a section with ideas expressed during the review.

95f97ea

vzakhari merged commit aa38754 into llvm:main Mar 10, 2025
12 checks passed

vzakhari added a commit to vzakhari/llvm-project that referenced this pull request Mar 11, 2025

[flang] Definitions of fir.pack/unpack_array operations.

7792a63

As defined in llvm#127147.

vzakhari mentioned this pull request Mar 11, 2025

[flang] Definitions of fir.pack/unpack_array operations. #130698

Merged

vzakhari added a commit to vzakhari/llvm-project that referenced this pull request Mar 11, 2025

[flang] Definitions of fir.pack/unpack_array operations.

d71f1ff

As defined in llvm#127147.

vzakhari added a commit that referenced this pull request Mar 11, 2025

[flang] Definitions of fir.pack/unpack_array operations. (#130698)

74eba97

As defined in #127147.

vzakhari mentioned this pull request Mar 17, 2025

[flang] Fixed computation of position of function's arg in AddDebugInfo. #131672

Merged

vzakhari added a commit to vzakhari/llvm-project that referenced this pull request Mar 19, 2025

[flang] Support fir.pack_array in FIR alias analysis.

e125dcd

`fir.pack_array` is just a pass-through op for the process of finding the source in FIR alias analysis (as defined in llvm#127147).

vzakhari mentioned this pull request Mar 19, 2025

[flang] Support fir.pack_array in FIR alias analysis. #131946

Merged

vzakhari added a commit that referenced this pull request Apr 21, 2025

[flang] Support fir.pack_array in FIR alias analysis. (#131946)

86a0336

`fir.pack_array` is just a pass-through op for the process of finding the source in FIR alias analysis (as defined in #127147).


		Alias analysis:

		* For the purpose of alias analysis `fir.pack_array` should be considered a pass-though operation, and its value should be treated as `MayAlias` with the original array.


		So the most conservative implementation of the predicate generator may be to always produce `false` value. Flang lowering may attach any number of such attributes to `fir.pack_array` depending on the compilation context and options.

		[TBD] define `TempCopyIsSafe` attribute interface so that OpenACC/OpenMP dialects can provide their specific attributes, which can be used to generate static/runtime checks for safety of the temporary copy in particular context.


		It is unsafe to create temporaries of assumed-shape dummy arrays that have `TARGET` attribute, because they can be accessed during the invocation of the subprograms not using direct reference of the dummy argument. Lowering must never produce the new operations for such dummy arguments. [TBD] a user option might be provided to override the default safe behavior.

		The copy creation is also restricted for `ASYNCHRONOUS` and `VOLATILE` arguments. Such dummy arguments might be changed during the execution of their subprogram in an unpredictable manner, so creating a copy for them might be incorrect (Fortran 2023, Note 5 of section 15.5.2.5).


		The attribute interface will be used during lowering of `fir.pack_array` to generate the predicate and make the packing conditional and safe. This will allow applying repacking correctly in programs compiled with `-fopenmp`, and get the benefits of repacking in the serial parts of those programs.

		Similarly, the OpenACC MLIR dialect may provide such an attribute to check if a device copy has been created for an array that is about to be packed, and prevent the repacking for cases like this:

[RFC][flang] Add support for assumed-shape dummy arrays repacking. #127147

[RFC][flang] Add support for assumed-shape dummy arrays repacking. #127147

Conversation

vzakhari commented Feb 13, 2025

kiranchandramohan commented Feb 16, 2025

vzakhari commented Feb 18, 2025

tblah left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

razvanlupusoru left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vzakhari left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tblah commented Feb 25, 2025 • edited Loading

vzakhari commented Feb 27, 2025

tblah commented Feb 27, 2025

tblah commented Mar 3, 2025 • edited Loading

vzakhari commented Mar 6, 2025

tblah commented Feb 25, 2025 •

edited

Loading

tblah commented Mar 3, 2025 •

edited

Loading