Skip to content

[RFC][flang] Add support for assumed-shape dummy arrays repacking. #127147

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 10, 2025

Conversation

vzakhari
Copy link
Contributor

This is a document describing why and how to add support for repacking
of assumed-shape dummy arrrays to provide more efficient data cache.
It proposes adding new FIR operations and outlines the compiler flow
handling these operations.
I would like to hear feedback on all of it, but especially on:

  • The possibility of detecting safeness of the repacking
    in the context of OpenACC/OpenMP. If it is not possible
    to do the runtime checks to determine safety, then
    there is not need to add the TempCopyIsSafe attributes
    to the instruction.
  • Whether it is possible to preserve the debug information
    in cases where fir.pack_array is sunk after [hl]fir.declare,
    so that before the fir.pack_array a debugger will refer
    to the values in the original array, and after fir.pack_array
    it will refer to the copy.

This is a document describing why and how to add support for repacking
of assumed-shape dummy arrrays to provide more efficient data cache.
It proposes adding new FIR operations and outlines the compiler flow
handling these operations.
I would like to hear feedback on all of it, but especially on:
  * The possibility of detecting safeness of the repacking
    in the context of OpenACC/OpenMP. If it is not possible
    to do the runtime checks to determine safety, then
    there is not need to add the `TempCopyIsSafe` attributes
    to the instruction.
  * Whether it is possible to preserve the debug information
    in cases where `fir.pack_array` is sunk after `[hl]fir.declare`,
    so that before the `fir.pack_array` a debugger will refer
    to the values in the original array, and after `fir.pack_array`
    it will refer to the copy.
@llvmbot llvmbot added the flang Flang issues not falling into any other category label Feb 13, 2025
@kiranchandramohan
Copy link
Contributor

Thanks @vzakhari for the detailed document about repacking arrays.

We implemented a loop versioning pass (https://discourse.llvm.org/t/rfc-loop-versioning-for-unit-stride/68605, https://flang.llvm.org/docs/FlangCommandLineReference.html#cmdoption-flang-fversion-loops-for-stride) that versions the loop for better cache locality, prevent scatter/gathers etc. We get around 36% improvement in the spec2017 roms benchmark due to this pass. Since this pass serves a similar usecase as repack arrays, the pass can be switched OFF when 'repack-arrays' is ON.

BTW, Is the performance data that you collected with the loop versioning pass ON? Or is it purely artificial?

There were some comments about issues with versioning/repacking with the asynchronous attribute and the contiguous attribute. It might be good to add these points to the document. https://discourse.llvm.org/t/transformations-to-aid-optimizer-for-subroutines-functions-with-assumed-shape-arguments/66447

I have added AMD engineers working on offloading to have a look for the offloading questions.

@vzakhari
Copy link
Contributor Author

Thanks @vzakhari for the detailed document about repacking arrays.

Thank you for reading, Kiran!

We implemented a loop versioning pass (https://discourse.llvm.org/t/rfc-loop-versioning-for-unit-stride/68605, https://flang.llvm.org/docs/FlangCommandLineReference.html#cmdoption-flang-fversion-loops-for-stride) that versions the loop for better cache locality, prevent scatter/gathers etc. We get around 36% improvement in the spec2017 roms benchmark due to this pass. Since this pass serves a similar usecase as repack arrays, the pass can be switched OFF when 'repack-arrays' is ON.

I think we may get better results with both array repacking and the loop versioning. I added a section about this into the document.

BTW, Is the performance data that you collected with the loop versioning pass ON? Or is it purely artificial?

I collected the performance numbers using gfortran, so they are not artificial :) Flang does not have array repacking right now, but the manual REPACKING version shows the same speed-up as gfortran's array repacking. The times are the same regardless of the loop versioning. I believe in capacita case none of the compilers vectorizes the hot loop, because of unknown data dependencies, so all the speed-up from the array repacking is coming from better data cache utilization.

There were some comments about issues with versioning/repacking with the asynchronous attribute and the contiguous attribute. It might be good to add these points to the document. https://discourse.llvm.org/t/transformations-to-aid-optimizer-for-subroutines-functions-with-assumed-shape-arguments/66447

Thanks for the link! The document covers ASYNCHRONOUS and VOLATILE cases, and it also mentions that the array repacking should be applied to the dummy arguments that are CONTIGUOUS.

I have added AMD engineers working on offloading to have a look for the offloading questions.

Thanks!

Copy link
Contributor

@tblah tblah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all of your work so far on this proposal. I like the design and think it will be a lot easier to extend than the existing loop versioning pass.

You raise a very interesting point with the visibility of side effects to array variables accessed via multiple threads. I agree making this change could change the result of such programs, but I wonder if the ordering of the visibility of different array accesses between threads can be depended on by conformant programs in the absence of memory barriers, ASYNCHRONOUS, or VOLATILE. You are right that we cannot know inside of a function whether threading is enabled, but if the visibility of memory accesses is only guaranteed by barriers, we can inspect and see whether any memory barriers are present inside of this function (doing this might need new dialect interfaces).

Other than shared variables, another possibility in openmp is that array variables are privatized. Then a copy is made of the source variable already. This is another case to be considered in the optimization passes you mentioned.

* `innermost` - tells that the repacking has to be done iff the array is not contiguous in the innermost dimension. This also describes what type of continuity can be expected from `%new_var`, i.e. `innermost` means that the resulting array is definitely contiguous in the innermost dimension, but may be non-contiguous in other dimensions (unless additional analysis proves otherwise). For 1-D arrays, `innermost` attribute is not valid.
* `no_copy` - indicates that, in case a temporary array is created, `%var` to `%new_var` copy is not required (`intent(out)` dummy argument case).
* `heuristics`
* `loop-only` - `fir.pack_array` can be optimized away, if the array is not used in a loop.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would this not be true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A case when an array is not used in a loop but may still benefit from repacking is when an assumed-shape dummy array is passed to another subprogram as an assumed-shape dummy argument. It may be profitable to do the repacking early in some cases when HLFIR/FIR inlining is not enabled (for whatever reason), e.g.:

subroutine s1(x)
  real :: x(:)
  call s2(x)
end

subroutine s2(x)
  real :: x(:)
  do i=1,100
    call s3(x)
  end
end

subroutine s3(x)
  real :: x(:)
  do i=1,1000
    <use array x>
  end
end

Postponing the dynamic repacking to subroutine s3 is suboptimal, because the repacking will be done 100 times. Repacking the array once in subroutine s1 is better.

With HLFIR/FIR inlining and further fir.pack_array optimizations this case should work well even with loop-only heuristic, because the optimization passes should figure out that the array is eventually used in a loop, and fir.pack_array's in subroutines s2 and s3 are redundant.

I would like to keep the flexibility for the users to pick different strategies for repacking depending on their needs and their ability to use inlining and other optimizations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. Thank you for the detailed explanation!

Comment on lines +202 to +204
* `max-size` - constant integer attribute specifying the maximum byte size of an array that is eligible for repacking.
* `max-element-size` - constant integer attribute specifying the maximum byte element-size of an array that is eligible for repacking.
* `min-stride` - constant integer attribute specifying the minimum byte stride of the innermost dimension of an array that is eligible for repacking.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you anticipate setting these differently for different pack/unpack operations? I would have thought these could be expressed more simply as a pass parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would like to provide the flexibility to choose different repacking settings for different compilation modules. Eventually, when cross-module HLFIR/FIR inlining is working, the operations from different modules may have different settings.

* If there is no `no_copy`:
* `MemRead` effect on unknown value to indicate potential read from the original array.
* [TBD] we can relax that by having an additional argument taking `fir.box_addr %var` value, though, this adds some redundancy to the argument list.
* `MemWrite` effect on unknown value to indicate potential write into the temporary array.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need memory effects to model that this operation modifies the SSA value that it creates.

I understand that it might be creating a descriptor and the write is actually in the allocated memory, but I think that still counts as initializing the descriptor in this case. For example, the initialization of the allocated memory is dead if the ssa value for the descriptor is dead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the alloc effect is needed. The most conservative way to lower a pair of pack_array/unpack_array that use the stack allocation is to introduce stacksave/stackrestore for them. If there is no alloc effect, then two pack_array ops may potentially be reordered by some pass like this:

%1 = fir.pack_array
%0 = fir.pack_array
fir.unpack_array %1
fir.unpack_array %0

This will result in incorrect stackrestore behavior.

Moreover, see comments below on the passes that may move operations around.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. Thank you for explaining.


Alias analysis:

* For the purpose of alias analysis `fir.pack_array` should be considered a pass-though operation, and its value should be treated as `MayAlias` with the original array.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is too pessimistic. I think this can be modeled as a new allocation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Either the new array is a MustAlias (because the memory modifications to the copy were intended to be visible modifications to original array) or it is a NoAlias because we decide to model it as a copy. Because it is possible no copy is created, it seems that a MustAlias is probably the better choice. In which case we should not need the MemRead/MemWrite effects on the pack operation itself (we may still want some other resource though to capture the potential allocation though)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think it might be clearer here to just say that fir.pack_array should be considered a pass-through operation for the purpose of alias analysis, i.e. when looking for the source of a pointer the alias analysis should just pass-through from the result box to the source box and proceed as usual. The final MayAlias or MustAlias will depend if they arrive at the same source but with different address arithmetic in the middle. Does it sound correct?

I strongly believe MemRead/MemWrite effects are needed on the new operations.

Consider the following example, where the two subroutines are compiled with and w/o packing and FIR inlining happens:

subroutine s1(x)
  real :: x(:) // not packed
  x(1) = 1.0
  call s2(x) // inlined
end
subroutine s2(x)
  real :: x(:) // packed
  print *, x(1)
end

I will express it in this pseudo-MLIR:

s1:
  %orig_box = ...
  %orig_base = fir.box_addr %orig_box
  fir.store 1.0 to %orig_base
  %packed_box = fir.pack_array %orig_box
  %packed_base = fir.box_addr %packed_box
  %val = fir.load %packed_base
  print %val
  fir.unpack_array %packed_box to %orig_box

A pass cannot hoist fir.pack_array above the fir.store (or sink fir.store below fir.pack_array), so fir.pack_array needs MemRead effect on the original array. Note that MemAlloc effect is not enough here: if fir.store accesses a memory before fir.pack_array, then the fact that fir.pack_array allocates some new memory does not prevent the reordering.

I believe it is okay if fir.pack_array is reordered with respect of a fir.load %orig_base - the packing itself does not affect the value of the original array. Only the unpacking does.

Reordering of the array accesses that are located inside the pack/unpack region with respect to fir.pack_array is not possible due to SSA dominance.

So it seems MemAlloc + MemRead (on the original array) is enough for fir.pack_array.

An example for fir.unpack_array:

s1:
  %orig_box = ...
  %orig_base = fir.box_addr %orig_box
  %packed_box = fir.pack_array %orig_box
  ...
  fir.unpack_array %packed_box to %orig_box
  load/store through %orig_base

MemFree on fir.unpack_array is not enough to prevent moving a load/store through %orig_base across it, so fir.unpack_array needs MemFree + MemWrite (on the original array).

MemFree should prevent reordering fir.unpack_array with respect to any load/store located between pack/unpack.

Please let me know if it makes sense. I think it is an important point to settle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay yes. I think I got confused because I took the idea of the aliasing too literally. It doesn't literally possibly alias because it a different array. But I agree that

  • fir.pack_array allocates, and it reads from the old array (which has to be a generic effect because of the indirection through a box)
  • fir.unpack_array deallocates, and writes to the original array (generic effect not just the box ssa value). I think it probably also should have MemRead because the read is from the box address not the box ssa value itself.

Thank you for taking the time to write that out in psudo-MLIR. It made it much easier to think about.


So the most conservative implementation of the predicate generator may be to always produce `false` value. Flang lowering may attach any number of such attributes to `fir.pack_array` depending on the compilation context and options.

[TBD] define `TempCopyIsSafe` attribute interface so that OpenACC/OpenMP dialects can provide their specific attributes, which can be used to generate static/runtime checks for safety of the temporary copy in particular context.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you pointed out earlier, what do we do about cases where the runtime is not going to be linked (or at least we don't know that it will be linked from the context of compiling this translation unit)? I guess this should just be best effort?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we can only rely on OpenACC/OpenMP runtime support if the libraries are eventually linked. So we can only generate the extra stuff when OpenACC/OpenMP is enabled consistently for all compilation modules.

I tried to explain that in the Correctness requirements section with When array repacking is enabled, Flang should guarantee correct program behavior when OpenACC/OpenMP features are explicitly enabled during the compilation.


It is unsafe to create temporaries of assumed-shape dummy arrays that have `TARGET` attribute, because they can be accessed during the invocation of the subprograms not using direct reference of the dummy argument. Lowering must never produce the new operations for such dummy arguments. [TBD] a user option might be provided to override the default safe behavior.

The copy creation is also restricted for `ASYNCHRONOUS` and `VOLATILE` arguments. Such dummy arguments might be changed during the execution of their subprogram in an unpredictable manner, so creating a copy for them might be incorrect (Fortran 2023, Note 5 of section 15.5.2.5).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if openmp/acc programs that are broken by adding array packing/unpacking are really correct without using these attributes. Particularly in the case where the multithreaded element is in a different translation unit (or even subroutine/function).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Example #2 below is correct without ASYNCHRONOUS/VOLATILE. So there are cases where repacking breaks it. Let me know if I misunderstood the comment, and you were talking about some other case.

Copy link
Contributor

@razvanlupusoru razvanlupusoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very well thought out proposal! Thank you :)


Alias analysis:

* For the purpose of alias analysis `fir.pack_array` should be considered a pass-though operation, and its value should be treated as `MayAlias` with the original array.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Either the new array is a MustAlias (because the memory modifications to the copy were intended to be visible modifications to original array) or it is a NoAlias because we decide to model it as a copy. Because it is possible no copy is created, it seems that a MustAlias is probably the better choice. In which case we should not need the MemRead/MemWrite effects on the pack operation itself (we may still want some other resource though to capture the potential allocation though)


The attribute interface will be used during lowering of `fir.pack_array` to generate the predicate and make the packing conditional and safe. This will allow applying repacking correctly in programs compiled with `-fopenmp`, and get the benefits of repacking in the serial parts of those programs.

Similarly, the OpenACC MLIR dialect may provide such an attribute to check if a device copy has been created for an array that is about to be packed, and prevent the repacking for cases like this:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of thoughts on this:

  • I agree about having the option to have a programming model guard to prevent packing/repacking.
  • For OpenACC, I do not want to avoid repacking. I believe for OpenACC I would like the following:
  • A flag in the array descriptor noting it is a "repacked" view - this way data mapping can effectively deal with such a case when present clause is involved.
  • API in MappableType interface which allows a dialect to provide information whether we have a re-layout copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what you have in mind for dealing with the present clause case - can you please elaborate?

On the descriptor flag, I guess it is not the only option. For example, if we let OpenACC runtime track the "association" between the original and the copy, it may keep this in the bookkeeping tables. But that depends on how you imagine to use this information (hence the question above). For example, we may define the interface such that it provides two methods to generate this:

 bool accPredicate = canRepack_to_acc(base_addr, size);
 bool otherPredicates = ...;
  if (accPredicate && otherPredicates) {
    repacked = _FortranARepack(original);
    reportRepack_to_acc(repacked, original);
  }

I am also completely unsure what we can do about the second example below (with repacking_in and repacking_out). I think we cannot prevent repacking neither statically nor dynamically, so it will happen. It seems like the only option to fail gracefully is to report a runtime error at the point where the copy array is deallocated while still being present on the device.

So, as Tom said, it looks like we cannot always guarantee safety of the repacking, and can only try our best to reduce the probability of the error and report as many of them as possible (in the compiler and in the runtime).

Copy link
Contributor Author

@vzakhari vzakhari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tom, Razvan, thank you a lot for reviewing this! I had to think a lot on my replies, so please excuse me for the delayed response :)

Overall, I still have lots of uncertainty about our ability to make the repacking reliable for OpenACC/OpenMP. So maybe you will agree, if I change the requirement on this point to something more relaxed, like "the compiler/runtime should do its best to minimize problems caused by repacking and report errors as early as possible, but there is no guarantee that repacking works for all programs that involve multithreading/offload".

* `innermost` - tells that the repacking has to be done iff the array is not contiguous in the innermost dimension. This also describes what type of continuity can be expected from `%new_var`, i.e. `innermost` means that the resulting array is definitely contiguous in the innermost dimension, but may be non-contiguous in other dimensions (unless additional analysis proves otherwise). For 1-D arrays, `innermost` attribute is not valid.
* `no_copy` - indicates that, in case a temporary array is created, `%var` to `%new_var` copy is not required (`intent(out)` dummy argument case).
* `heuristics`
* `loop-only` - `fir.pack_array` can be optimized away, if the array is not used in a loop.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A case when an array is not used in a loop but may still benefit from repacking is when an assumed-shape dummy array is passed to another subprogram as an assumed-shape dummy argument. It may be profitable to do the repacking early in some cases when HLFIR/FIR inlining is not enabled (for whatever reason), e.g.:

subroutine s1(x)
  real :: x(:)
  call s2(x)
end

subroutine s2(x)
  real :: x(:)
  do i=1,100
    call s3(x)
  end
end

subroutine s3(x)
  real :: x(:)
  do i=1,1000
    <use array x>
  end
end

Postponing the dynamic repacking to subroutine s3 is suboptimal, because the repacking will be done 100 times. Repacking the array once in subroutine s1 is better.

With HLFIR/FIR inlining and further fir.pack_array optimizations this case should work well even with loop-only heuristic, because the optimization passes should figure out that the array is eventually used in a loop, and fir.pack_array's in subroutines s2 and s3 are redundant.

I would like to keep the flexibility for the users to pick different strategies for repacking depending on their needs and their ability to use inlining and other optimizations.

Comment on lines +202 to +204
* `max-size` - constant integer attribute specifying the maximum byte size of an array that is eligible for repacking.
* `max-element-size` - constant integer attribute specifying the maximum byte element-size of an array that is eligible for repacking.
* `min-stride` - constant integer attribute specifying the minimum byte stride of the innermost dimension of an array that is eligible for repacking.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would like to provide the flexibility to choose different repacking settings for different compilation modules. Eventually, when cross-module HLFIR/FIR inlining is working, the operations from different modules may have different settings.

* If there is no `no_copy`:
* `MemRead` effect on unknown value to indicate potential read from the original array.
* [TBD] we can relax that by having an additional argument taking `fir.box_addr %var` value, though, this adds some redundancy to the argument list.
* `MemWrite` effect on unknown value to indicate potential write into the temporary array.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the alloc effect is needed. The most conservative way to lower a pair of pack_array/unpack_array that use the stack allocation is to introduce stacksave/stackrestore for them. If there is no alloc effect, then two pack_array ops may potentially be reordered by some pass like this:

%1 = fir.pack_array
%0 = fir.pack_array
fir.unpack_array %1
fir.unpack_array %0

This will result in incorrect stackrestore behavior.

Moreover, see comments below on the passes that may move operations around.


Alias analysis:

* For the purpose of alias analysis `fir.pack_array` should be considered a pass-though operation, and its value should be treated as `MayAlias` with the original array.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think it might be clearer here to just say that fir.pack_array should be considered a pass-through operation for the purpose of alias analysis, i.e. when looking for the source of a pointer the alias analysis should just pass-through from the result box to the source box and proceed as usual. The final MayAlias or MustAlias will depend if they arrive at the same source but with different address arithmetic in the middle. Does it sound correct?

I strongly believe MemRead/MemWrite effects are needed on the new operations.

Consider the following example, where the two subroutines are compiled with and w/o packing and FIR inlining happens:

subroutine s1(x)
  real :: x(:) // not packed
  x(1) = 1.0
  call s2(x) // inlined
end
subroutine s2(x)
  real :: x(:) // packed
  print *, x(1)
end

I will express it in this pseudo-MLIR:

s1:
  %orig_box = ...
  %orig_base = fir.box_addr %orig_box
  fir.store 1.0 to %orig_base
  %packed_box = fir.pack_array %orig_box
  %packed_base = fir.box_addr %packed_box
  %val = fir.load %packed_base
  print %val
  fir.unpack_array %packed_box to %orig_box

A pass cannot hoist fir.pack_array above the fir.store (or sink fir.store below fir.pack_array), so fir.pack_array needs MemRead effect on the original array. Note that MemAlloc effect is not enough here: if fir.store accesses a memory before fir.pack_array, then the fact that fir.pack_array allocates some new memory does not prevent the reordering.

I believe it is okay if fir.pack_array is reordered with respect of a fir.load %orig_base - the packing itself does not affect the value of the original array. Only the unpacking does.

Reordering of the array accesses that are located inside the pack/unpack region with respect to fir.pack_array is not possible due to SSA dominance.

So it seems MemAlloc + MemRead (on the original array) is enough for fir.pack_array.

An example for fir.unpack_array:

s1:
  %orig_box = ...
  %orig_base = fir.box_addr %orig_box
  %packed_box = fir.pack_array %orig_box
  ...
  fir.unpack_array %packed_box to %orig_box
  load/store through %orig_base

MemFree on fir.unpack_array is not enough to prevent moving a load/store through %orig_base across it, so fir.unpack_array needs MemFree + MemWrite (on the original array).

MemFree should prevent reordering fir.unpack_array with respect to any load/store located between pack/unpack.

Please let me know if it makes sense. I think it is an important point to settle.


The attribute interface will be used during lowering of `fir.pack_array` to generate the predicate and make the packing conditional and safe. This will allow applying repacking correctly in programs compiled with `-fopenmp`, and get the benefits of repacking in the serial parts of those programs.

Similarly, the OpenACC MLIR dialect may provide such an attribute to check if a device copy has been created for an array that is about to be packed, and prevent the repacking for cases like this:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what you have in mind for dealing with the present clause case - can you please elaborate?

On the descriptor flag, I guess it is not the only option. For example, if we let OpenACC runtime track the "association" between the original and the copy, it may keep this in the bookkeeping tables. But that depends on how you imagine to use this information (hence the question above). For example, we may define the interface such that it provides two methods to generate this:

 bool accPredicate = canRepack_to_acc(base_addr, size);
 bool otherPredicates = ...;
  if (accPredicate && otherPredicates) {
    repacked = _FortranARepack(original);
    reportRepack_to_acc(repacked, original);
  }

I am also completely unsure what we can do about the second example below (with repacking_in and repacking_out). I think we cannot prevent repacking neither statically nor dynamically, so it will happen. It seems like the only option to fail gracefully is to report a runtime error at the point where the copy array is deallocated while still being present on the device.

So, as Tom said, it looks like we cannot always guarantee safety of the repacking, and can only try our best to reduce the probability of the error and report as many of them as possible (in the compiler and in the runtime).


So the most conservative implementation of the predicate generator may be to always produce `false` value. Flang lowering may attach any number of such attributes to `fir.pack_array` depending on the compilation context and options.

[TBD] define `TempCopyIsSafe` attribute interface so that OpenACC/OpenMP dialects can provide their specific attributes, which can be used to generate static/runtime checks for safety of the temporary copy in particular context.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we can only rely on OpenACC/OpenMP runtime support if the libraries are eventually linked. So we can only generate the extra stuff when OpenACC/OpenMP is enabled consistently for all compilation modules.

I tried to explain that in the Correctness requirements section with When array repacking is enabled, Flang should guarantee correct program behavior when OpenACC/OpenMP features are explicitly enabled during the compilation.


It is unsafe to create temporaries of assumed-shape dummy arrays that have `TARGET` attribute, because they can be accessed during the invocation of the subprograms not using direct reference of the dummy argument. Lowering must never produce the new operations for such dummy arguments. [TBD] a user option might be provided to override the default safe behavior.

The copy creation is also restricted for `ASYNCHRONOUS` and `VOLATILE` arguments. Such dummy arguments might be changed during the execution of their subprogram in an unpredictable manner, so creating a copy for them might be incorrect (Fortran 2023, Note 5 of section 15.5.2.5).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Example #2 below is correct without ASYNCHRONOUS/VOLATILE. So there are cases where repacking breaks it. Let me know if I misunderstood the comment, and you were talking about some other case.

@tblah
Copy link
Contributor

tblah commented Feb 25, 2025

Overall, I still have lots of uncertainty about our ability to make the repacking reliable for OpenACC/OpenMP. So maybe you will agree, if I change the requirement on this point to something more relaxed, like "the compiler/runtime should do its best to minimize problems caused by repacking and report errors as early as possible, but there is no guarantee that repacking works for all programs that involve multithreading/offload".

Relaxing the requirement is fine by me, especially while this is being actively worked on.

As I understand the correctness problem in a multithreaded context, your concern is that the packing and unpacking can change the order in which memory effects are observed between threads, and unpacking might overwrite memory writes made by another thread.

However, I think the only criteria that is required for correctness is that none of the following exist between the pack and unpack operations:

  • Memory barriers
  • Other operations which carry an implicit memory barrier
  • Atomic operations

I think that without any of these, the ordering of memory accesses between threads is not at all constrained and so any program depending upon them is exhibiting undefined behavior.

For example, consider something like this

integer :: counter = 0
integer :: temporary
!$omp parallel num_threads(100) shared(counter) private(temporary)
  temporary = counter
  temporary = temporary + 1
  counter = temporary
!$omp end parallel

In the absence of any locking, this will give quite varying results depending on exactly when each thread is scheduled. I think your example is similar to this.

@vzakhari
Copy link
Contributor Author

Tom, I think a valid program may process different parts of the same array in different threads (such as in example #2 in the doc) without any synchronization. The packing/unpacking breaks such a correct program just because it reads/modifies the whole array vs only the part that is supposed to be processed by the thread where pack/unpack is executed. And there is no way for packing/unpacking to know which part of the array needs to be copied.

@tblah
Copy link
Contributor

tblah commented Feb 27, 2025

I agree they can access the same array in different threads, what I mean is that without synchronization or atomic access, the results of doing so are not well defined.

But if you would prefer to be more conservative with the pass that is fine by me as well.

@tblah
Copy link
Contributor

tblah commented Mar 3, 2025

Having thought about this some more today, I now agree with you Slava. Previously I was thinking about the array accesses as concurrent accesses to the same variable from multiple threads. But I see you're right to think of each array element as a different variable because the indices can be organized (as in your example) to ensure that there are no concurrent accesses to the same memory address from different threads.

Thank you for baring with me as I thought this through.

I still think my previous suggestions are useful criteria to check for, although I now see that this is not sufficient on its own. We must not perform array repacking if there are

  • Memory barriers
  • Other operations which carry an implicit memory barrier
  • Atomic operations

Between the pack and unpack operations. I suppose this implies that the packed/repacked array isn't passed as an argument to any function calls that might contain any of the above operations (although as we can't ensure correctness in general, I would be willing to relax the function calls requirement if that gets in the way of optimizing parituclar application or benchmark kernels).

It is okay if you don't include these extra checks in any initial implementation, but if you agree, maybe this should be included in the design document.

@vzakhari
Copy link
Contributor Author

vzakhari commented Mar 6, 2025

Razvan, Tom, thank you again for the detailed review!

I added a section where I tried to track the ideas that were proposed. Please let me know if I missed something or can improve something.

I am comfortable with the document, and I will proceed to the implementation. I will be refining the document (where I left TBD notes) along the way.

@vzakhari vzakhari merged commit aa38754 into llvm:main Mar 10, 2025
12 checks passed
vzakhari added a commit to vzakhari/llvm-project that referenced this pull request Mar 11, 2025
vzakhari added a commit to vzakhari/llvm-project that referenced this pull request Mar 11, 2025
vzakhari added a commit that referenced this pull request Mar 11, 2025
vzakhari added a commit to vzakhari/llvm-project that referenced this pull request Mar 17, 2025
I am working on `-frepack-array` feature (llvm#127147), which produces
non-trivial manipulations with arguments of `fir.declare`.
In this case, we end up with CFG computation of the `fir.declare`
argument, and AddDebugInfo pass incorrectly mapped two dummy arguments
to the same arg index in the debug attributes.
This patch makes sure that we assign the arg index only if we can prove
that we've traced the block argument to the function's entry block.
I believe this problem is not specific to `-frepack-arrays`, e.g.
it may appear due to MLIR inlining as well.
vzakhari added a commit that referenced this pull request Mar 18, 2025
…fo. (#131672)

I am working on `-frepack-array` feature (#127147), which produces
non-trivial manipulations with arguments of `fir.declare`.
In this case, we end up with CFG computation of the `fir.declare`
argument, and AddDebugInfo pass incorrectly mapped two dummy arguments
to the same arg index in the debug attributes.
This patch makes sure that we assign the arg index only if we can prove
that we've traced the block argument to the function's entry block.
I believe this problem is not specific to `-frepack-arrays`, e.g.
it may appear due to MLIR inlining as well.
vzakhari added a commit to vzakhari/llvm-project that referenced this pull request Mar 19, 2025
`fir.pack_array` is just a pass-through op for the process
of finding the source in FIR alias analysis (as defined in llvm#127147).
vzakhari added a commit that referenced this pull request Apr 21, 2025
`fir.pack_array` is just a pass-through op for the process
of finding the source in FIR alias analysis (as defined in #127147).
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
`fir.pack_array` is just a pass-through op for the process
of finding the source in FIR alias analysis (as defined in llvm#127147).
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
`fir.pack_array` is just a pass-through op for the process
of finding the source in FIR alias analysis (as defined in llvm#127147).
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
`fir.pack_array` is just a pass-through op for the process
of finding the source in FIR alias analysis (as defined in llvm#127147).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flang Flang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants