doc: update pytorch-on-xla-devices and troubleshoot doc for tensor synchronization issue #9258

Open: wants to merge 8 commits into master

Conversation

aws-yyjau
Contributor

@aws-yyjau aws-yyjau commented May 28, 2025

Notes

  1. Following up on test: add `test_xla_graph_execution` to test flags (`_set_allow_execution` with `PT_XLA_DEBUG_LEVEL`) #9171, we'd like to update the documentation so that this feature can be used properly by Neuron (and other XLA) customers.

* The tensor value is being accessed during tracing (``tensor[0]``)
* The resulting graph becomes fixed based on the tensor value available during tracing
* Developers might incorrectly assume the condition will be evaluated dynamically during inference
* The solution for the code above is to use the debugging flags below to catch the issue and then modify the code
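To make the second bullet concrete, here is a minimal plain-Python sketch of why the branch gets baked in at trace time. `build_graph` is a hypothetical stand-in for the lazy-tensor tracing machinery, not a torch_xla API:

```python
# Minimal stand-in for lazy tracing: ops are recorded into a "graph",
# but any Python-level branch is resolved with the value seen at trace time.

def build_graph(example_input):
    ops = []
    # This `if` runs during tracing, so the branch taken is decided by
    # `example_input`, not by the inputs seen later at run time.
    if example_input > 0:
        ops.append(lambda v: v * 2)   # "positive" branch is baked in
    else:
        ops.append(lambda v: v - 1)

    def run(v):
        for op in ops:
            v = op(v)
        return v

    return run

graph = build_graph(1)   # traced with a positive value
print(graph(5))          # 10 -- the expected branch
print(graph(-5))         # -10 -- still the "positive" branch: the
                         # condition was NOT re-evaluated at run time
```

This is the same failure mode as `tensor[0]` above: the condition is evaluated once, during tracing, and the resulting graph replays that choice for every later input.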
Collaborator
It would be good to include an example of how the code above can be changed to prevent tensor synchronization.

Contributor Author

Added in the new revision. Thanks!


* Use ``PT_XLA_DEBUG_LEVEL=2`` during initial development to identify potential synchronization points
* Apply ``_set_allow_execution(False)`` when you want to ensure no tensor synchronization occurs during tracing
* When seeing warnings or errors related the tensor synchronization, look into the code path and make appropriate changes
Collaborator

When you see warnings or errors related to tensor synchronization, look into the code path and make appropriate changes.

Contributor Author

updating. Thanks!
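For reference, the `PT_XLA_DEBUG_LEVEL` knob from the list above is set as an environment variable on the training command; `train.py` here is a hypothetical entry point:

```shell
# Level-2 debug output reports compilation/execution points during the
# run, making unintended tensor synchronization easier to spot.
PT_XLA_DEBUG_LEVEL=2 python train.py
```

`_set_allow_execution(False)` is applied in code instead, turning any synchronization during tracing into a hard error rather than a warning.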


* Use ``PT_XLA_DEBUG_LEVEL=2`` during initial development to identify potential synchronization points
* Apply ``_set_allow_execution(False)`` when you want to ensure no tensor synchronization occurs during tracing
* When seeing warnings or errors related the tensor synchronization, look into the code path and make appropriate changes
Collaborator

Can we provide examples of "appropriate changes"?

Contributor Author

Provided in the new revision. Thanks!

Some common causes of Compilation/Executation are 1. User manually call
`torch_xla.sync()`. 2. [Parallel
Some common causes of Compilation/Executation are
1. User manually call
Collaborator

  1. User manually calls

Contributor Author

thanks for improving the readability.

[profiler StepTrace
region](https://github.com/pytorch/xla/blob/fe4af0080af07f78ca2b614dd91b71885a3bbbb8/torch_xla/debug/profiler.py#L165-L171).
4. Dynamo decide to compile/execute the graph. 5. User trying to
4. Dynamo decide to compile/execute the graph.
Collaborator

  1. Dynamo decides to compile....

Contributor Author

thanks for improving the readability.

[profiler StepTrace
region](https://github.com/pytorch/xla/blob/fe4af0080af07f78ca2b614dd91b71885a3bbbb8/torch_xla/debug/profiler.py#L165-L171).
4. Dynamo decide to compile/execute the graph. 5. User trying to
4. Dynamo decide to compile/execute the graph.
5. User trying to
Collaborator

  1. User tries to access (often due to logging) the value of a tensor before calling torch_xla.sync()

Contributor Author

thanks for improving the readability.

access(often due to logging) the value of a tensor before the
`torch_xla.sync()`.
6. User trying to access tensor value before mark_step. See [PyTorch on XLA Devices](https://github.com/pytorch/xla/blob/master/docs/source/learn/pytorch-on-xla-devices.md) for more details.
Collaborator

  1. User tries to access a tensor value before calling mark_step. See PyTorch on XLA Devices for more details.

Contributor Author

thanks for improving the readability.

@@ -137,15 +137,20 @@ Execution Analysis: ------------------------------------------------------------
Execution Analysis: ================================================================================
```

Some common causes of Compilation/Executation are 1. User manually call
`torch_xla.sync()`. 2. [Parallel
Some common causes of Compilation/Executation are
Collaborator

Some common causes of compilation/execution are ...

Contributor Author

thanks for improving the readability.

access(often due to logging) the value of a tensor before the
`torch_xla.sync()`.
6. User trying to access tensor value before mark_step. See [PyTorch on XLA Devices](https://github.com/pytorch/xla/blob/master/docs/source/learn/pytorch-on-xla-devices.md) for more details.

The execution caused by 1-4 are expected, and we want to avoid 5 by
Collaborator

The op executions caused by items 1-4 are expected, and we want to avoid item 5 by
either reducing the frequency of accessing tensor values or manually adding a call to
torch_xla.sync() before accessing them.

Contributor Author

thanks for improving the readability.
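The suggestion above (read tensor values less often, or call `torch_xla.sync()` before reading them) can be illustrated without an XLA device. `LazyValue` below is a hypothetical stand-in for a lazy tensor, not a torch_xla API:

```python
# Hypothetical stand-in for a lazy tensor: reading the value before sync()
# forces an early graph execution; reading it after sync() is free.
class LazyValue:
    def __init__(self, compute):
        self._compute = compute
        self._result = None
        self.forced_early = False

    def sync(self):
        # Corresponds to torch_xla.sync(): execute the pending graph.
        if self._result is None:
            self._result = self._compute()

    def value(self):
        if self._result is None:
            # Accessed before sync: an extra execution is triggered mid-step.
            self.forced_early = True
            self.sync()
        return self._result

# Anti-pattern: reading the value mid-step (e.g. for logging) forces
# an early execution.
early = LazyValue(lambda: 0.25)
print(early.value(), early.forced_early)   # 0.25 True

# Preferred: sync at the step boundary first, then read freely.
late = LazyValue(lambda: 0.25)
late.sync()
print(late.value(), late.forced_early)     # 0.25 False
```

The same ordering applies to real training loops: log losses after the step-boundary sync rather than inside the traced region.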


aws-yyjau commented May 29, 2025

Hi @mikegre-google ,

I updated the docs based on your comments. Let me know if it looks good. Thanks for your review!

@aws-yyjau (Contributor Author)

Hi @tengyifei

Thanks for your comments! I updated the doc based on that. Please let me know if that addresses your comments.

@tengyifei tengyifei enabled auto-merge (squash) June 2, 2025 02:48
@qihqi qihqi requested a review from mikegre-google June 2, 2025 03:31
[profiler StepTrace
region](https://github.com/pytorch/xla/blob/fe4af0080af07f78ca2b614dd91b71885a3bbbb8/torch_xla/debug/profiler.py#L165-L171).
4. Dynamo decide to compile/execute the graph. 5. User trying to
4. Dynamo decides to compile/execute the graph.
5. User tries to
access(often due to logging) the value of a tensor before the
Collaborator

Space needed after "access"

Contributor Author

updated. Thanks.

access(often due to logging) the value of a tensor before the
`torch_xla.sync()`.
6. User tries to a tensor value before calling `mark_step`. See [PyTorch on XLA Devices](https://github.com/pytorch/xla/blob/master/docs/source/learn/pytorch-on-xla-devices.md) for more details.
Collaborator

User tries to access a tensor value ...?

Contributor Author

updated. Thanks.

auto-merge was automatically disabled June 2, 2025 18:40

Head branch was pushed to by a user without write access

@qihqi qihqi requested a review from mikegre-google June 3, 2025 03:05
3 participants