doc: update pytorch-on-xla-devices and troubleshoot doc for tensor synchronization issue #9258

Open: wants to merge 8 commits into master

Conversation

aws-yyjau
Contributor

@aws-yyjau aws-yyjau commented May 28, 2025

Notes

  1. Following up on test: add `test_xla_graph_execution` to test flags (`_set_allow_execution` with `PT_XLA_DEBUG_LEVEL`) #9171, we'd like to update the documentation so that this feature can be used properly by Neuron (and other XLA) customers.

* The tensor value is being accessed during tracing (``tensor[0]``)
* The resulting graph becomes fixed based on the tensor value available during tracing
* Developers might incorrectly assume the condition will be evaluated dynamically during inference
* The solution for the code above is to use the debugging flags below to catch the issue and then modify the code
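To make the second bullet concrete, here is a minimal plain-Python sketch of why the branch gets baked in at trace time. `build_graph` is a hypothetical stand-in for the lazy-tensor tracing machinery, not a torch_xla API:

```python
# Minimal stand-in for lazy tracing: ops are recorded into a "graph",
# but any Python-level branch is resolved with the value seen at trace time.

def build_graph(example_input):
    ops = []
    # This `if` runs during tracing, so the branch taken is decided by
    # `example_input`, not by the inputs seen later at run time.
    if example_input > 0:
        ops.append(lambda v: v * 2)   # "positive" branch is baked in
    else:
        ops.append(lambda v: v - 1)

    def run(v):
        for op in ops:
            v = op(v)
        return v

    return run

graph = build_graph(1)   # traced with a positive value
print(graph(5))          # 10 -- the expected branch
print(graph(-5))         # -10 -- still the "positive" branch: the
                         # condition was NOT re-evaluated at run time
```

This is the same failure mode as `tensor[0]` above: the condition is evaluated once, during tracing, and the resulting graph replays that choice for every later input.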
Collaborator
It would be good to include an example of how the code above can be changed to prevent tensor synchronization.

Contributor Author

Added in the new revision. Thanks!


* Use ``PT_XLA_DEBUG_LEVEL=2`` during initial development to identify potential synchronization points
* Apply ``_set_allow_execution(False)`` when you want to ensure no tensor synchronization occurs during tracing
* When seeing warnings or errors related the tensor synchronization, look into the code path and make appropriate changes
Collaborator

When you see warnings or errors related to tensor synchronization, look into the code path and make appropriate changes.

Contributor Author

updating. Thanks!
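For reference, the `PT_XLA_DEBUG_LEVEL` knob from the list above is set as an environment variable on the training command; `train.py` here is a hypothetical entry point:

```shell
# Level-2 debug output reports compilation/execution points during the
# run, making unintended tensor synchronization easier to spot.
PT_XLA_DEBUG_LEVEL=2 python train.py
```

`_set_allow_execution(False)` is applied in code instead, turning any synchronization during tracing into a hard error rather than a warning.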


* Use ``PT_XLA_DEBUG_LEVEL=2`` during initial development to identify potential synchronization points
* Apply ``_set_allow_execution(False)`` when you want to ensure no tensor synchronization occurs during tracing
* When seeing warnings or errors related the tensor synchronization, look into the code path and make appropriate changes
Collaborator

Can we provide examples of "appropriate changes"?

Contributor Author

Provided in the new revision. Thanks!

Some common causes of Compilation/Executation are 1. User manually call
`torch_xla.sync()`. 2. [Parallel
Some common causes of Compilation/Executation are
1. User manually call
Collaborator

  1. User manually calls

Contributor Author

thanks for improving the readability.

[profiler StepTrace
region](https://github.com/pytorch/xla/blob/fe4af0080af07f78ca2b614dd91b71885a3bbbb8/torch_xla/debug/profiler.py#L165-L171).
4. Dynamo decide to compile/execute the graph. 5. User trying to
4. Dynamo decide to compile/execute the graph.
Collaborator

  1. Dynamo decides to compile....

Contributor Author

thanks for improving the readability.

[profiler StepTrace
region](https://github.com/pytorch/xla/blob/fe4af0080af07f78ca2b614dd91b71885a3bbbb8/torch_xla/debug/profiler.py#L165-L171).
4. Dynamo decide to compile/execute the graph. 5. User trying to
4. Dynamo decide to compile/execute the graph.
5. User trying to
Collaborator

  1. User tries to access (often due to logging) the value of a tensor before calling torch_xla.sync()

Contributor Author

thanks for improving the readability.

access(often due to logging) the value of a tensor before the
`torch_xla.sync()`.
6. User trying to access tensor value before mark_step. See [PyTorch on XLA Devices](https://github.com/pytorch/xla/blob/master/docs/source/learn/pytorch-on-xla-devices.md) for more details.
Collaborator

  1. User tries to access a tensor value before calling mark_step. See PyTorch on XLA Devices for more details.

Contributor Author

thanks for improving the readability.

@@ -137,15 +137,20 @@ Execution Analysis: ------------------------------------------------------------
Execution Analysis: ================================================================================
```

Some common causes of Compilation/Executation are 1. User manually call
`torch_xla.sync()`. 2. [Parallel
Some common causes of Compilation/Executation are
Collaborator

Some common causes of compilation/execution are ...

Contributor Author

thanks for improving the readability.

access(often due to logging) the value of a tensor before the
`torch_xla.sync()`.
6. User trying to access tensor value before mark_step. See [PyTorch on XLA Devices](https://github.com/pytorch/xla/blob/master/docs/source/learn/pytorch-on-xla-devices.md) for more details.

The execution caused by 1-4 are expected, and we want to avoid 5 by
Collaborator

The op executions caused by items 1-4 are expected, and we want to avoid item 5 by
either reducing the frequency of accessing tensor values or manually adding a call to
torch_xla.sync() before accessing them.

Contributor Author

thanks for improving the readability.
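The suggestion above (read tensor values less often, or call `torch_xla.sync()` before reading them) can be illustrated without an XLA device. `LazyValue` below is a hypothetical stand-in for a lazy tensor, not a torch_xla API:

```python
# Hypothetical stand-in for a lazy tensor: reading the value before sync()
# forces an early graph execution; reading it after sync() is free.
class LazyValue:
    def __init__(self, compute):
        self._compute = compute
        self._result = None
        self.forced_early = False

    def sync(self):
        # Corresponds to torch_xla.sync(): execute the pending graph.
        if self._result is None:
            self._result = self._compute()

    def value(self):
        if self._result is None:
            # Accessed before sync: an extra execution is triggered mid-step.
            self.forced_early = True
            self.sync()
        return self._result

# Anti-pattern: reading the value mid-step (e.g. for logging) forces
# an early execution.
early = LazyValue(lambda: 0.25)
print(early.value(), early.forced_early)   # 0.25 True

# Preferred: sync at the step boundary first, then read freely.
late = LazyValue(lambda: 0.25)
late.sync()
print(late.value(), late.forced_early)     # 0.25 False
```

The same ordering applies to real training loops: log losses after the step-boundary sync rather than inside the traced region.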


aws-yyjau commented May 29, 2025

Hi @mikegre-google ,

I updated the docs based on your comments. Let me know if it looks good. Thanks for your review!

@aws-yyjau (Contributor Author)

Hi @tengyifei

Thanks for your comments! I updated the doc based on that. Please let me know if that addresses your comments.

@tengyifei tengyifei enabled auto-merge (squash) June 2, 2025 02:48
@qihqi qihqi requested a review from mikegre-google June 2, 2025 03:31
[profiler StepTrace
region](https://github.com/pytorch/xla/blob/fe4af0080af07f78ca2b614dd91b71885a3bbbb8/torch_xla/debug/profiler.py#L165-L171).
4. Dynamo decide to compile/execute the graph. 5. User trying to
4. Dynamo decides to compile/execute the graph.
5. User tries to
access(often due to logging) the value of a tensor before the
Collaborator

Space needed after "access"

Contributor Author

updated. Thanks.

access(often due to logging) the value of a tensor before the
`torch_xla.sync()`.
6. User tries to a tensor value before calling `mark_step`. See [PyTorch on XLA Devices](https://github.com/pytorch/xla/blob/master/docs/source/learn/pytorch-on-xla-devices.md) for more details.
Collaborator

User tries to access a tensor value ...?

Contributor Author

updated. Thanks.

auto-merge was automatically disabled June 2, 2025 18:40

Head branch was pushed to by a user without write access

@qihqi qihqi requested a review from mikegre-google June 3, 2025 03:05
3 participants