doc: update pytorch-on-xla-devices and troubleshoot doc for tensor synchronization issue #9258
Conversation
* The tensor value is being accessed during tracing (``tensor[0]``)
* The resulting graph becomes fixed based on the tensor value available during tracing
* Developers might incorrectly assume the condition will be evaluated dynamically during inference
* The solution for the code above is to utilize the debugging flags below to catch the issue and modify the code
It would be good to include an example of how the code above can be changed to prevent tensor synchronization.
Added in the new revision. Thanks!
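For readers following this thread, a minimal sketch of the pattern under discussion, using plain CPU tensors and hypothetical helper names (`scale_bad`, `scale_good` are illustrative, not from the PR): Python control flow reads the tensor value, which under XLA lazy tracing forces a synchronization and bakes the chosen branch into the graph, while `torch.where` keeps the condition inside the graph.

```python
import torch

def scale_bad(t):
    # Python `if` reads the tensor value: under XLA lazy tracing,
    # evaluating ``t[0] > 0`` forces a device synchronization, and
    # whichever branch was taken gets frozen into the traced graph.
    if t[0] > 0:
        return t * 2
    return t * 3

def scale_good(t):
    # torch.where keeps the condition as a graph operation, so no
    # value needs to be read during tracing and the branch stays
    # dynamic at execution time.
    return torch.where(t[0] > 0, t * 2, t * 3)
```

On CPU both functions return the same values; the difference only matters on a lazy backend, where the first version introduces a synchronization point.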
* Use ``PT_XLA_DEBUG_LEVEL=2`` during initial development to identify potential synchronization points
* Apply ``_set_allow_execution(False)`` when you want to ensure no tensor synchronization occurs during tracing
* When seeing warnings or errors related the tensor synchronization, look into the code path and make appropriate changes
When you see warnings or errors related to tensor synchronization, look into the code path and make appropriate changes.
Updated. Thanks!
* Use ``PT_XLA_DEBUG_LEVEL=2`` during initial development to identify potential synchronization points
* Apply ``_set_allow_execution(False)`` when you want to ensure no tensor synchronization occurs during tracing
* When seeing warnings or errors related the tensor synchronization, look into the code path and make appropriate changes
Can we provide examples of "appropriate changes"?
Provided in the new revision. Thanks!
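To make the thread self-contained, here is one hypothetical example of the kind of "appropriate change" being asked about (the function names are illustrative, not from the PR): replacing a per-element `.item()` accumulation, where every call is a potential synchronization point on a lazy backend, with tensor-level accumulation that materializes a value only once.

```python
import torch

def sum_losses_sync_heavy(losses):
    # Each .item() call reads a tensor value; on an XLA device every
    # one of these is a potential synchronization point during tracing.
    total = 0.0
    for loss in losses:
        total += loss.item()
    return total

def sum_losses_lazy(losses):
    # Keep the accumulation as graph operations; the value is only
    # materialized when the caller finally reads the result.
    return torch.stack(losses).sum()
```

Both return the same sum on CPU; on XLA the second version stays inside the lazily traced graph.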
docs/source/learn/troubleshoot.md
Outdated
Some common causes of Compilation/Executation are 1. User manually call
`torch_xla.sync()`. 2. [Parallel
Some common causes of Compilation/Executation are
1. User manually call
- User manually calls
thanks for improving the readability.
docs/source/learn/troubleshoot.md
Outdated
[profiler StepTrace
region](https://github.com/pytorch/xla/blob/fe4af0080af07f78ca2b614dd91b71885a3bbbb8/torch_xla/debug/profiler.py#L165-L171).
4. Dynamo decide to compile/execute the graph. 5. User trying to
4. Dynamo decide to compile/execute the graph.
- Dynamo decides to compile....
thanks for improving the readability.
docs/source/learn/troubleshoot.md
Outdated
[profiler StepTrace
region](https://github.com/pytorch/xla/blob/fe4af0080af07f78ca2b614dd91b71885a3bbbb8/torch_xla/debug/profiler.py#L165-L171).
4. Dynamo decide to compile/execute the graph. 5. User trying to
4. Dynamo decide to compile/execute the graph.
5. User trying to
- User tries to access (often due to logging) the value of a tensor before calling `torch_xla.sync()`
thanks for improving the readability.
docs/source/learn/troubleshoot.md
Outdated
access(often due to logging) the value of a tensor before the
`torch_xla.sync()`.
6. User trying to access tensor value before mark_step. See [PyTorch on XLA Devices](https://github.com/pytorch/xla/blob/master/docs/source/learn/pytorch-on-xla-devices.md) for more details.
- User tries to access a tensor value before calling `mark_step`. See PyTorch on XLA Devices for more details.
thanks for improving the readability.
docs/source/learn/troubleshoot.md
Outdated
@@ -137,15 +137,20 @@ Execution Analysis: ------------------------------------------------------------
Execution Analysis: ================================================================================
```
Some common causes of Compilation/Executation are 1. User manually call
`torch_xla.sync()`. 2. [Parallel
Some common causes of Compilation/Executation are
Some common causes of compilation/execution are ...
thanks for improving the readability.
docs/source/learn/troubleshoot.md
Outdated
access(often due to logging) the value of a tensor before the
`torch_xla.sync()`.
6. User trying to access tensor value before mark_step. See [PyTorch on XLA Devices](https://github.com/pytorch/xla/blob/master/docs/source/learn/pytorch-on-xla-devices.md) for more details.
The execution caused by 1-4 are expected, and we want to avoid 5 by
The op executions caused by items 1-4 are expected, and we want to avoid item 5 by either reducing the frequency of accessing tensor values or manually adding a call to `torch_xla.sync()` before accessing them.
thanks for improving the readability.
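A hypothetical sketch of how item 5 can be avoided in practice (the helper name `losses_to_log` is illustrative, not from the PR): instead of materializing the loss value on every step, read it only every `log_every` steps, so that on an XLA device most steps never force a mid-step compile and execute.

```python
import torch

def losses_to_log(losses, log_every=100):
    # Materialize tensor values only every ``log_every`` steps.
    # On an XLA device, each float(...)/.item() call before the step
    # boundary would otherwise force a compile+execute (item 5 above).
    logged = {}
    for step, loss in enumerate(losses):
        if step % log_every == 0:
            logged[step] = float(loss)  # one sync per log_every steps
    return logged
```

The same idea applies to any frequently logged scalar: batch the reads, or call `torch_xla.sync()` first so the access happens at a step boundary.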
Hi @mikegre-google, I updated the docs based on your comments. Let me know if it looks good. Thanks for your review!
Hi @tengyifei Thanks for your comments! I updated the doc based on that. Please let me know if that addresses your comments.
docs/source/learn/troubleshoot.md
Outdated
[profiler StepTrace
region](https://github.com/pytorch/xla/blob/fe4af0080af07f78ca2b614dd91b71885a3bbbb8/torch_xla/debug/profiler.py#L165-L171).
4. Dynamo decide to compile/execute the graph. 5. User trying to
4. Dynamo decides to compile/execute the graph.
5. User tries to
access(often due to logging) the value of a tensor before the
Space needed after "access"
updated. Thanks.
docs/source/learn/troubleshoot.md
Outdated
access(often due to logging) the value of a tensor before the
`torch_xla.sync()`.
6. User tries to a tensor value before calling `mark_step`. See [PyTorch on XLA Devices](https://github.com/pytorch/xla/blob/master/docs/source/learn/pytorch-on-xla-devices.md) for more details.
User tries to access a tensor value ...?
updated. Thanks.
Head branch was pushed to by a user without write access