-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Fix astype str issue 54654 #54687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
mroeschke
merged 13 commits into
pandas-dev:main
from
Itayazolay:fix-astype-str-issue-54654
Oct 29, 2023
Merged
Changes from 7 commits
Commits
Show all changes
13 commits
Select commit
Hold shift + click to select a range
5c8c2ec
Fix issue #54654
itay-jether d2f25c2
changelog
itay-jether 3e39fed
Update v2.2.0.rst
Itayazolay 0e16b1c
Merge branch 'pandas-dev:main' into fix-astype-str-issue-54654
Itayazolay 14073c9
Merge branch 'main' into fix-astype-str-issue-54654
Itayazolay b72f7a0
merge main
Itayazolay 4a55256
rephrase
Itayazolay 47ee1be
Update lib.pyx
Itayazolay 8b66fdb
Merge branch 'main' into fix-astype-str-issue-54654
Itayazolay 2e78a4f
Update v2.2.0.rst
Itayazolay 94ef701
Merge branch 'main' into fix-astype-str-issue-54654
Itayazolay 622a724
Update lib.pyx
Itayazolay 51f42f8
Merge branch 'main' into fix-astype-str-issue-54654
Itayazolay File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason for np.may_share_memory vs np.shares_memory? can you add a
# GH#54654
below this lineThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the docs
may_share_memory
does only bounds check for the array, but may return false positives in general.shares_memory
is more computationally expensive.In this case,
asarray
should return the same underlying array so bounds checking should be enough, but the worst case would be unnecessary copy in edge-cases instead of more expensive check in common casesAlthough now that I'm thinking about it, I'm not sure how the internal
shares_memory
works.If it first does bound check and then exact check it might work just the same, and there would be no reason not to use it instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, but in case they do have overlapping bounds the check can be exponentially big, so I think it makes sense for this case to stop at bound check only (since asaaray probably returned the same memory)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the pickle / unpickle have to change things here? I understand this fixes the issue but its not clear to me why it is required in the first place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Will,
The problem is not exactly with pickle, it's just a quick way to reproduce the problem.
The problem is that the code here attempts to check if two arrays have the same memory (or share memory) and it does so incorrectly -
result is arr
See numpy/numpy#24478 for more technical details.
BTW - it seems the resolution for this issue will be to enhance the docs to state that checking identity on two arrays is not a proper way to verify they don't share memory.
Also, I see there is a merge conflict now, so I'll fix it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK thanks for the background info. So shouldn't we be removing the
result is arr
check altogether as part of this?As far as the
shares_memory
versusmay_share_memory
disccusion, is the main difference when taking a view of an array that reinterprets bytes? I'd probably just prefershares_memory
to start knowing that it is more correct at the cost of more performance; performance can always be improved later (as long as it doesn't introduce a serious regression here)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left
result is arr
since in the most common case asarray just return the same array without modification, so identity check is a pretty cheap validationThe reason I prefered
may_share_memory
is due tonumpy
warning: 'If in doubt, usenumpy.may_share_memory
instead.'Let me state my reasoning:
asarray
should either copy or return the same array/memoryregarding regression - I'm not sure Pandas has enough tests in place to check the performance regression in this case since it's a pretty unique use case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that makes sense but we use
shares_memory
throughout our codebase wheremay_share_memory
is little used; if that's something we wanted to change I think it is outside the scope of this PR (may be worth doing)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds reasonable, I'll change it to shares_memory