Skip to content

BUG: Fix astype str issue 54654 #54687

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Oct 29, 2023
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v2.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -188,7 +188,7 @@ Numeric

Conversion
^^^^^^^^^^
-
- Bug in :func:`astype` when called with ``str`` on unpickled array - the array might change in-place (:issue:`54654`)
-

Strings
Expand Down
2 changes: 1 addition & 1 deletion pandas/_libs/lib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -775,7 +775,7 @@ cpdef ndarray[object] ensure_string_array(

result = np.asarray(arr, dtype="object")

if copy and result is arr:
if copy and (result is arr or np.may_share_memory(arr, result)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason for np.may_share_memory vs np.shares_memory? can you add a # GH#54654 below this line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the docs may_share_memory does only bounds check for the array, but may return false positives in general.
shares_memory is more computationally expensive.
In this case, asarray should return the same underlying array so bounds checking should be enough, but the worst case would be unnecessary copy in edge-cases instead of more expensive check in common cases
Although now that I'm thinking about it, I'm not sure how the internal shares_memory works.
If it first does bound check and then exact check it might work just the same, and there would be no reason not to use it instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, but in case they do have overlapping bounds the check can be exponentially big, so I think it makes sense for this case to stop at bound check only (since asaaray probably returned the same memory)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the pickle / unpickle have to change things here? I understand this fixes the issue but its not clear to me why it is required in the first place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Will,
The problem is not exactly with pickle, it's just a quick way to reproduce the problem.
The problem is that the code here attempts to check if two arrays have the same memory (or share memory) and it does so incorrectly - result is arr
See numpy/numpy#24478 for more technical details.
BTW - it seems the resolution for this issue will be to enhance the docs to state that checking identity on two arrays is not a proper way to verify they don't share memory.

Also, I see there is a merge conflict now, so I'll fix it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK thanks for the background info. So shouldn't we be removing the result is arr check altogether as part of this?

As far as the shares_memory versus may_share_memory disccusion, is the main difference when taking a view of an array that reinterprets bytes? I'd probably just prefer shares_memory to start knowing that it is more correct at the cost of more performance; performance can always be improved later (as long as it doesn't introduce a serious regression here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left result is arr since in the most common case asarray just return the same array without modification, so identity check is a pretty cheap validation
The reason I prefered may_share_memory is due to numpy warning: 'If in doubt, use numpy.may_share_memory instead.'
Let me state my reasoning:

  • asarray should either copy or return the same array/memory
  • If the memory is in the same location - it was not copied. bounds check should suffice here since there shouldn't be any views of the array created in this function
  • copy should be less expensive than exhaust check of the shared memory

regarding regression - I'm not sure Pandas has enough tests in place to check the performance regression in this case since it's a pretty unique use case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that makes sense but we use shares_memory throughout our codebase where may_share_memory is little used; if that's something we wanted to change I think it is outside the scope of this PR (may be worth doing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable, I'll change it to shares_memory

result = result.copy()
elif not copy and result is arr:
already_copied = False
Expand Down
11 changes: 11 additions & 0 deletions pandas/tests/copy_view/test_astype.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
import pickle

import numpy as np
import pytest

Expand Down Expand Up @@ -131,6 +133,15 @@ def test_astype_string_and_object_update_original(
tm.assert_frame_equal(df2, df_orig)


def test_astype_string_copy_on_pickle_roundrip():
# https://github.com/pandas-dev/pandas/issues/54654
# ensure_string_array may alter array inplace
base = Series(np.array([(1, 2), None, 1], dtype="object"))
base_copy = pickle.loads(pickle.dumps(base))
base_copy.astype(str)
tm.assert_series_equal(base, base_copy)


def test_astype_dict_dtypes(using_copy_on_write):
df = DataFrame(
{"a": [1, 2, 3], "b": [4, 5, 6], "c": Series([1.5, 1.5, 1.5], dtype="float64")}
Expand Down