-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Add lazy copy to astype #50802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add lazy copy to astype #50802
Changes from 14 commits
ed2e322
1967a2a
45ea989
a2e1e53
8f29fda
55fc3bc
a1847e8
75bdf1d
528b0fb
f29a0be
35b9fa4
664c31b
49a2029
e1dc971
894aaa8
273b5aa
31557be
c0f5003
031aaa5
174e2d3
876d293
ff37e58
5492c10
620181c
4f6f954
95a8296
7640c36
469383d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,6 +26,7 @@ | |
is_dtype_equal, | ||
is_integer_dtype, | ||
is_object_dtype, | ||
is_string_dtype, | ||
is_timedelta64_dtype, | ||
pandas_dtype, | ||
) | ||
|
@@ -246,3 +247,45 @@ def astype_array_safe( | |
raise | ||
|
||
return new_values | ||
|
||
|
||
def astype_is_view(dtype: DtypeObj, new_dtype: DtypeObj) -> bool: | ||
"""Checks if astype avoided copying the data. | ||
|
||
Parameters | ||
---------- | ||
dtype : Original dtype | ||
new_dtype : target dtype | ||
|
||
Returns | ||
------- | ||
True if new data is a view, False otherwise | ||
""" | ||
if dtype == new_dtype: | ||
return True | ||
|
||
elif isinstance(dtype, np.dtype) and isinstance(new_dtype, np.dtype): | ||
# Only equal numpy dtypes avoid a copy | ||
return False | ||
|
||
if is_string_dtype(dtype) and is_string_dtype(new_dtype): | ||
return True | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would think this is not always guaranteed to be a view? (but of course safer to return True than incorrectly assume it is always a copy)
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Using this branch, I see the following wrong behaviour in case you have mixed objects:
Because of not taking a copy, the For the case of "object -> some extension dtype" casting, we should probably always do There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I opened #51073 to address the inplace modification. You are correct, it is not guaranteed to be a no-copy op, but it could be and I couldn't figure out a more precise check. We can still optimise in a follow-up to get this stricter, for now I was aiming to get as many cases as possible without making it overly complex There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Can you add a comment explaining the True and to note something like that? |
||
|
||
elif is_object_dtype(dtype) and new_dtype.kind == "O": | ||
# When the underlying array has dtype object, we don't have to make a copy | ||
return True | ||
Comment on lines
+275
to
+277
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Our own extension types fall in this category? (eg casting object column of Period objects to period dtype, although this specific example is actually never a view) Rather for a follow-up, would it be worth to have some more custom logic here for our own EAs? Both IntervalDtype and PeriodDtype have kind of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep would prefer to do this as a follow up with specific tests |
||
|
||
elif is_string_dtype(dtype) or is_string_dtype(new_dtype): | ||
return False | ||
|
||
elif dtype.kind in "mM" and new_dtype.kind in "mM": | ||
return True | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also to note for a follow-up, I assume now with multiple resolutions being supported for datetime64, we should maybe check those, since if you have different units, this is never a view? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually, I am wondering: what is the purpose of this check? Because if you cast datetime64[ns]->datetime64[ns], that's already covered by the equal dtype case. Mixing datetime and timedelta is something we disallow explicitly (numpy allows it). So this is for DatetimeTZDtype? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep tz aware stuff here. Removing this check causes test failures There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added a bunch of additional tests and added a unit check here. |
||
|
||
elif getattr(dtype, "numpy_dtype", dtype) == getattr( | ||
new_dtype, "numpy_dtype", new_dtype | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is for our masked arrays? If so, can you add some comments explaining that? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. pyarrow as well, both don't copy |
||
): | ||
# If underlying numpy dtype is the same, no copy is made, e.g. | ||
# int64 -> Int64 or int64[pyarrow] | ||
return True | ||
|
||
return False |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -421,3 +421,14 @@ def test_array_to_numpy_na(): | |
result = arr.to_numpy(na_value=True, dtype=bool) | ||
expected = np.array([True, True]) | ||
tm.assert_numpy_array_equal(result, expected) | ||
|
||
|
||
def test_array_copy_on_write(using_copy_on_write): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This file is about pd.array(..) I think, so I would move it either to the decimal tests (extension/decimal/test_decimal), or to the other copy_view astype tests and import the DecimalDtype there |
||
df = pd.DataFrame({"a": [decimal.Decimal(2), decimal.Decimal(3)]}, dtype="object") | ||
df2 = df.astype(DecimalDtype()) | ||
df.iloc[0, 0] = 0 | ||
if using_copy_on_write: | ||
expected = pd.DataFrame( | ||
{"a": [decimal.Decimal(2), decimal.Decimal(3)]}, dtype=DecimalDtype() | ||
) | ||
tm.assert_equal(df2.values, expected.values) |
Original file line number | Diff line number | Diff line change | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -1,6 +1,8 @@ | ||||||||||||||||
import numpy as np | ||||||||||||||||
import pytest | ||||||||||||||||
|
||||||||||||||||
from pandas.compat import pa_version_under6p0 | ||||||||||||||||
|
||||||||||||||||
from pandas import ( | ||||||||||||||||
DataFrame, | ||||||||||||||||
Index, | ||||||||||||||||
|
@@ -525,6 +527,138 @@ def test_to_frame(using_copy_on_write): | |||||||||||||||
tm.assert_frame_equal(df, expected) | ||||||||||||||||
|
||||||||||||||||
|
||||||||||||||||
def test_astype_single_dtype(using_copy_on_write): | ||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shall we move the astype tests to a dedicated file instead of in the middle of the other methods? My hunch is that we might need to add some more astype tests (if we specialize more for our own dtypes), and test_methods.py is already getting long There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah sounds good |
||||||||||||||||
df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1.5}) | ||||||||||||||||
df_orig = df.copy() | ||||||||||||||||
df2 = df.astype("float64") | ||||||||||||||||
|
||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert np.shares_memory(get_array(df2, "c"), get_array(df, "c")) | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
else: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c")) | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
|
||||||||||||||||
# mutating df2 triggers a copy-on-write for that column/block | ||||||||||||||||
df2.iloc[0, 2] = 5.5 | ||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c")) | ||||||||||||||||
tm.assert_frame_equal(df, df_orig) | ||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
We don't test this consistently for all methods here, but astype seems a sufficiently complicated case (not just based on a copy(deep=False) under the hood) that it's probably good to be complete. (same for the ones below) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep makes sense |
||||||||||||||||
|
||||||||||||||||
# mutating parent also doesn't update result | ||||||||||||||||
df2 = df.astype("float64") | ||||||||||||||||
df.iloc[0, 2] = 5.5 | ||||||||||||||||
tm.assert_frame_equal(df2, df_orig.astype("float64")) | ||||||||||||||||
|
||||||||||||||||
|
||||||||||||||||
@pytest.mark.parametrize("dtype", ["int64", "Int64"]) | ||||||||||||||||
@pytest.mark.parametrize("new_dtype", ["int64", "Int64", "int64[pyarrow]"]) | ||||||||||||||||
def test_astype_avoids_copy(using_copy_on_write, dtype, new_dtype): | ||||||||||||||||
if new_dtype == "int64[pyarrow]" and pa_version_under6p0: | ||||||||||||||||
pytest.skip("pyarrow not installed") | ||||||||||||||||
df = DataFrame({"a": [1, 2, 3]}, dtype=dtype) | ||||||||||||||||
df_orig = df.copy() | ||||||||||||||||
df2 = df.astype(new_dtype) | ||||||||||||||||
|
||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
else: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
|
||||||||||||||||
# mutating df2 triggers a copy-on-write for that column/block | ||||||||||||||||
df2.iloc[0, 0] = 10 | ||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
tm.assert_frame_equal(df, df_orig) | ||||||||||||||||
|
||||||||||||||||
# mutating parent also doesn't update result | ||||||||||||||||
df2 = df.astype(new_dtype) | ||||||||||||||||
df.iloc[0, 0] = 100 | ||||||||||||||||
tm.assert_frame_equal(df2, df_orig.astype(new_dtype)) | ||||||||||||||||
|
||||||||||||||||
|
||||||||||||||||
@pytest.mark.parametrize("dtype", ["float64", "int32", "Int32", "int32[pyarrow]"]) | ||||||||||||||||
def test_astype_different_target_dtype(using_copy_on_write, dtype): | ||||||||||||||||
if dtype == "int32[pyarrow]" and pa_version_under6p0: | ||||||||||||||||
pytest.skip("pyarrow not installed") | ||||||||||||||||
df = DataFrame({"a": [1, 2, 3]}) | ||||||||||||||||
df_orig = df.copy() | ||||||||||||||||
df2 = df.astype(dtype) | ||||||||||||||||
|
||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Maybe also explicitly check that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes this is a very good check, would have made sure that I did not miss the numpy dtype thing below |
||||||||||||||||
|
||||||||||||||||
df2.iloc[0, 0] = 5 | ||||||||||||||||
tm.assert_frame_equal(df, df_orig) | ||||||||||||||||
|
||||||||||||||||
# mutating parent also doesn't update result | ||||||||||||||||
df2 = df.astype(dtype) | ||||||||||||||||
df.iloc[0, 0] = 100 | ||||||||||||||||
tm.assert_frame_equal(df2, df_orig.astype(dtype)) | ||||||||||||||||
|
||||||||||||||||
|
||||||||||||||||
@pytest.mark.parametrize( | ||||||||||||||||
"dtype, new_dtype", [("object", "string"), ("string", "object")] | ||||||||||||||||
) | ||||||||||||||||
def test_astype_string_and_object(using_copy_on_write, dtype, new_dtype): | ||||||||||||||||
df = DataFrame({"a": ["a", "b", "c"]}, dtype=dtype) | ||||||||||||||||
df_orig = df.copy() | ||||||||||||||||
df2 = df.astype(new_dtype) | ||||||||||||||||
|
||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
else: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
|
||||||||||||||||
df2.iloc[0, 0] = "x" | ||||||||||||||||
tm.assert_frame_equal(df, df_orig) | ||||||||||||||||
|
||||||||||||||||
|
||||||||||||||||
@pytest.mark.parametrize( | ||||||||||||||||
"dtype, new_dtype", [("object", "string"), ("string", "object")] | ||||||||||||||||
) | ||||||||||||||||
def test_astype_string_and_object_update_original( | ||||||||||||||||
using_copy_on_write, dtype, new_dtype | ||||||||||||||||
): | ||||||||||||||||
df = DataFrame({"a": ["a", "b", "c"]}, dtype=dtype) | ||||||||||||||||
df2 = df.astype(new_dtype) | ||||||||||||||||
df_orig = df2.copy() | ||||||||||||||||
|
||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
else: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
|
||||||||||||||||
df.iloc[0, 0] = "x" | ||||||||||||||||
tm.assert_frame_equal(df2, df_orig) | ||||||||||||||||
|
||||||||||||||||
|
||||||||||||||||
def test_astype_dict_dtypes(using_copy_on_write): | ||||||||||||||||
df = DataFrame( | ||||||||||||||||
{"a": [1, 2, 3], "b": [4, 5, 6], "c": Series([1.5, 1.5, 1.5], dtype="float64")} | ||||||||||||||||
) | ||||||||||||||||
df_orig = df.copy() | ||||||||||||||||
df2 = df.astype({"a": "float64", "c": "float64"}) | ||||||||||||||||
|
||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert np.shares_memory(get_array(df2, "c"), get_array(df, "c")) | ||||||||||||||||
assert np.shares_memory(get_array(df2, "b"), get_array(df, "b")) | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
else: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c")) | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "b"), get_array(df, "b")) | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
|
||||||||||||||||
# mutating df2 triggers a copy-on-write for that column/block | ||||||||||||||||
df2.iloc[0, 2] = 5.5 | ||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c")) | ||||||||||||||||
|
||||||||||||||||
df2.iloc[0, 1] = 10 | ||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "b"), get_array(df, "b")) | ||||||||||||||||
tm.assert_frame_equal(df, df_orig) | ||||||||||||||||
|
||||||||||||||||
|
||||||||||||||||
@pytest.mark.parametrize("ax", ["index", "columns"]) | ||||||||||||||||
def test_swapaxes_noop(using_copy_on_write, ax): | ||||||||||||||||
df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) | ||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -867,13 +867,16 @@ def test_constructor_invalid_coerce_ints_with_float_nan(self, any_int_numpy_dtyp | |
with pytest.raises(IntCastingNaNError, match=msg): | ||
Series(np.array(vals), dtype=any_int_numpy_dtype) | ||
|
||
def test_constructor_dtype_no_cast(self): | ||
def test_constructor_dtype_no_cast(self, using_copy_on_write): | ||
# see gh-1572 | ||
s = Series([1, 2, 3]) | ||
s2 = Series(s, dtype=np.int64) | ||
|
||
s2[1] = 5 | ||
assert s[1] == 5 | ||
if using_copy_on_write: | ||
assert s[1] == 2 | ||
else: | ||
assert s[1] == 5 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, so this wasn't yet covered when updating the Series(series) constructor (#49524). Could you add an explicit test for copy/view behaviour with a case like above to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just parametrizing the first test there should be sufficient to cover this case: diff --git a/pandas/tests/copy_view/test_constructors.py b/pandas/tests/copy_view/test_constructors.py
index c04c733e5e..7793c6cad5 100644
--- a/pandas/tests/copy_view/test_constructors.py
+++ b/pandas/tests/copy_view/test_constructors.py
@@ -1,3 +1,5 @@
+import pytest
+
import numpy as np
from pandas import Series
@@ -6,13 +8,14 @@ from pandas import Series
# Copy/view behaviour for Series / DataFrame constructors
-def test_series_from_series(using_copy_on_write):
+@pytest.mark.parametrize("dtype", [None, "int64"])
+def test_series_from_series(dtype, using_copy_on_write):
# Case: constructing a Series from another Series object follows CoW rules:
# a new object is returned and thus mutations are not propagated
ser = Series([1, 2, 3], name="name")
# default is copy=False -> new Series is a shallow copy / view of original
- result = Series(ser)
+ result = Series(ser, dtype=dtype)
# the shallow copy still shares memory
assert np.shares_memory(ser.values, result.values)
@@ -34,7 +37,7 @@ def test_series_from_series(using_copy_on_write):
assert np.shares_memory(ser.values, result.values)
# the same when modifying the parent
- result = Series(ser)
+ result = Series(ser, dtype=dtype)
if using_copy_on_write:
# mutating original doesn't mutate new series We should still add a test that ensure that if the cast requires a copy, we do not track references (to avoid a unnecessary copy later on), but that can be done later. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The copy case is more for your pr I guess? Adjusted the test accordingly |
||
|
||
def test_constructor_datelike_coercion(self): | ||
|
||
|
Uh oh!
There was an error while loading. Please reload this page.