API (string dtype): implement hierarchy (NA > NaN, pyarrow > python) for consistent comparisons between different string dtypes #61138
Merged: mroeschke merged 12 commits into pandas-dev:main from jorisvandenbossche:string-dtype-comparison-methods-priority on May 19, 2025
Commits (12):
3c4d782  API (string dtype): implement hierarchy (NA > NaN, pyarrow > python) …  (jorisvandenbossche)
7ffb08f  fix string arith tests  (jorisvandenbossche)
48907c3  fix for build without pyarrow  (jorisvandenbossche)
2058120  fix xfail condition  (jorisvandenbossche)
4ebd93b  fix type annotation  (jorisvandenbossche)
33db5d0  Merge remote-tracking branch 'upstream/main' into string-dtype-compar…  (jorisvandenbossche)
51340a9  re-add test with list  (jorisvandenbossche)
e2bfe18  cleanup  (jorisvandenbossche)
5ba3577  Merge branch 'main' of https://github.com/pandas-dev/pandas into stri…  (rhshadrach)
846afff  Fix ArrowExtensionArray and add whatsnew  (rhshadrach)
99475e6  fixup  (rhshadrach)
b481d7a  Merge branch 'main' into string-dtype-comparison-methods-priority  (mroeschke)
@@ -10,10 +10,12 @@
 
 from pandas._config import using_string_dtype
 
+from pandas.compat import HAS_PYARROW
 from pandas.compat.pyarrow import (
     pa_version_under12p0,
     pa_version_under19p0,
 )
+import pandas.util._test_decorators as td
 
 from pandas.core.dtypes.common import is_dtype_equal
 
@@ -45,6 +47,25 @@ def cls(dtype):
     return dtype.construct_array_type()
 
 
+def string_dtype_highest_priority(dtype1, dtype2):
+    if HAS_PYARROW:
+        DTYPE_HIERARCHY = [
+            pd.StringDtype("python", na_value=np.nan),
+            pd.StringDtype("pyarrow", na_value=np.nan),
+            pd.StringDtype("python", na_value=pd.NA),
+            pd.StringDtype("pyarrow", na_value=pd.NA),
+        ]
+    else:
+        DTYPE_HIERARCHY = [
+            pd.StringDtype("python", na_value=np.nan),
+            pd.StringDtype("python", na_value=pd.NA),
+        ]
+
+    h1 = DTYPE_HIERARCHY.index(dtype1)
+    h2 = DTYPE_HIERARCHY.index(dtype2)
+    return DTYPE_HIERARCHY[max(h1, h2)]
+
+
 def test_dtype_constructor():
     pytest.importorskip("pyarrow")
 
@@ -331,25 +352,75 @@ def test_comparison_methods_scalar_not_string(comparison_op, dtype):
     tm.assert_extension_array_equal(result, expected)
 
 
-def test_comparison_methods_array(comparison_op, dtype):
+def test_comparison_methods_array(comparison_op, dtype, dtype2):
     op_name = f"__{comparison_op.__name__}__"
 
     a = pd.array(["a", None, "c"], dtype=dtype)
-    other = [None, None, "c"]
-    result = getattr(a, op_name)(other)
-    if dtype.na_value is np.nan:
+    other = pd.array([None, None, "c"], dtype=dtype2)
+    result = comparison_op(a, other)
+
+    # ensure operation is commutative
+    result2 = comparison_op(other, a)
+    tm.assert_equal(result, result2)
+
+    if dtype.na_value is np.nan and dtype2.na_value is np.nan:
         if operator.ne == comparison_op:
             expected = np.array([True, True, False])
         else:
             expected = np.array([False, False, False])
             expected[-1] = getattr(other[-1], op_name)(a[-1])
         tm.assert_numpy_array_equal(result, expected)
 
-        result = getattr(a, op_name)(pd.NA)
+    else:
+        max_dtype = string_dtype_highest_priority(dtype, dtype2)
+        if max_dtype.storage == "python":
+            expected_dtype = "boolean"
+        else:
+            expected_dtype = "bool[pyarrow]"
+
+        expected = np.full(len(a), fill_value=None, dtype="object")
+        expected[-1] = getattr(other[-1], op_name)(a[-1])
+        expected = pd.array(expected, dtype=expected_dtype)
+        tm.assert_extension_array_equal(result, expected)
+
+
+@td.skip_if_no("pyarrow")
+def test_comparison_methods_array_arrow_extension(comparison_op, dtype2):
+    # Test pd.ArrowDtype(pa.string()) against other string arrays
+    import pyarrow as pa
+
+    op_name = f"__{comparison_op.__name__}__"
+    dtype = pd.ArrowDtype(pa.string())
+    a = pd.array(["a", None, "c"], dtype=dtype)
+    other = pd.array([None, None, "c"], dtype=dtype2)
+    result = comparison_op(a, other)
+
+    # ensure operation is commutative
+    result2 = comparison_op(other, a)
+    tm.assert_equal(result, result2)
+
+    expected = pd.array([None, None, True], dtype="bool[pyarrow]")
+    expected[-1] = getattr(other[-1], op_name)(a[-1])
+    tm.assert_extension_array_equal(result, expected)
+
+
+def test_comparison_methods_list(comparison_op, dtype):
+    op_name = f"__{comparison_op.__name__}__"
+
+    a = pd.array(["a", None, "c"], dtype=dtype)
+    other = [None, None, "c"]
+    result = comparison_op(a, other)
+
+    # ensure operation is commutative
+    result2 = comparison_op(other, a)
+    tm.assert_equal(result, result2)
+
+    if dtype.na_value is np.nan:
         if operator.ne == comparison_op:
-            expected = np.array([True, True, True])
+            expected = np.array([True, True, False])
         else:
             expected = np.array([False, False, False])
+            expected[-1] = getattr(other[-1], op_name)(a[-1])
         tm.assert_numpy_array_equal(result, expected)
 
     else:
@@ -359,10 +430,6 @@ def test_comparison_methods_array(comparison_op, dtype):
         expected = pd.array(expected, dtype=expected_dtype)
         tm.assert_extension_array_equal(result, expected)
-
-        result = getattr(a, op_name)(pd.NA)
-        expected = pd.array([None, None, None], dtype=expected_dtype)
-        tm.assert_extension_array_equal(result, expected)

[Review comment on lines -362 to -364] For this case of comparing with NA, we already have a dedicated test just above, so removing it here.

 
 
 def test_constructor_raises(cls):
     if cls is pd.arrays.StringArray:
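The element-wise expectations that these tests encode can be illustrated without pandas. The following is a minimal sketch, assuming `None` stands in for a missing value and a hypothetical `na_semantics` flag stands in for the dtype's `na_value`; the function name `compare` is made up for illustration and is not a pandas API.

```python
import operator

# Hypothetical pure-Python model of the expectations in the tests above.
# None represents a missing value; na_semantics is "nan" or "NA".


def compare(op, left, right, na_semantics):
    """Compare two equal-length lists of str-or-None element by element."""
    result = []
    for x, y in zip(left, right):
        if x is None or y is None:
            if na_semantics == "NA":
                # NA semantics: a missing operand gives a missing result.
                result.append(None)
            else:
                # NaN semantics: every comparison is False, except "!=".
                result.append(op is operator.ne)
        else:
            result.append(op(x, y))
    return result


if __name__ == "__main__":
    a = ["a", None, "c"]
    other = [None, None, "c"]
    for op in (operator.eq, operator.ne, operator.lt):
        print(op.__name__, compare(op, a, other, "nan"), compare(op, a, other, "NA"))
```

Per the condition in `test_comparison_methods_array`, the NaN-style result applies only when both operands use `np.nan` as their missing value; if either side uses `pd.NA`, the tests expect the NA-style result.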
I think this is not correct, it returns just the nullable `boolean` dtype? (i.e. `pd.BooleanDtype()`) Whereas `boolean[pyarrow]` is an alias for `pd.ArrowDtype(pa.boolean())`.
Hmm, I see that it is actually the behaviour with this PR as well, but I thought I would have "fixed" that while making things consistent
And I also see that I explicitly coded this expected dtype in the tests myself ...
Given the ordering, when we compare `(pyarrow, NA)` with anything, we want the result to be as if we compared `(pyarrow, NA)` with itself, which should result in `boolean[pyarrow]`.
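The ordering rule discussed in this thread can be sketched in plain Python. This is only an illustrative model, assuming each string dtype is reduced to a `(storage, na_value)` tuple; the names `HIERARCHY`, `highest_priority`, and `expected_result_dtype` are hypothetical, and the result-dtype strings mirror the expectations in `test_comparison_methods_array` (which spells the pyarrow result as `"bool[pyarrow]"`), not pandas internals.

```python
# Illustrative stand-in for the pd.StringDtype variants, lowest priority first:
# NA outranks NaN, and within the same na_value pyarrow outranks python.
HIERARCHY = [
    ("python", "nan"),
    ("pyarrow", "nan"),
    ("python", "NA"),
    ("pyarrow", "NA"),
]


def highest_priority(dtype1, dtype2):
    """Return whichever of the two dtypes sits later in HIERARCHY."""
    return HIERARCHY[max(HIERARCHY.index(dtype1), HIERARCHY.index(dtype2))]


def expected_result_dtype(dtype1, dtype2):
    """Result dtype the tests expect when comparing two string arrays."""
    if dtype1[1] == "nan" and dtype2[1] == "nan":
        # Both operands use NaN semantics: the tests expect a NumPy bool array.
        return "numpy bool"
    storage, _ = highest_priority(dtype1, dtype2)
    return "boolean" if storage == "python" else "bool[pyarrow]"


if __name__ == "__main__":
    # (pyarrow, NA) compared with anything behaves as if compared with itself.
    for other in HIERARCHY:
        print(other, "->", expected_result_dtype(("pyarrow", "NA"), other))
```

Because the higher-priority dtype decides the outcome, the model gives the same result dtype regardless of operand order, which is the property the commutativity checks in the tests rely on.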