API (string dtype): implement hierarchy (NA > NaN, pyarrow > python) for consistent comparisons between different string dtypes #61138

jorisvandenbossche · 2025-03-17T10:08:28Z

This does not yet handle the case of comparison to object dtype.

Tests added and passed if fixing a bug or adding a new feature
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

mroeschke · 2025-03-19T17:13:40Z

pandas/tests/arrays/string_/test_string.py

-        expected = pd.array([None, None, None], dtype=expected_dtype)
-        tm.assert_extension_array_equal(result, expected)
+    # # with list
+    # other = [None, None, "c"]


Did you want to implement testing this in this PR?

Yes, this was already implemented; I just need to add this case back to the test. The original "array" test was actually testing with a list. I updated the test so that it now really uses an array (parametrized with all the different dtypes, to get all combinations of dtypes in both operands), and added a separate test for the list case.
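As a rough illustration of "all combinations of dtypes in both operands", here is a minimal sketch that enumerates the operand pairs. The tuple labels and the `operand_dtype_pairs` helper are hypothetical stand-ins, not the fixtures used in the actual test suite.

```python
from itertools import product

# Hypothetical labels for the four string dtype variants discussed in this PR,
# written as (storage, NA sentinel). These are plain tuples, not pandas dtypes.
STRING_DTYPE_VARIANTS = [
    ("python", "NaN"),
    ("pyarrow", "NaN"),
    ("python", "NA"),
    ("pyarrow", "NA"),
]


def operand_dtype_pairs(variants=STRING_DTYPE_VARIANTS):
    """Return every (left, right) dtype pair, including left == right."""
    return list(product(variants, repeat=2))


pairs = operand_dtype_pairs()
print(len(pairs))  # 4 variants on each side -> 16 pairs
```

In a real test, each pair would presumably be fed to a parametrized test that builds both operands and checks the result dtype.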

jorisvandenbossche · 2025-03-26T08:22:53Z

pandas/tests/arrays/string_/test_string.py

-        result = getattr(a, op_name)(pd.NA)
-        expected = pd.array([None, None, None], dtype=expected_dtype)
-        tm.assert_extension_array_equal(result, expected)


For this case of comparing with NA, we already have a dedicated test just above, so I'm removing it here.

rhshadrach

Needs a whatsnew?

pandas/tests/extension/test_string.py

rhshadrach · 2025-05-10T13:28:21Z

@jorisvandenbossche - I've merged main and pushed a commit here. If you have any objections, I can pull it off.

- Adds whatsnew to 2.3.
- Simplifies conditionals in a test.
- Fixes behavior of ArrowExtensionArray and adds tests for it.

For the last one: previously, comparing an ArrowExtensionArray with a (python, NaN) string array returned a NumPy bool array. This was the only case where a comparison involving an ArrowExtensionArray did not result in an ArrowExtensionArray.

> This does not yet handle the case of comparison to object dtype.

object dtype looks correct to me here.

rhshadrach

LGTM. @mroeschke, can you have a look?

jorisvandenbossche

@rhshadrach thanks for updating this!

> object dtype looks correct to me here.

Hmm, I'm no longer entirely sure what I meant when I said that object dtype was not yet covered. I thought it might be the case where the object dtype does not contain just strings, but that also seems to work fine.

jorisvandenbossche · 2025-05-14T17:39:55Z

doc/source/whatsnew/v2.3.0.rst

+
+in determining the result dtype when there are different string dtypes compared. Some examples:
+
+- When ``pd.StringDtype("pyarrow", na_value=pd.NA)`` is compared against any other string dtype, the result will always be ``boolean[pyarrow]``.


I think this is not correct: doesn't it return just the nullable boolean dtype (i.e. pd.BooleanDtype())? boolean[pyarrow], by contrast, is an alias for pd.ArrowDtype(pa.bool_()).

Hmm, I see that it is actually the behaviour with this PR as well, but I thought I would have "fixed" that while making things consistent.

And I also see that I explicitly coded this expected dtype in the tests myself ...

rhshadrach · 2025-05-18

Given the ordering

object < (python, NaN) < (pyarrow, NaN) < (python, NA) < (pyarrow, NA)

when we compare (pyarrow, NA) with anything we want the result to be as if we compared (pyarrow, NA) with itself, which should result in boolean[pyarrow].
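The rule above can be sketched in a few lines of plain Python. This is only an illustration of the stated ordering, not the pandas implementation: `PRIORITY`, `SELF_RESULT`, `winner`, and `result_dtype` are hypothetical names, and the result labels for every rank other than (pyarrow, NA) are an assumption about what each dtype returns when compared with itself.

```python
# Ranks mirror the ordering quoted above (higher rank wins).
PRIORITY = {
    ("object", None): 0,
    ("python", "NaN"): 1,
    ("pyarrow", "NaN"): 2,
    ("python", "NA"): 3,
    ("pyarrow", "NA"): 4,
}

# Assumed result of comparing each dtype with itself. Only the (pyarrow, NA)
# entry is stated in this thread; the others are labeled assumptions.
SELF_RESULT = {
    ("object", None): "bool (NumPy)",
    ("python", "NaN"): "bool (NumPy)",
    ("pyarrow", "NaN"): "bool (NumPy)",
    ("python", "NA"): "boolean",
    ("pyarrow", "NA"): "boolean[pyarrow]",
}


def winner(left, right):
    """Return whichever operand dtype ranks higher in the hierarchy."""
    return max(left, right, key=PRIORITY.__getitem__)


def result_dtype(left, right):
    """Result as if the higher-ranked dtype were compared with itself."""
    return SELF_RESULT[winner(left, right)]


print(result_dtype(("pyarrow", "NA"), ("python", "NaN")))  # boolean[pyarrow]
```

Because the lookup depends only on the higher-ranked operand, the result is the same regardless of which side of the comparison each dtype is on.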

jorisvandenbossche · 2025-05-14T17:55:20Z

> object dtype looks correct to me here.

> Hmm, I'm no longer entirely sure what I meant when I said that object dtype was not yet covered. I thought it might be the case where the object dtype does not contain just strings, but that also seems to work fine.

One case related to object dtype that is still failing is comparing with an object series that has mixed types:

In [3]: ser1 = pd.Series(["a", None, "b"], dtype=pd.StringDtype("pyarrow", na_value=np.nan))

In [4]: ser2 = pd.Series(["a", None, 2], dtype=object)

In [5]: ser1 == ser2
...
File ~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:517, in ArrowExtensionArray._box_pa_array(cls, value, pa_type, copy)
    514     pa_array = pa.array(value, type=pa_type, from_pandas=True)
    515 except (pa.ArrowInvalid, pa.ArrowTypeError):
    516     # GH50430: let pyarrow infer type, then cast
--> 517     pa_array = pa.array(value, from_pandas=True)
    519 if pa_type is None and pa.types.is_duration(pa_array.type):
    520     # Workaround https://github.com/apache/arrow/issues/37291
    521     from pandas.core.tools.timedeltas import to_timedelta
...
ArrowTypeError: Expected bytes, got a 'int' object

In [6]: ser1 = pd.Series(["a", None, "b"], dtype=pd.StringDtype("python", na_value=np.nan))

In [7]: ser1 == ser2
Out[7]: 
0     True
1    False
2    False

So with plain object dtype, such a comparison works, and it also works with the python-backed string dtype. But it fails with the pyarrow-backed string dtype, because in that case the comparison defers to the ArrowExtensionArray implementation, which tries to convert the other operand to a pyarrow array, and that conversion is not supported for mixed types. Yet we generally do support mixed-type object dtype in pandas (although in many cases it is definitely not best practice).

(but let's consider this for a separate issue/PR)
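For reference, the object-dtype baseline mentioned above ("with just object dtype, such a comparison works") can be checked with pandas alone. This is a small sketch using the same values as the session above but with both sides as object dtype; the variable names are illustrative.

```python
import pandas as pd

# Both operands are plain object dtype; the right-hand side mixes str and int.
left = pd.Series(["a", None, "b"], dtype=object)
right = pd.Series(["a", None, 2], dtype=object)

# Elementwise comparison: missing values compare as False for ==,
# and "b" == 2 is simply False rather than an error.
result = left == right
print(result.tolist())  # [True, False, False]
```

The pyarrow-backed case fails earlier, while converting the mixed-type operand, so it never reaches an elementwise comparison like this one.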

API (string dtype): implement hierarchy (NA > NaN, pyarrow > python) for consistent comparisons between different string dtypes

3c4d782

jorisvandenbossche added the Strings label Mar 17, 2025

jorisvandenbossche added this to the 2.3 milestone Mar 17, 2025

jorisvandenbossche added 2 commits March 19, 2025 10:16

fix string arith tests

7ffb08f

fix for build without pyarrow

48907c3

jorisvandenbossche mentioned this pull request Mar 19, 2025

API (string dtype): comparisons between different string classes #60639

Open

jorisvandenbossche added 2 commits March 19, 2025 14:07

fix xfail condition

2058120

fix type annotation

4ebd93b

jorisvandenbossche marked this pull request as ready for review March 19, 2025 16:07

mroeschke reviewed Mar 19, 2025

jorisvandenbossche force-pushed the string-dtype-comparison-methods-priority branch from 9a0c382 to 4ebd93b March 19, 2025 18:31

rhshadrach added the Numeric Operations label Mar 23, 2025

jorisvandenbossche added 3 commits March 26, 2025 09:10

Merge remote-tracking branch 'upstream/main' into string-dtype-comparison-methods-priority

33db5d0

re-add test with list

51340a9

cleanup

e2bfe18

jorisvandenbossche commented Mar 26, 2025

jorisvandenbossche requested a review from rhshadrach March 26, 2025 08:23

rhshadrach requested changes Mar 26, 2025

pandas/tests/extension/test_string.py (outdated, resolved)

rhshadrach added 2 commits May 10, 2025 08:13

Merge branch 'main' of https://github.com/pandas-dev/pandas into string-dtype-comparison-methods-priority

5ba3577

Fix ArrowExtensionArray and add whatsnew

846afff

fixup

99475e6

rhshadrach added the API - Consistency label May 10, 2025

rhshadrach requested a review from mroeschke May 10, 2025 15:13

rhshadrach approved these changes May 10, 2025

jorisvandenbossche commented May 14, 2025

Merge branch 'main' into string-dtype-comparison-methods-priority

b481d7a
