-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
API (string dtype): implement hierarchy (NA > NaN, pyarrow > python) for consistent comparisons between different string dtypes #61138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…for consistent comparisons between different string dtypes
expected = pd.array([None, None, None], dtype=expected_dtype) | ||
tm.assert_extension_array_equal(result, expected) | ||
# # with list | ||
# other = [None, None, "c"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you want to implement testing this in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this was already implemented, just need to add this case back to the test. The original "array" test was actually testing with a list. I updated the test to now actually use an array (parametrized with all the different dtypes, to get all combinations of dtypes in both operands), and added a separate test with just the list.
9a0c382
to
4ebd93b
Compare
result = getattr(a, op_name)(pd.NA) | ||
expected = pd.array([None, None, None], dtype=expected_dtype) | ||
tm.assert_extension_array_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this case of comparing with NA, we already have a dedicated test just above, so removing it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a whatsnew?
…ng-dtype-comparison-methods-priority
@jorisvandenbossche - I've merged main and pushed a commit here. If you have any objections, I can pull it off.
For the last one, previously ArrowExtensionArray vs Nan-Python was giving back NumPy bool. This was the only case where ArrowExtensionArray was not resulting in ArrowExtensionArray.
object dtype looks correct to me here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, @mroeschke can you have a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach thanks for updating this!
object dtype looks correct to me here.
Hmm, not entirely sure anymore what I meant with that object dtype was not yet covered. I thought maybe the case where the object dtype does not contain just strings, but also that seems to work fine
|
||
in determining the result dtype when there are different string dtypes compared. Some examples: | ||
|
||
- When ``pd.StringDtype("pyarrow", na_value=pd.NA)`` is compared against any other string dtype, the result will always be ``boolean[pyarrow]``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is not correct, it returns just the nullable boolean
dtype? (i.e. pd.BooleanDtype()
) Where boolean[pyarrow]
is an alias for pd.ArrowDtype(pa.boolean())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I see that it is actually the behaviour with this PR as well, but I thought I would have "fixed" that while making things consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I also see that I coded explicitly myself this expected dtype in the tests ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the ordering
object < (python, NaN) < (pyarrow, NaN) < (python, NA) < (pyarrow, NA)
when we compare (pyarrow, NA)
with anything we want the result to be as if we compared (pyarrow, NA)
with itself, which should result in boolean[pyarrow]
.
One case related to object dtype that is still failing is comparing with an object series that has mixed types: In [3]: ser1 = pd.Series(["a", None, "b"], dtype=pd.StringDtype("pyarrow", na_value=np.nan))
In [4]: ser2 = pd.Series(["a", None, 2], dtype=object)
In [5]: ser1 == ser2
...
File ~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:517, in ArrowExtensionArray._box_pa_array(cls, value, pa_type, copy)
514 pa_array = pa.array(value, type=pa_type, from_pandas=True)
515 except (pa.ArrowInvalid, pa.ArrowTypeError):
516 # GH50430: let pyarrow infer type, then cast
--> 517 pa_array = pa.array(value, from_pandas=True)
519 if pa_type is None and pa.types.is_duration(pa_array.type):
520 # Workaround https://github.com/apache/arrow/issues/37291
521 from pandas.core.tools.timedeltas import to_timedelta
...
ArrowTypeError: Expected bytes, got a 'int' object
In [6]: ser1 = pd.Series(["a", None, "b"], dtype=pd.StringDtype("python", na_value=np.nan))
In [7]: ser1 == ser2
Out[7]:
0 True
1 False
2 False So with just object dtype, such a comparison works. And it also works with the python-backed string dtype. But fails with the pyarrow-backed string dtype, because in this case the comparison defers to the ArrowExtensionArray implementation, which tries to convert the other side to a pyarrow array, which is not supported for mixed types. While we generally (although in many cases definitely not best practice) mixed-types object dtype in pandas. (but let's consider this for a separate issue/PR) |
Thank @jorisvandenbossche and @rhshadrach |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
Closes #60639
This does not yet handle the case of comparison to object dtype.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.