BUG: Fix construction of Categorical from pd.NA #31939
@@ -458,6 +458,14 @@ def test_constructor_with_categorical_categories(self):
        result = Categorical(["a", "b"], categories=CategoricalIndex(["a", "b", "c"]))
        tm.assert_categorical_equal(result, expected)

    def test_construction_with_na(self):
        # https://github.com/pandas-dev/pandas/issues/31927
        values = ["a", pd.NA]
        result = Categorical(np.array(values, dtype=object))
        expected = Categorical(values)

Review comment: I am not sure this is a very good test. It checks that a list and an object array give the same result (which is useful anyhow, as those should be consistent), but it does not test how they are now constructed (e.g. it won't "preserve" pd.NA, and this is also not tested).

Review comment: @dsaxton can you parameterize this on klass (np.array and list), then hard code the results in a categorical (meaning use _from_codes and an explicit list of categories)?

        tm.assert_categorical_equal(result, expected)
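
One way to read the reviewer's suggestion above is sketched below: the test is parametrized over the two input containers, and the expected value is built from explicit codes and categories instead of a second constructor call. This is an illustration, not the test that was merged. It uses the public `Categorical.from_codes` (the comment mentions `_from_codes`), the `make_values` parameter name and the ids are made up here, and it assumes a pandas version that already includes this fix.

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "make_values",  # hypothetical parameter name, not taken from the PR
    [list, lambda x: np.array(x, dtype=object)],
    ids=["list", "object-ndarray"],
)
def test_construction_with_na(make_values):
    # https://github.com/pandas-dev/pandas/issues/31927
    values = make_values(["a", pd.NA])
    result = pd.Categorical(values)

    # Hard-coded expectation: "a" gets code 0, the missing value gets the
    # sentinel code -1 and does not appear among the categories.
    expected = pd.Categorical.from_codes([0, -1], categories=["a"])

    assert result.codes.tolist() == expected.codes.tolist() == [0, -1]
    assert list(result.categories) == list(expected.categories) == ["a"]
    # The missing entry reads back as a missing value for both input types.
    assert pd.isna(result[1])
```

Comparing codes and categories separately keeps the sketch on public API; inside the pandas test suite, `tm.assert_categorical_equal(result, expected)` would be the more usual check.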

    def test_from_codes(self):

        # too few categories

@@ -111,3 +111,13 @@ def test_nested_tuples_duplicates(self):
        df3 = df.copy(deep=True)
        df3.loc[[(dti[0], "a")], "c2"] = 1.0
        tm.assert_frame_equal(df3, expected)

    def test_multiindex_from_product_contains_na(self):
        # https://github.com/pandas-dev/pandas/issues/31883
        values1 = [np.array([0.0, pd.NA], dtype="object"), ["a", "b"]]
        values2 = [np.array([0.0, np.nan], dtype="object"), ["a", "b"]]

Review comment: Wait, pd.NA is actually converted to np.nan here?

Review comment: Yes, probably not ideal (but better than an error). If merged, would a follow-up issue to make sure pd.NA is used make sense? Or I could mark that the referenced issue is not actually closed and comment there.

Review comment: No, pls do it here.

Review comment: Follow-ups are ok, but for relatively small things just fixing it in the same PR is better.

Review comment: I'd have to look more closely, but I'm not sure if having it return pd.NA instead of np.nan is an easy fix; this is already how it behaves for list input (which seems to be the documented behavior):

In [1]: import pandas as pd
   ...: values = ["a", pd.NA]
   ...: pd.Categorical(values)
Out[1]:
[a, NaN]
Categories (1, object): [a]

In [2]: pd.__version__
Out[2]: '1.0.1'

I think having it so that we at least get the same output and not an error for a numpy array with object dtype is still an improvement though? What are your thoughts @WillAyd?

Review comment: Agree it would be nice to maintain pd.NA. Do you know the extra effort involved to do so?

Review comment: I'm not sure, I'd need to investigate a bit more. The logic doesn't seem too obvious though; should

Review comment: Curious if @jorisvandenbossche has a preference? Always using

        result = pd.MultiIndex.from_product(values1)
        expected = pd.MultiIndex.from_product(values2)
        tm.assert_index_equal(result, expected)
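
The behaviour discussed in the comments can be inspected directly. The snippet below is a sketch, assuming a pandas version that already includes this fix; the variable names are chosen here for illustration. It builds the two indexes from the test and looks at the first level, where the missing entry is stored as code -1 rather than as a level value, so the original sentinel (pd.NA or np.nan) is not retained.

```python
import numpy as np
import pandas as pd

# Same inputs as in the test: one with pd.NA, one with np.nan.
with_na = [np.array([0.0, pd.NA], dtype="object"), ["a", "b"]]
with_nan = [np.array([0.0, np.nan], dtype="object"), ["a", "b"]]

mi_na = pd.MultiIndex.from_product(with_na)
mi_nan = pd.MultiIndex.from_product(with_nan)

# The first level holds only the non-missing value; missing entries are
# marked in the codes with -1 instead of being kept as a level value.
first_level_values = list(mi_na.levels[0])
first_level_codes = mi_na.codes[0].tolist()

# Both inputs end up as the same index, which is what the test asserts.
same_index = mi_na.equals(mi_nan)
print(first_level_values, first_level_codes, same_index)
```

Because the sentinel is not stored, returning pd.NA instead of np.nan would need a decision about what a missing code maps back to, which appears to be the open question in the thread.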