-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
REF: clearer Categorical/CategoricalIndex construction #24419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @topper-123! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on January 03, 2019 at 22:12 Hours UTC |
8f65ded
to
1c57a07
Compare
Codecov Report
@@ Coverage Diff @@
## master #24419 +/- ##
==========================================
+ Coverage 92.3% 92.3% +<.01%
==========================================
Files 163 163
Lines 51950 51948 -2
==========================================
- Hits 47953 47952 -1
+ Misses 3997 3996 -1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24419 +/- ##
==========================================
+ Coverage 92.38% 92.38% +<.01%
==========================================
Files 166 166
Lines 52485 52485
==========================================
+ Hits 48489 48490 +1
+ Misses 3996 3995 -1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some tests specifically around create_categorical_dtype
? Haven't gone over this in it's entirety, but have made some initial comments.
I've adjusted for doc string comments. Will look into the tests part tonight. |
pandas/core/arrays/categorical.py
Outdated
@@ -200,6 +200,71 @@ def contains(cat, key, container): | |||
return any(loc_ in container for loc_ in loc) | |||
|
|||
|
|||
def create_categorical_dtype(values=None, categories=None, ordered=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't belong here at all. Put this in pandas.core.dtypes.dtypes
. But you likely don't need a lot of these checks, which CategoricalDtype already does most of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok I think with a free function, though this maybe should just be CategoricalDtype._from_values_or_dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I like that method name. I've changed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to move as indicated
05101b3
to
66382ec
Compare
Tests have been added. |
pandas/core/arrays/categorical.py
Outdated
@@ -200,6 +200,71 @@ def contains(cat, key, container): | |||
return any(loc_ in container for loc_ in loc) | |||
|
|||
|
|||
def create_categorical_dtype(values=None, categories=None, ordered=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to move as indicated
@@ -530,3 +531,32 @@ def test_constructor_imaginary(self): | |||
c1 = Categorical(values) | |||
tm.assert_index_equal(c1.categories, Index(values)) | |||
tm.assert_numpy_array_equal(np.array(c1), np.array(values)) | |||
|
|||
|
|||
class TestCreateCategoricalDtype(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these need to move to pandas/tests/dtypes/test_dtypes.py
66382ec
to
11d9ac1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also pls run a perf check on category things
pandas/core/dtypes/dtypes.py
Outdated
def _from_values_or_dtype(cls, values=None, categories=None, ordered=None, | ||
dtype=None): | ||
""" | ||
Construct from the inputs used in :class:`Categorical` construction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you change this verbiage a bit. The first sentence is obsolete, this just constructs the dtype.
You an say it doesn't factorize, but its not an 'internal helper method' its a constructor for the dtype.
Wrt. performance. Have some problems getting asv working. Will look into that tomorrow. |
Most time checks are ok, but I get this:
This is for timing slicing a Categoricals with length of 3 million, so I've doubtful if this even matters? |
thanks @topper-123 can you merge master |
@TomAugspurger @jbrockmendel any comments |
LGTM at a glance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small comments
pandas/core/dtypes/dtypes.py
Outdated
ordered : bool, optional | ||
Designating if the categories are ordered. | ||
dtype : CategoricalDtype or the string "category", optional | ||
If ``CategoricalDtype`` cannot be used together with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is clearer without the "If"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, don't quite agree. Perhaps "If CategoricalDtype
, ..." (notice comma).
pandas/core/dtypes/dtypes.py
Outdated
ValueError: Cannot specify `categories` or `ordered` together with | ||
`dtype`. | ||
|
||
The supplied dtype takes precedence over values's dtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values's
--> values'
pandas/core/dtypes/dtypes.py
Outdated
@@ -408,7 +493,10 @@ def validate_categories(categories, fastpath=False): | |||
""" | |||
from pandas import Index | |||
|
|||
if not isinstance(categories, ABCIndexClass): | |||
if not fastpath and not is_list_like(categories, allow_sets=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think allow_sets=True
needs to be specified since it's the default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that's true. I added allow_sets=True, because a test failed. Maybe it was default False untul recently?
Anyway, I'll remove that bit.
pandas/tests/dtypes/test_dtypes.py
Outdated
[c, None, None, dtype2, dtype2], | ||
[c, ['x', 'y'], False, None, dtype2], | ||
]) | ||
def test_create_categorical_dtype( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the name has changed from create_categorical_dtype
to _from_values_or_dtype
, does it make sense to rename the test to reflect this, i.e. test_from_values_or_dtype
?
pandas/tests/dtypes/test_dtypes.py
Outdated
[None, ['a', 'b'], None, dtype2], | ||
[None, None, True, dtype2], | ||
]) | ||
def test_create_categorical_dtype_raises(self, values, categories, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
pandas/tests/dtypes/test_dtypes.py
Outdated
]) | ||
def test_create_categorical_dtype_raises(self, values, categories, | ||
ordered, | ||
dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just have all the parameters on a single line, like what you did with the test above?
pandas/tests/dtypes/test_dtypes.py
Outdated
ordered, | ||
dtype): | ||
msg = "Cannot specify `categories` or `ordered` together with `dtype`." | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line can be removed
935c8c1
to
346510e
Compare
Comments from @jschendel have been addressed. |
thanks! |
Construction ATM of
Categorical
andCategoricalIndex
each uses their own ways to construct the dtype. This makes the construction look more complex than it need to.By collecting the dtype construction in a shared function, a lot of things become more simple. An additional advantage is that we can now have a higher confidence in dtype being the same for the same inputs, easing reasoning about the construction phase.
The above is all internal changes, so no whatsnew note is supplied.
A very minor issue was discovered, where an error message for CategoricalDtype made it look like the problem was in
CategoricalIndex
, which is confusing, especially if we're not constructing aCategoricalIndex
.Otherwise the API is unchanged, but the code paths are now much simpler IMO.