Skip to content

REF: clearer Categorical/CategoricalIndex construction #24419

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jan 4, 2019

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Dec 25, 2018

Construction ATM of Categorical and CategoricalIndex each uses their own ways to construct the dtype. This makes the construction look more complex than it need to.

By collecting the dtype construction in a shared function, a lot of things become more simple. An additional advantage is that we can now have a higher confidence in dtype being the same for the same inputs, easing reasoning about the construction phase.

The above is all internal changes, so no whatsnew note is supplied.

A very minor issue was discovered, where an error message for CategoricalDtype made it look like the problem was in CategoricalIndex, which is confusing, especially if we're not constructing a CategoricalIndex.

>>> pd.api.type.CategoricalDtype('category')
TypeError: CategoricalIndex(...) must be called with a collection of some kind, 'category' was passed  # master
TypeError: Parameter 'categories' must be list-like, was 'category'  # this PR

Otherwise the API is unchanged, but the code paths are now much simpler IMO.

@pep8speaks
Copy link

pep8speaks commented Dec 25, 2018

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 03, 2019 at 22:12 Hours UTC

@topper-123 topper-123 force-pushed the cateorical_refactor branch 3 times, most recently from 8f65ded to 1c57a07 Compare December 25, 2018 01:33
@codecov
Copy link

codecov bot commented Dec 25, 2018

Codecov Report

Merging #24419 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24419      +/-   ##
==========================================
+ Coverage    92.3%    92.3%   +<.01%     
==========================================
  Files         163      163              
  Lines       51950    51948       -2     
==========================================
- Hits        47953    47952       -1     
+ Misses       3997     3996       -1
Flag Coverage Δ
#multiple 90.71% <100%> (-0.01%) ⬇️
#single 43% <81.25%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/dtypes/dtypes.py 95.38% <100%> (+0.04%) ⬆️
pandas/core/indexes/category.py 98.61% <100%> (-0.04%) ⬇️
pandas/core/arrays/categorical.py 95.33% <100%> (+0.02%) ⬆️
pandas/util/testing.py 87.84% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 159772d...1c57a07. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 25, 2018

Codecov Report

Merging #24419 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24419      +/-   ##
==========================================
+ Coverage   92.38%   92.38%   +<.01%     
==========================================
  Files         166      166              
  Lines       52485    52485              
==========================================
+ Hits        48489    48490       +1     
+ Misses       3996     3995       -1
Flag Coverage Δ
#multiple 90.81% <100%> (ø) ⬆️
#single 43.05% <81.81%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/dtypes/dtypes.py 95.55% <100%> (+0.21%) ⬆️
pandas/core/indexes/category.py 98.61% <100%> (-0.04%) ⬇️
pandas/core/arrays/categorical.py 95.41% <100%> (-0.06%) ⬇️
pandas/util/testing.py 88.09% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c9a0405...346510e. Read the comment docs.

@jschendel jschendel added Refactor Internal refactoring of code Categorical Categorical Data Type labels Dec 25, 2018
Copy link
Member

@jschendel jschendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some tests specifically around create_categorical_dtype? Haven't gone over this in it's entirety, but have made some initial comments.

@topper-123
Copy link
Contributor Author

I've adjusted for doc string comments. Will look into the tests part tonight.

@@ -200,6 +200,71 @@ def contains(cat, key, container):
return any(loc_ in container for loc_ in loc)


def create_categorical_dtype(values=None, categories=None, ordered=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't belong here at all. Put this in pandas.core.dtypes.dtypes. But you likely don't need a lot of these checks, which CategoricalDtype already does most of this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok I think with a free function, though this maybe should just be CategoricalDtype._from_values_or_dtype

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I like that method name. I've changed it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to move as indicated

@topper-123 topper-123 force-pushed the cateorical_refactor branch 2 times, most recently from 05101b3 to 66382ec Compare December 25, 2018 22:19
@topper-123
Copy link
Contributor Author

Tests have been added.

@@ -200,6 +200,71 @@ def contains(cat, key, container):
return any(loc_ in container for loc_ in loc)


def create_categorical_dtype(values=None, categories=None, ordered=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to move as indicated

@@ -530,3 +531,32 @@ def test_constructor_imaginary(self):
c1 = Categorical(values)
tm.assert_index_equal(c1.categories, Index(values))
tm.assert_numpy_array_equal(np.array(c1), np.array(values))


class TestCreateCategoricalDtype(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these need to move to pandas/tests/dtypes/test_dtypes.py

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also pls run a perf check on category things

def _from_values_or_dtype(cls, values=None, categories=None, ordered=None,
dtype=None):
"""
Construct from the inputs used in :class:`Categorical` construction.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change this verbiage a bit. The first sentence is obsolete, this just constructs the dtype.

You an say it doesn't factorize, but its not an 'internal helper method' its a constructor for the dtype.

@topper-123
Copy link
Contributor Author

Wrt. performance. Have some problems getting asv working. Will look into that tomorrow.

@jreback jreback added this to the 0.24.0 milestone Jan 1, 2019
@topper-123
Copy link
Contributor Author

Most time checks are ok, but I get this:

       before           after         ratio
     [b49136eb]       [935c8c16]
     <master>         <cateorical_refactor>
+        6.25±0μs       10.9±0.8μs     1.75  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_decr')
+        6.25±0μs       10.9±0.6μs     1.75  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_incr')

This is for timing slicing a Categoricals with length of 3 million, so I've doubtful if this even matters?

@jreback
Copy link
Contributor

jreback commented Jan 3, 2019

thanks @topper-123 can you merge master

@jreback
Copy link
Contributor

jreback commented Jan 3, 2019

@TomAugspurger @jbrockmendel any comments

@TomAugspurger
Copy link
Contributor

LGTM at a glance.

Copy link
Member

@jschendel jschendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments

ordered : bool, optional
Designating if the categories are ordered.
dtype : CategoricalDtype or the string "category", optional
If ``CategoricalDtype`` cannot be used together with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is clearer without the "If"

Copy link
Contributor Author

@topper-123 topper-123 Jan 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, don't quite agree. Perhaps "If CategoricalDtype, ..." (notice comma).

ValueError: Cannot specify `categories` or `ordered` together with
`dtype`.

The supplied dtype takes precedence over values's dtype:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values's --> values'

@@ -408,7 +493,10 @@ def validate_categories(categories, fastpath=False):
"""
from pandas import Index

if not isinstance(categories, ABCIndexClass):
if not fastpath and not is_list_like(categories, allow_sets=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think allow_sets=True needs to be specified since it's the default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's true. I added allow_sets=True, because a test failed. Maybe it was default False untul recently?

Anyway, I'll remove that bit.

[c, None, None, dtype2, dtype2],
[c, ['x', 'y'], False, None, dtype2],
])
def test_create_categorical_dtype(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the name has changed from create_categorical_dtype to _from_values_or_dtype, does it make sense to rename the test to reflect this, i.e. test_from_values_or_dtype?

[None, ['a', 'b'], None, dtype2],
[None, None, True, dtype2],
])
def test_create_categorical_dtype_raises(self, values, categories,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

])
def test_create_categorical_dtype_raises(self, values, categories,
ordered,
dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just have all the parameters on a single line, like what you did with the test above?

ordered,
dtype):
msg = "Cannot specify `categories` or `ordered` together with `dtype`."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line can be removed

@topper-123 topper-123 force-pushed the cateorical_refactor branch from 935c8c1 to 346510e Compare January 3, 2019 22:12
@topper-123
Copy link
Contributor Author

Comments from @jschendel have been addressed.

@jreback jreback merged commit c5166b6 into pandas-dev:master Jan 4, 2019
@jreback
Copy link
Contributor

jreback commented Jan 4, 2019

thanks!

@topper-123 topper-123 deleted the cateorical_refactor branch January 4, 2019 12:29
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants