REF: clearer Categorical/CategoricalIndex construction #24419

topper-123 · 2018-12-25T00:44:53Z

At the moment, Categorical and CategoricalIndex each use their own way of constructing the dtype. This makes the construction look more complex than it needs to be.

By collecting the dtype construction in a shared function, a lot of things become simpler. An additional advantage is that we can now have higher confidence that the dtype is the same for the same inputs, which makes it easier to reason about the construction phase.

The above is all internal changes, so no whatsnew note is supplied.

A very minor issue was discovered, where an error message for CategoricalDtype made it look like the problem was in CategoricalIndex, which is confusing, especially if we're not constructing a CategoricalIndex.

>>> pd.api.types.CategoricalDtype('category')
TypeError: CategoricalIndex(...) must be called with a collection of some kind, 'category' was passed  # master
TypeError: Parameter 'categories' must be list-like, was 'category'  # this PR

Otherwise the API is unchanged, but the code paths are now much simpler IMO.
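For readers who want to check the error message themselves, here is a minimal sketch. It assumes a pandas release that includes this change; the helper name `categories_error_message` is hypothetical and only exists for this illustration.

```python
import pandas as pd


def categories_error_message(bad_categories):
    """Return the TypeError message raised for non-list-like categories.

    Hypothetical helper for illustration; not part of pandas.
    """
    try:
        pd.api.types.CategoricalDtype(bad_categories)
    except TypeError as err:
        return str(err)
    return None  # no error was raised


# A plain string is not list-like, so it is rejected as `categories`.
message = categories_error_message("category")
print(message)
```

With this PR applied, the message should refer to the `categories` parameter rather than to CategoricalIndex.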

pep8speaks · 2018-12-25T00:44:58Z

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 03, 2019 at 22:12 Hours UTC

codecov · 2018-12-25T01:57:28Z

Codecov Report

Merging #24419 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24419      +/-   ##
==========================================
+ Coverage    92.3%    92.3%   +<.01%     
==========================================
  Files         163      163              
  Lines       51950    51948       -2     
==========================================
- Hits        47953    47952       -1     
+ Misses       3997     3996       -1

Flag	Coverage Δ
#multiple	`90.71% <100%> (-0.01%)`	⬇️
#single	`43% <81.25%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/dtypes/dtypes.py	`95.38% <100%> (+0.04%)`	⬆️
pandas/core/indexes/category.py	`98.61% <100%> (-0.04%)`	⬇️
pandas/core/arrays/categorical.py	`95.33% <100%> (+0.02%)`	⬆️
pandas/util/testing.py	`87.84% <0%> (+0.09%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 159772d...1c57a07. Read the comment docs.

codecov · 2018-12-25T01:57:29Z

Codecov Report

Merging #24419 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24419      +/-   ##
==========================================
+ Coverage   92.38%   92.38%   +<.01%     
==========================================
  Files         166      166              
  Lines       52485    52485              
==========================================
+ Hits        48489    48490       +1     
+ Misses       3996     3995       -1

Flag	Coverage Δ
#multiple	`90.81% <100%> (ø)`	⬆️
#single	`43.05% <81.81%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/dtypes/dtypes.py	`95.55% <100%> (+0.21%)`	⬆️
pandas/core/indexes/category.py	`98.61% <100%> (-0.04%)`	⬇️
pandas/core/arrays/categorical.py	`95.41% <100%> (-0.06%)`	⬇️
pandas/util/testing.py	`88.09% <0%> (+0.09%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c9a0405...346510e. Read the comment docs.

jschendel

Can you add some tests specifically around create_categorical_dtype? Haven't gone over this in its entirety, but have made some initial comments.

pandas/core/arrays/categorical.py

topper-123 · 2018-12-25T09:42:42Z

I've adjusted for doc string comments. Will look into the tests part tonight.

jreback · 2018-12-25T17:24:10Z

pandas/core/arrays/categorical.py

@@ -200,6 +200,71 @@ def contains(cat, key, container):
        return any(loc_ in container for loc_ in loc)


+def create_categorical_dtype(values=None, categories=None, ordered=None,


this doesn't belong here at all. Put this in pandas.core.dtypes.dtypes. But you likely don't need a lot of these checks, since CategoricalDtype already does most of this.

I am ok I think with a free function, though this maybe should just be CategoricalDtype._from_values_or_dtype

yes, I like that method name. I've changed it.

this needs to move as indicated
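A minimal sketch of how the renamed constructor can be called, assuming a pandas release that includes this PR. Note that `_from_values_or_dtype` is a private classmethod, so its signature may change without notice; the variable names below are made up for the illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Private API: used here only to illustrate the discussion above.
built = CategoricalDtype._from_values_or_dtype(
    categories=["a", "b"], ordered=True
)

dtype1 = CategoricalDtype(["a", "b"], ordered=True)
dtype2 = CategoricalDtype(["x", "y"], ordered=False)
values = pd.Categorical(["a", "b"], dtype=dtype1)

# With no explicit dtype, the dtype of categorical `values` is reused.
from_values = CategoricalDtype._from_values_or_dtype(values)

# An explicitly supplied dtype takes precedence over the dtype of `values`.
from_dtype = CategoricalDtype._from_values_or_dtype(values, dtype=dtype2)

print(built, from_values, from_dtype, sep="\n")
```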

pandas/core/dtypes/dtypes.py

topper-123 · 2018-12-25T23:27:16Z

Tests have been added.

jreback · 2018-12-30T22:50:26Z

pandas/core/arrays/categorical.py

@@ -200,6 +200,71 @@ def contains(cat, key, container):
        return any(loc_ in container for loc_ in loc)


+def create_categorical_dtype(values=None, categories=None, ordered=None,


this needs to move as indicated

jreback · 2018-12-30T22:51:36Z

pandas/tests/arrays/categorical/test_constructors.py

@@ -530,3 +531,32 @@ def test_constructor_imaginary(self):
        c1 = Categorical(values)
        tm.assert_index_equal(c1.categories, Index(values))
        tm.assert_numpy_array_equal(np.array(c1), np.array(values))
+
+
+class TestCreateCategoricalDtype(object):


these need to move to pandas/tests/dtypes/test_dtypes.py

jreback

also pls run a perf check on category things

pandas/core/arrays/categorical.py

jreback · 2019-01-01T00:17:10Z

pandas/core/dtypes/dtypes.py

+    def _from_values_or_dtype(cls, values=None, categories=None, ordered=None,
+                              dtype=None):
+        """
+        Construct from the inputs used in :class:`Categorical` construction.


can you change this verbiage a bit. The first sentence is obsolete, this just constructs the dtype.

You can say it doesn't factorize, but it's not an 'internal helper method'; it's a constructor for the dtype.
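To make the "doesn't factorize" point concrete, here is a small sketch, again assuming a pandas release that includes this PR and relying on the private `_from_values_or_dtype` classmethod. It contrasts the dtype constructor with `Categorical`, which does infer categories from the values.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

raw_values = ["b", "a", "b"]

# Only the dtype is constructed: non-categorical values are not factorized,
# so no categories are inferred from them.
dtype_only = CategoricalDtype._from_values_or_dtype(values=raw_values)

# Categorical itself factorizes the values to find the categories.
categorical = pd.Categorical(raw_values)

print(dtype_only.categories)
print(list(categorical.categories))
```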

pandas/core/dtypes/dtypes.py

topper-123 · 2019-01-01T02:00:07Z

Wrt. performance: I'm having some problems getting asv working. Will look into that tomorrow.

topper-123 · 2019-01-03T19:24:49Z

Most time checks are ok, but I get this:

       before           after         ratio
     [b49136eb]       [935c8c16]
     <master>         <cateorical_refactor>
+        6.25±0μs       10.9±0.8μs     1.75  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_decr')
+        6.25±0μs       10.9±0.6μs     1.75  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_incr')

This is for timing slicing of a Categorical with a length of 3 million, so I'm doubtful this even matters?
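For a rough local check outside asv, a sketch along these lines could be used. This is not the asv benchmark itself: the data layout (three categories, sorted codes), the slice `cat[:-1]`, and the number of runs are all assumptions made for the illustration.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed setup: 3 million values over three categories, with sorted codes
# so that the values are monotonically increasing.
codes = np.repeat(np.arange(3), 1_000_000)
cat = pd.Categorical.from_codes(codes, categories=["a", "b", "c"])


def slice_once():
    """Take a slice of the Categorical (assumed stand-in for the benchmark)."""
    return cat[:-1]


runs = 1000
seconds = timeit.timeit(slice_once, number=runs)
per_call_us = seconds / runs * 1e6
print(f"{per_call_us:.2f} microseconds per slice")
```

Running this on both branches gives a per-slice figure to compare with the table above, although absolute numbers will differ by machine.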

jreback · 2019-01-03T19:30:22Z

thanks @topper-123 can you merge master

jreback · 2019-01-03T19:30:35Z

@TomAugspurger @jbrockmendel any comments

TomAugspurger · 2019-01-03T19:37:33Z

LGTM at a glance.

jschendel

Some small comments

jschendel · 2019-01-03T19:50:47Z

pandas/core/dtypes/dtypes.py

+        ordered : bool, optional
+            Designating if the categories are ordered.
+        dtype : CategoricalDtype or the string "category", optional
+            If ``CategoricalDtype`` cannot be used together with


I think this is clearer without the "If"

Hmm, don't quite agree. Perhaps "If CategoricalDtype, ..." (notice comma).

jschendel · 2019-01-03T19:53:25Z

pandas/core/dtypes/dtypes.py

+        ValueError: Cannot specify `categories` or `ordered` together with
+        `dtype`.
+
+        The supplied dtype takes precedence over values's dtype:


values's --> values'

jschendel · 2019-01-03T19:54:14Z

pandas/core/dtypes/dtypes.py

@@ -408,7 +493,10 @@ def validate_categories(categories, fastpath=False):
        """
        from pandas import Index

-        if not isinstance(categories, ABCIndexClass):
+        if not fastpath and not is_list_like(categories, allow_sets=True):


I don't think allow_sets=True needs to be specified since it's the default?

Hmm, that's true. I added allow_sets=True because a test failed. Maybe the default was False until recently?

Anyway, I'll remove that bit.
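The default discussed here can be checked with the public `pandas.api.types.is_list_like` function. A minimal sketch, assuming a pandas version that has the `allow_sets` parameter:

```python
from pandas.api.types import is_list_like

checks = {
    "list": is_list_like(["a", "b"]),
    # Sets count as list-like unless allow_sets=False is passed.
    "set (default)": is_list_like({"a", "b"}),
    "set (allow_sets=False)": is_list_like({"a", "b"}, allow_sets=False),
    # Strings are not list-like, which is why 'category' is rejected
    # as a value for `categories`.
    "string": is_list_like("category"),
}

for label, result in checks.items():
    print(f"{label}: {result}")
```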

jschendel · 2019-01-03T19:56:51Z

pandas/tests/dtypes/test_dtypes.py

+                                 [c, None, None, dtype2, dtype2],
+                                 [c, ['x', 'y'], False, None, dtype2],
+                             ])
+    def test_create_categorical_dtype(


Since the name has changed from create_categorical_dtype to _from_values_or_dtype, does it make sense to rename the test to reflect this, i.e. test_from_values_or_dtype?

jschendel · 2019-01-03T19:57:37Z

pandas/tests/dtypes/test_dtypes.py

+        [None, ['a', 'b'], None, dtype2],
+        [None, None, True, dtype2],
+    ])
+    def test_create_categorical_dtype_raises(self, values, categories,


jschendel · 2019-01-03T20:00:04Z

pandas/tests/dtypes/test_dtypes.py

+    ])
+    def test_create_categorical_dtype_raises(self, values, categories,
+                                             ordered,
+                                             dtype):


Can you just have all the parameters on a single line, like what you did with the test above?

jschendel · 2019-01-03T20:00:13Z

pandas/tests/dtypes/test_dtypes.py

+                                             ordered,
+                                             dtype):
+        msg = "Cannot specify `categories` or `ordered` together with `dtype`."
+


blank line can be removed
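The raising cases under review can be reproduced outside pytest with a plain loop. This is a sketch, not the test from the PR: it uses only the two parameter rows visible in the diff above, and it relies on the private `_from_values_or_dtype` classmethod.

```python
from pandas.api.types import CategoricalDtype

dtype2 = CategoricalDtype(["x", "y"], ordered=False)
expected_msg = "Cannot specify `categories` or `ordered` together with `dtype`."

# (values, categories, ordered, dtype) rows taken from the diff above.
cases = [
    (None, ["a", "b"], None, dtype2),
    (None, None, True, dtype2),
]

messages = []
for values, categories, ordered, dtype in cases:
    try:
        CategoricalDtype._from_values_or_dtype(
            values, categories, ordered, dtype
        )
    except ValueError as err:
        # Combining `dtype` with `categories` or `ordered` is rejected.
        messages.append(str(err))

print(messages)
```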

topper-123 · 2019-01-04T08:48:19Z

Comments from @jschendel have been addressed.

jreback · 2019-01-04T12:14:29Z

thanks!

topper-123 force-pushed the cateorical_refactor branch 3 times, most recently from 8f65ded to 1c57a07 Compare December 25, 2018 01:33

jschendel added Refactor Internal refactoring of code Categorical Categorical Data Type labels Dec 25, 2018

jschendel reviewed Dec 25, 2018

jreback requested changes Dec 25, 2018

topper-123 force-pushed the cateorical_refactor branch 2 times, most recently from 05101b3 to 66382ec Compare December 25, 2018 22:19

jreback requested changes Dec 30, 2018

topper-123 force-pushed the cateorical_refactor branch from 66382ec to 11d9ac1 Compare December 31, 2018 23:33

jreback requested changes Jan 1, 2019

jreback added this to the 0.24.0 milestone Jan 1, 2019

jreback mentioned this pull request Jan 1, 2019

API: Add dtype parameter to Categorical.from_codes #24398

Merged

3 tasks

jschendel reviewed Jan 3, 2019

topper-123 added 5 commits January 3, 2019 22:11

Improve error messages

65c66b0

REF: clearer construction of Categorical/CategoricalIndex

60f7bec

Change doc string according to comments

7252322

move new constructor to dtypes/dtypes.py

e33dcee

adjust doc string

51d363a

change according to comments

346510e

topper-123 force-pushed the cateorical_refactor branch from 935c8c1 to 346510e Compare January 3, 2019 22:12

jreback approved these changes Jan 4, 2019

jreback merged commit c5166b6 into pandas-dev:master Jan 4, 2019

topper-123 deleted the cateorical_refactor branch January 4, 2019 12:29

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

REF: clearer Categorical/CategoricalIndex construction (pandas-dev#24419)

af561c4

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

REF: clearer Categorical/CategoricalIndex construction (pandas-dev#24419)

747639d
