Categoricals hash consistently #15143
Changes from 1 commit
```diff
@@ -4,7 +4,7 @@
 import numpy as np
 from pandas import _hash, Series, factorize, Categorical, Index
-from pandas.lib import infer_dtype
+from pandas.lib import is_bool_array
 from pandas.types.generic import ABCIndexClass, ABCSeries, ABCDataFrame
 from pandas.types.common import is_categorical_dtype
```
```diff
@@ -68,7 +68,7 @@ def adder(h, hashed_to_add):
     return h


-def hash_array(vals, encoding='utf8', hash_key=None):
+def hash_array(vals, encoding='utf8', hash_key=None, categorize=True):
     """
     Given a 1d array, return an array of deterministic integers.
```
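The new `categorize` keyword exists because hashing every element of a large object array is expensive when values repeat: factorizing first means each distinct value is hashed exactly once, and the per-element work becomes a cheap code lookup. A minimal pandas-free sketch of that idea, using a hypothetical `toy_hash` (any deterministic per-value hash illustrates the point; this is not pandas' actual hashing scheme):

```python
import hashlib

def toy_hash(value, hash_key="0123456789123456"):
    # Hypothetical stand-in for pandas' internal per-value hashing:
    # deterministic 64-bit integer derived from the value and a key.
    digest = hashlib.sha256((hash_key + repr(value)).encode("utf8")).digest()
    return int.from_bytes(digest[:8], "little")

def hash_with_categorize(values):
    # Factorize: assign each distinct value an integer code.
    categories, codes, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(categories)
            categories.append(v)
        codes.append(seen[v])
    # Hash each distinct value ONCE, then remap through the codes.
    cat_hashed = [toy_hash(c) for c in categories]
    return [cat_hashed[code] for code in codes]

def hash_direct(values):
    # The categorize=False path: hash every element individually.
    return [toy_hash(v) for v in values]

data = ["a", "b", "a", "c", "a", "b"]
# Same result either way; the factorized path only hashes 3 distinct
# values instead of all 6 elements.
assert hash_with_categorize(data) == hash_direct(data)
```

This is why the docstring recommends `categorize=True` for arrays with duplicates, while a caller that already knows the values are unique can skip the factorize step.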
```diff
@@ -80,53 +80,56 @@ def hash_array(vals, encoding='utf8', hash_key=None):
     encoding : string, default 'utf8'
         encoding for data & key when strings
     hash_key : string key to encode, default to _default_hash_key
+    categorize : bool, default True
+        Whether to first categorize object arrays before hashing. This is more
+        efficient when the array contains duplicate values.
```

Reviewer: add a versionadded tag here (0.20.0)

Reviewer: I think also add […]
```diff
     Returns
     -------
     1d uint64 numpy array of hash values, same length as the vals

     """

-    # work with cagegoricals as ints. (This check is above the complex
-    # check so that we don't ask numpy if categorical is a subdtype of
-    # complex, as it will choke.
     if hash_key is None:
         hash_key = _default_hash_key

+    # For categoricals, we hash the categories, then remap the codes to the
+    # hash values. (This check is above the complex check so that we don't ask
+    # numpy if categorical is a subdtype of complex, as it will choke.
     if is_categorical_dtype(vals.dtype):
-        vals = vals.codes
+        cat_hashed = hash_array(vals.categories.values, encoding, hash_key,
+                                categorize=False).astype(np.uint64, copy=False)
+        # Since `cat_hashed` is already distributed in the space of uint64s,
+        # we can just return after remapping the codes here
+        c = Series(vals)
+        return c.cat.rename_categories(cat_hashed).values.astype(np.uint64)
```

Reviewer: do this like below instead, rather than adding a keyword (where we do basically the same thing)

Author: I'm not sure I follow - we need to recurse through in case the values aren't objects. So we need to be able to handle […]

Reviewer: maybe show what the problem is; this code is too complicated

Author: Can you explain where you mean "like we do below"? I'm assuming you mean lines 125-130, where we handle object dtype. For categoricals, what we want to do is: […]

Author: I don't think adding this extra keyword overly complicates things, and I do think this is the simplest way to do this. I may not be understanding what you're trying to suggest here; perhaps if you could explain a bit better I might get it.

Reviewer: I looked again. This is reasonable. Ideally I don't want to repeat the factorize-hash-remap logic, as right now this code looks very similar to what you have below; maybe pull this out to a common helper function? I get that the problem is now […] That said, I'm not averse to allowing a caller to do this (as dask may have more information than pandas, e.g. the cardinality of the set, or even the uniques already?)

Author: Not quite. The […]
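The point of the categorical branch above is that the hash is a function of the *values*, not of the codes or the category ordering: the categories are hashed once, and every code is then remapped to the hash of the category it points at. A pandas-free sketch of that invariant (the `hash_value` helper is a hypothetical stand-in, not pandas' uint64 scheme):

```python
import hashlib

def hash_value(v, hash_key="0123456789123456"):
    # Hypothetical deterministic per-value hash, for illustration only.
    d = hashlib.sha256((hash_key + repr(v)).encode("utf8")).digest()
    return int.from_bytes(d[:8], "little")

def hash_categorical(codes, categories):
    # Hash the (usually small) categories array once, then remap each
    # code to the hash of the category it indexes.
    cat_hashed = [hash_value(c) for c in categories]
    return [cat_hashed[i] for i in codes]

# The same values ["b", "a", "c", "a"] under two category orderings:
h1 = hash_categorical([1, 0, 2, 0], ["a", "b", "c"])
h2 = hash_categorical([1, 2, 0, 2], ["c", "b", "a"])
assert h1 == h2  # hashes track values, not codes
```

Before this PR, hashing `vals.codes` directly would have made `h1` and `h2` differ even though both encode identical values, which is exactly the inconsistency the title describes.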
```diff
     # we'll be working with everything as 64-bit values, so handle this
     # 128-bit value early
     if np.issubdtype(vals.dtype, np.complex128):
         return hash_array(vals.real) + 23 * hash_array(vals.imag)

-    # MAIN LOGIC:
-    inferred = infer_dtype(vals)
-
     # First, turn whatever array this is into unsigned 64-bit ints, if we can
     # manage it.
-    if inferred == 'boolean':
+    if is_bool_array(vals):
         vals = vals.astype('u8')
-
-    if (np.issubdtype(vals.dtype, np.datetime64) or
-            np.issubdtype(vals.dtype, np.timedelta64) or
-            np.issubdtype(vals.dtype, np.number)) and vals.dtype.itemsize <= 8:
+    elif (np.issubdtype(vals.dtype, np.datetime64) or
+          np.issubdtype(vals.dtype, np.timedelta64) or
+          np.issubdtype(vals.dtype, np.number)) and vals.dtype.itemsize <= 8:
         vals = vals.view('u{}'.format(vals.dtype.itemsize)).astype('u8')
     else:
```
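The `vals.view('u{}'...).astype('u8')` line reinterprets the raw bytes of numeric, datetime64, and timedelta64 values as unsigned integers; it is a bit cast, not a numeric conversion. A stdlib sketch of the same cast for one float64, using `struct` in place of numpy's `.view`:

```python
import struct

def float64_bits(x):
    # Reinterpret the 8 raw bytes of a float64 as a uint64. This is a
    # bit cast (1.0 -> 0x3ff0000000000000), not a value conversion.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits

assert float64_bits(0.0) == 0
assert float64_bits(1.0) == 0x3FF0000000000000
```

Casting this way is what makes the subsequent integer mixing well defined for every fixed-width dtype up to 8 bytes, without caring what the bits originally meant.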
```diff
-        # its MUCH faster to categorize object dtypes, then hash and rename
-        codes, categories = factorize(vals, sort=False)
-        categories = Index(categories)
-        c = Series(Categorical(codes, categories,
-                               ordered=False, fastpath=True))
-        vals = _hash.hash_object_array(categories.values,
-                                       hash_key,
-                                       encoding)
-
-        # rename & extract
-        vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
+        # With repeated values, its MUCH faster to categorize object dtypes,
+        # then hash and rename categories. We allow skipping the categorization
+        # when the values are known/likely to be unique.
+        if categorize:
+            codes, categories = factorize(vals, sort=False)
+            c = Series(Categorical(codes, Index(categories),
+                                   ordered=False, fastpath=True))
+            vals = _hash.hash_object_array(categories, hash_key, encoding)
+            # rename & extract
+            vals = c.cat.rename_categories(vals).values.astype(np.uint64)
+        else:
+            vals = _hash.hash_object_array(vals, hash_key, encoding)

     # Then, redistribute these 64-bit ints within the space of 64-bit ints
     vals ^= vals >> 30
```
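The `vals ^= vals >> 30` at the end of the hunk is the opening xor-shift of a finalizer that scrambles the converted integers across the whole uint64 space. The constants below are from the public splitmix64 finalizer, which this step resembles; whether pandas applies exactly these rounds is an assumption here. A pure-Python sketch with explicit 64-bit wrapping:

```python
MASK64 = (1 << 64) - 1

def mix64(h):
    # splitmix64-style finalizer: alternating xor-shifts and odd-constant
    # multiplies so every input bit influences every output bit.
    h ^= h >> 30
    h = (h * 0xBF58476D1CE4E9B9) & MASK64  # wrap to 64 bits
    h ^= h >> 27
    h = (h * 0x94D049BB133111EB) & MASK64
    h ^= h >> 31
    return h

# Consecutive inputs land far apart, and the map is a bijection on
# 64-bit ints, so the finalizer introduces no extra collisions.
assert len({mix64(i) for i in range(1000)}) == 1000
```

The bijectivity matters: after the dtype cast, distinct inputs stay distinct through the mixing step, so collisions can only come from the underlying value hashing, not from this redistribution.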
```diff
@@ -90,6 +90,20 @@ def test_hash_pandas_empty_object(self):
         # these are by-definition the same with
         # or w/o the index as the data is empty

+    def test_categorical_consistency(self):
+        # Check that categoricals hash consistent with their values, not codes
+        # This should work for categoricals of any dtype
+        for data in [['a', 'b', 'c', 'd'], [1000, 2000, 3000, 4000]]:
+            s1 = Series(data)
+            s2 = s1.astype('category').cat.set_categories(data)
+            s3 = s2.cat.set_categories(list(reversed(data)))
+            # These should all hash identically
+            h1 = hash_pandas_object(s1)
+            h2 = hash_pandas_object(s2)
+            h3 = hash_pandas_object(s3)
+            tm.assert_series_equal(h1, h2)
+            tm.assert_series_equal(h1, h3)
+
     def test_errors(self):

         for obj in [pd.Timestamp('20130101'), tm.makePanel()]:
```

Reviewer (on the `for data in` line): can you test with datetimes as well
Reviewer: put a `()` as this is a function call, and `pandas` -> `pd`

Reviewer: add this PR number as the issue number