Categoricals hash consistently #15143


Closed
wants to merge 2 commits
Changes from 1 commit
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v0.20.0.txt
@@ -309,6 +309,7 @@ Bug Fixes
- Bug in ``pd.read_csv()`` in which the ``dialect`` parameter was not being verified before processing (:issue:`14898`)
- Bug in ``pd.read_fwf`` where the skiprows parameter was not being respected during column width inference (:issue:`11256`)
- Bug in ``pd.read_csv()`` in which missing data was being improperly handled with ``usecols`` (:issue:`6710`)
- Bug in ``pandas.tools.hashing.hash_pandas_object`` in which hashing of categoricals depended on the ordering of categories, instead of just their values.
Contributor: put a () as this is a function call, and pandas -> pd

Contributor: add this PR number as the issue number


- Bug in ``DataFrame.loc`` with indexing a ``MultiIndex`` with a ``Series`` indexer (:issue:`14730`)

@@ -369,4 +370,4 @@ Bug Fixes
- Bug in ``Series`` constructor when both ``copy=True`` and ``dtype`` arguments are provided (:issue:`15125`)
- Bug in ``pd.read_csv()`` for the C engine where ``usecols`` were being indexed incorrectly with ``parse_dates`` (:issue:`14792`)

- Bug in ``Series.dt.round`` inconsistent behaviour on NAT's with different arguments (:issue:`14940`)
- Bug in ``Series.dt.round`` inconsistent behaviour on NAT's with different arguments (:issue:`14940`)
57 changes: 30 additions & 27 deletions pandas/tools/hashing.py
@@ -4,7 +4,7 @@

import numpy as np
from pandas import _hash, Series, factorize, Categorical, Index
from pandas.lib import infer_dtype
from pandas.lib import is_bool_array
from pandas.types.generic import ABCIndexClass, ABCSeries, ABCDataFrame
from pandas.types.common import is_categorical_dtype

@@ -68,7 +68,7 @@ def adder(h, hashed_to_add):
        return h


def hash_array(vals, encoding='utf8', hash_key=None):
def hash_array(vals, encoding='utf8', hash_key=None, categorize=True):
"""
Given a 1d array, return an array of deterministic integers.

@@ -80,53 +80,56 @@ def hash_array(vals, encoding='utf8', hash_key=None):
    encoding : string, default 'utf8'
        encoding for data & key when strings
    hash_key : string key to encode, default to _default_hash_key
    categorize : bool, default True
        Whether to first categorize object arrays before hashing. This is more
        efficient when the array contains duplicate values.

Contributor: add a versionadded tag here (0.20.0)

Contributor: I think also add categorize to hash_pandas_object for consistency as well?
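(A hypothetical sketch of the intent; hash_pandas_object does not take the keyword in this commit, and the wrapper name below is made up:)

import pandas as pd
from pandas.tools.hashing import hash_array

def hash_series_values(s, encoding='utf8', hash_key=None, categorize=True):
    # Hypothetical wrapper: forwards `categorize` to hash_array the way a
    # `categorize` keyword on hash_pandas_object itself would.
    return pd.Series(hash_array(s.values, encoding, hash_key,
                                categorize=categorize),
                     index=s.index, dtype='uint64')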

    Returns
    -------
    1d uint64 numpy array of hash values, same length as the vals

    """

    # work with categoricals as ints. (This check is above the complex
    # check so that we don't ask numpy if categorical is a subdtype of
    # complex, as it will choke.)
    if hash_key is None:
        hash_key = _default_hash_key

    # For categoricals, we hash the categories, then remap the codes to the
    # hash values. (This check is above the complex check so that we don't ask
    # numpy if categorical is a subdtype of complex, as it will choke.)
    if is_categorical_dtype(vals.dtype):
        vals = vals.codes
        cat_hashed = hash_array(vals.categories.values, encoding, hash_key,
Contributor: do this like below instead, rather than adding a keyword (where we do basically the same thing)

Contributor Author: I'm not sure I follow - we need to recurse through in case the values aren't objects. So we need to be able to handle vals.categories.values of any dtype. However, we don't want to categorize the categories again, since we already know they're unique. This seemed to me the simplest and cleanest way of implementing this.

Contributor: maybe show what the problem is; hash_array can hash values JUST like we do below. This code is too complicated.

Contributor Author:
Can you explain where you mean "like we do below"? I'm assuming you mean lines 125-130, where we handle object dtype.

For categoricals, what we want to do is:

  • Get a hash for the categories. This should match what hash_pandas_object(series, index=False) would return for the un-categorized data. Meaning hash_pandas_object(object_series) == hash_pandas_object(object_series.astype('category')).
  • Remap the category hashes based on the codes.

vals.categories.values can be of any dtype, so we need to recurse through hash_array again to get the hashes of the categories. However, we also know that the categories are already unique, so we don't want to call factorize again. As such, we set categorize=False to skip that. We can't just call hash_object_array, as the categories may not be objects. And we need to do the remapping, so we can't just set vals = something and fall through like we did before.

I don't think adding this extra keyword overly complicates things, and do think this is the simplest way to do this. I may not be understanding what you're trying to suggest here - perhaps if you could explain a bit better I might get it.
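Concretely, the invariant described above (a sketch, assuming this branch):

import pandas as pd
from pandas.tools.hashing import hash_pandas_object

s = pd.Series(['a', 'b', 'c', 'a'])
cat = s.astype('category')
# Reordering the categories changes the codes but not the values,
# so the hashes must not change either.
reordered = cat.cat.set_categories(['c', 'b', 'a'])

assert hash_pandas_object(s, index=False).equals(
    hash_pandas_object(cat, index=False))
assert hash_pandas_object(cat, index=False).equals(
    hash_pandas_object(reordered, index=False))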

Contributor: I looked again. This is reasonable. Ideally we don't want to repeat the factorize-hash-remap logic, as right now this code looks very similar to what you have. Maybe pull this out to a common helper function?

I get that the problem is that hash_object_array will only de-duplicate object types, when of course we could have any dtype. Though in practice it is generally not a big deal to simply hash non-objects even with very many duplicates (as simply finding the duplicates can be somewhat expensive). In my tests, only for object does this make a huge difference.

That said, I'm not averse to allowing a caller to do this (as dask may have more information than pandas, e.g. the cardinality of the set, or even the uniques already).

Contributor Author:

> I get that the problem is that hash_object_array will only de-duplicate object types, when of course we could have any dtype. Though in practice it is generally not a big deal to simply hash non-objects even with very many duplicates (as simply finding the duplicates can be somewhat expensive). In my tests, only for object does this make a huge difference.

Not quite. The categorize parameter only indicates whether to categorize object arrays before hashing. Other arrays are treated the same as before. The reason for this is that when we hash the categories attribute on a categorical (and then map the codes to the hashes of the categories), we already know that it is unique, so we don't need to re-categorize before hashing (which would happen for object categoricals, but not for others). This might be more clear now that the _hash_categorical helper is pulled out.
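Roughly, the pulled-out helper looks like this (a sketch mirroring the inline code in this commit; the final form may differ):

import numpy as np
from pandas import Series

def _hash_categorical(cat, encoding, hash_key):
    # Hash the categories (already unique, hence categorize=False), then
    # remap each code to the hash of its category.
    cat_hashed = hash_array(cat.categories.values, encoding, hash_key,
                            categorize=False).astype(np.uint64, copy=False)
    c = Series(cat)
    return c.cat.rename_categories(cat_hashed).values.astype(np.uint64)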

                                categorize=False).astype(np.uint64, copy=False)
        # Since `cat_hashed` is already distributed in the space of uint64s,
        # we can just return after remapping the codes here
        c = Series(vals)
        return c.cat.rename_categories(cat_hashed).values.astype(np.uint64)

    # we'll be working with everything as 64-bit values, so handle this
    # 128-bit value early
    if np.issubdtype(vals.dtype, np.complex128):
        return hash_array(vals.real) + 23 * hash_array(vals.imag)

    # MAIN LOGIC:
    inferred = infer_dtype(vals)

    # First, turn whatever array this is into unsigned 64-bit ints, if we can
    # manage it.
    if inferred == 'boolean':
    if is_bool_array(vals):
        vals = vals.astype('u8')

    if (np.issubdtype(vals.dtype, np.datetime64) or
            np.issubdtype(vals.dtype, np.timedelta64) or
            np.issubdtype(vals.dtype, np.number)) and vals.dtype.itemsize <= 8:

    elif (np.issubdtype(vals.dtype, np.datetime64) or
          np.issubdtype(vals.dtype, np.timedelta64) or
          np.issubdtype(vals.dtype, np.number)) and vals.dtype.itemsize <= 8:
        vals = vals.view('u{}'.format(vals.dtype.itemsize)).astype('u8')
    else:

        # it's MUCH faster to categorize object dtypes, then hash and rename
        codes, categories = factorize(vals, sort=False)
        categories = Index(categories)
        c = Series(Categorical(codes, categories,
                               ordered=False, fastpath=True))
        vals = _hash.hash_object_array(categories.values,
                                       hash_key,
                                       encoding)

        # rename & extract
        vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
        # With repeated values, it's MUCH faster to categorize object dtypes,
        # then hash and rename categories. We allow skipping the categorization
        # when the values are known/likely to be unique.
        if categorize:
            codes, categories = factorize(vals, sort=False)
            c = Series(Categorical(codes, Index(categories),
                                   ordered=False, fastpath=True))
            vals = _hash.hash_object_array(categories, hash_key, encoding)
            # rename & extract
            vals = c.cat.rename_categories(vals).values.astype(np.uint64)
        else:
            vals = _hash.hash_object_array(vals, hash_key, encoding)

    # Then, redistribute these 64-bit ints within the space of 64-bit ints
    vals ^= vals >> 30
14 changes: 14 additions & 0 deletions pandas/tools/tests/test_hashing.py
@@ -90,6 +90,20 @@ def test_hash_pandas_empty_object(self):
        # these are by-definition the same with
        # or w/o the index as the data is empty

    def test_categorical_consistency(self):
        # Check that categoricals hash consistently with their values,
        # not their codes. This should work for categoricals of any dtype.
        for data in [['a', 'b', 'c', 'd'], [1000, 2000, 3000, 4000]]:
Contributor: can you test with datetimes as well

            s1 = Series(data)
            s2 = s1.astype('category').cat.set_categories(data)
            s3 = s2.cat.set_categories(list(reversed(data)))
            # These should all hash identically
            h1 = hash_pandas_object(s1)
            h2 = hash_pandas_object(s2)
            h3 = hash_pandas_object(s3)
            tm.assert_series_equal(h1, h2)
            tm.assert_series_equal(h1, h3)
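A datetime case could look like this (a sketch responding to the review request above; the method name is made up, it is not part of this commit, and it assumes the module's existing imports):

    def test_categorical_consistency_datetimes(self):
        # Same invariant as above, for datetime64 data.
        data = list(pd.date_range('2017-01-01', periods=4))
        s1 = Series(data)
        s2 = s1.astype('category').cat.set_categories(data)
        s3 = s2.cat.set_categories(list(reversed(data)))
        tm.assert_series_equal(hash_pandas_object(s1), hash_pandas_object(s2))
        tm.assert_series_equal(hash_pandas_object(s1), hash_pandas_object(s3))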

    def test_errors(self):

        for obj in [pd.Timestamp('20130101'), tm.makePanel()]: