Categoricals hash consistently #15143
Conversation
Previously categorical values were hashed using just their codes. This meant that the hash value depended on the ordering of the categories, rather than on the values the series represented. This caused problems in dask, where different partitions might have different categorical mappings.

This PR makes the hashing dependent on the values the categorical represents, rather than on the codes. The categories are first hashed, and then the codes are remapped to the hashed values. This is slightly slower than before (we now hash the categories, which we didn't before), but allows for more consistent hashing.
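To illustrate the property this PR targets, here is a small sketch (the `pandas.tools.hashing` import path matches the whatsnew entry in this PR; exact API details are hedged accordingly):

```python
import pandas as pd
from pandas.tools.hashing import hash_pandas_object  # import path as referenced in this PR

s = pd.Series(['a', 'b', 'c', 'd'])

# Two categoricals with the same values but different category orderings
c1 = s.astype('category').cat.set_categories(['a', 'b', 'c', 'd'])
c2 = s.astype('category').cat.set_categories(['d', 'c', 'b', 'a'])

# After this change all three hash identically: the hash depends on the
# represented values, not on the integer codes
assert (hash_pandas_object(s, index=False) ==
        hash_pandas_object(c1, index=False)).all()
assert (hash_pandas_object(c1, index=False) ==
        hash_pandas_object(c2, index=False)).all()
```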
    if is_categorical_dtype(vals.dtype):
        vals = vals.codes
        cat_hashed = hash_array(vals.categories.values, encoding, hash_key,
do this like below instead, rather than adding a keyword (where we do basically the same thing)
I'm not sure I follow - we need to recurse through in case the values aren't objects. So we need to be able to handle `vals.categories.values` of any dtype. However, we don't want to categorize the categories again, since we already know they're unique. This seemed to me the simplest and cleanest way of implementing this.
maybe show what the problem is
`hash_array` can hash values JUST like we do below
this code is too complicated
Can you explain where you mean "like we do below"? I'm assuming you mean lines 125-130, where we handle object dtype.

For categoricals, what we want to do is:

- Get a hash for the `categories`. This should match what `hash_pandas_object(series, index=False)` would return for the un-categorized data. Meaning `hash_pandas_object(object_series) == hash_pandas_object(object_series.astype('category'))`.
- Remap the category hashes based on the codes.

`vals.categories.values` can be of any dtype, so we need to recurse through `hash_array` again to get the hashes of the categories. However, we also know that the categories are already unique, so we don't want to call `factorize` again. As such, we set `categorize=False` to skip that. We can't just call `hash_object_array`, as the categories may not be objects. And we need to do the remapping, so we can't just set `vals = something` and fall through like we did before.

I don't think adding this extra keyword overly complicates things, and I do think this is the simplest way to do this. I may not be understanding what you're trying to suggest here - perhaps if you could explain a bit better I might get it.
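A minimal sketch of that hash-then-remap logic, assuming the `categorize` keyword described above (the function name and the sentinel used for missing values are illustrative, not the PR's exact code):

```python
import numpy as np
from pandas.tools.hashing import hash_array  # import path as referenced in this PR

def hash_categorical_sketch(cat, encoding='utf8', hash_key=None):
    # Hash the unique categories once. categorize=False skips the
    # redundant factorize step, since categories are already unique.
    cat_hashed = hash_array(np.asarray(cat.categories), encoding, hash_key,
                            categorize=False).astype(np.uint64)
    # Remap: each code indexes into the category hashes, so the result
    # depends only on the represented values, not on category order.
    codes = np.asarray(cat.codes)
    result = cat_hashed[codes]
    # Codes of -1 mark missing values; assign them a fixed sentinel
    # (an assumption in this sketch) rather than whatever fancy
    # indexing with -1 would otherwise pick up.
    result[codes == -1] = np.uint64(0)
    return result
```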
I looked again. This is reasonable. Ideally we don't want to repeat the factorize-hash-remap logic, as right now this code looks very similar to what you have. Maybe pull this out to a common helper function?

I get that the problem is now `hash_object_array` will only de-duplicate object types, when of course we could have any dtypes. Though in practice it is generally not a big deal to simply hash non-objects even with very many duplicates (as simply finding the duplicates can be somewhat expensive). In my tests, only for object does this make a huge difference.

That said, I'm not averse to allowing a caller to do this (as dask may have more information than pandas, e.g. the cardinality of the set, or even the uniques already?).
> I get that the problem is now `hash_object_array` will only de-duplicate object types, when of course we could have any dtypes. Though in practice it is generally not a big deal to simply hash non-objects even with very many duplicates (as simply finding the duplicates can be somewhat expensive). In my tests, only for object does this make a huge difference.

Not quite. The `categorize` parameter only indicates whether to categorize object arrays before hashing. Other arrays are treated the same as before. The reason for this is that when we hash the `categories` attribute of a categorical (and then map the codes to the hashes of the categories), we already know that it is unique, so we don't need to re-categorize before hashing (which would happen for object categoricals, but not for others). This might be more clear now that the `_hash_categorical` helper is pulled out.
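For comparison, a sketch of what `categorize=True` does for a plain object array - factorize, hash the uniques, remap - which is exactly the step that would be redundant for already-unique categories (names here are illustrative):

```python
import numpy as np
import pandas as pd
from pandas.tools.hashing import hash_array  # import path as referenced in this PR

def hash_object_array_dedup(vals, encoding='utf8', hash_key=None):
    # Factorize so each unique object is hashed exactly once; this is
    # what categorize=True buys for object arrays with many duplicates.
    # (Handling of missing values, code == -1, is omitted here.)
    codes, uniques = pd.factorize(vals, sort=False)
    uniques_hashed = hash_array(np.asarray(uniques), encoding, hash_key,
                                categorize=False).astype(np.uint64)
    # Broadcast the unique hashes back through the codes
    return uniques_hashed[codes]
```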
    @@ -80,53 +80,56 @@ def hash_array(vals, encoding='utf8', hash_key=None):
        encoding : string, default 'utf8'
            encoding for data & key when strings
        hash_key : string key to encode, default to _default_hash_key
        categorize : bool, default True
            Whether to first categorize object arrays before hashing. This is more
            efficient when the array contains duplicate values.
add a versionadded tag here (0.20.0)
I think also add `categorize` to `hash_pandas_object` for consistency as well?
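Assuming `categorize` is added to `hash_pandas_object` as suggested, usage would look something like this sketch; the flag is a performance knob only, so the hashes should match either way:

```python
import pandas as pd
from pandas.tools.hashing import hash_pandas_object  # import path as referenced in this PR

s = pd.Series(['a', 'b'] * 50000)  # object dtype with many duplicates

# categorize=True (the default) factorizes the object values and hashes
# each unique string once; categorize=False hashes every element
# directly. The resulting hashes are identical.
h1 = hash_pandas_object(s, index=False, categorize=True)
h2 = hash_pandas_object(s, index=False, categorize=False)
assert (h1 == h2).all()
```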
    @@ -309,6 +309,7 @@ Bug Fixes
    - Bug in ``pd.read_csv()`` in which the ``dialect`` parameter was not being verified before processing (:issue:`14898`)
    - Bug in ``pd.read_fwf`` where the skiprows parameter was not being respected during column width inference (:issue:`11256`)
    - Bug in ``pd.read_csv()`` in which missing data was being improperly handled with ``usecols`` (:issue:`6710`)
    - Bug in ``pandas.tools.hashing.hash_pandas_object`` in which hashing of categoricals depended on the ordering of categories, instead of just their values.
put a `()` as this is a function call, and `pandas` -> `pd`
add this PR number as the issue number
    def test_categorical_consistency(self):
        # Check that categoricals hash consistent with their values, not codes
        # This should work for categoricals of any dtype
        for data in [['a', 'b', 'c', 'd'], [1000, 2000, 3000, 4000]]:
can you test with datetimes as well
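A sketch of what that added datetime coverage might look like (hypothetical test body, mirroring the values-vs-codes comparison above):

```python
import pandas as pd
import pandas.util.testing as tm
from pandas.tools.hashing import hash_pandas_object  # import path as referenced in this PR

s = pd.Series(pd.date_range('2012-01-01', periods=4))
expected = hash_pandas_object(s, index=False)

# Same datetime values, two different category orderings
c1 = s.astype('category').cat.set_categories(list(s))
c2 = s.astype('category').cat.set_categories(list(s)[::-1])

for c in [c1, c2]:
    tm.assert_series_equal(expected, hash_pandas_object(c, index=False))
```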
- Add `categorize` parameter to `hash_pandas_object`
- Update test
- Update whatsnew
Thanks for the review. Addressed all comments.

lgtm. ping on green. (note appveyor is currently recovering, so may take a while) :<

Windows tests passed, travis failures look unrelated?

thanks @jcrist
Previously categorical values were hashed using just their codes. This meant that the hash value depended on the ordering of the categories, rather than on the values the series represented. This caused problems in dask, where different partitions might have different categorical mappings.

This PR makes the hashing dependent on the values the categorical represents, rather than on the codes. The categories are first hashed, and then the codes are remapped to the hashed values. This is slightly slower than before (we now hash the categories, which we didn't before), but allows for more consistent hashing.

Related to this work in dask: dask/dask#1877.

Author: Jim Crist <[email protected]>

Closes pandas-dev#15143 from jcrist/categories_hash_consistently and squashes the following commits:

f1aea13 [Jim Crist] Address comments
7878c55 [Jim Crist] Categoricals hash consistently