Improve categorical concat speed by ~20x #10597
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1715,18 +1715,20 @@ def _convert_to_list_like(list_like): | |
return [list_like] | ||
|
||
def _concat_compat(to_concat, axis=0): | ||
""" | ||
provide concatenation of an object/categorical array of arrays each of which is a single dtype | ||
"""Concatenate an object/categorical array of arrays, each of which is a | ||
single dtype | ||
|
||
Parameters | ||
---------- | ||
to_concat : array of arrays | ||
axis : axis to provide concatenation | ||
in the current impl this is always 0, e.g. we only have 1-d categoricals | ||
axis : int | ||
Axis to provide concatenation in the current implementation this is | ||
always 0, e.g. we only have 1D categoricals | ||
|
||
Returns | ||
------- | ||
a single array, preserving the combined dtypes | ||
Categorical | ||
A single array, preserving the combined dtypes | ||
""" | ||
|
||
def convert_categorical(x): | ||
|
@@ -1735,31 +1737,34 @@ def convert_categorical(x): | |
return x.get_values() | ||
return x.ravel() | ||
|
||
typs = get_dtype_kinds(to_concat) | ||
if not len(typs-set(['object','category'])): | ||
|
||
# we only can deal with object & category types | ||
pass | ||
|
||
else: | ||
|
||
if get_dtype_kinds(to_concat) - set(['object', 'category']): | ||
here

that's only if there's stuff besides object and category ... i have only object and category then the else can definitely be hit

your comment just below that says […]

@jreback here's an example of how to hit that code path

    In [1]: s = pd.Series(list('aabbcd')*1000000)
    In [2]: s2 = pd.Series(list('aabbcd')*1000000).astype('category')
    In [3]: pd.concat([s,s2])

no I get that

why not? we have […] what else is there to check?

ok, back at computer. I think the issue is what to do with an […] So I would do this: have all matching categories (exactly), with no object, in a single path

nvm. I just realized that you are doing exactly what I was saying :) duh. go ahead.

so i think you're pointing out something that i'm not addressing:

    In [4]: s = pd.Series(list('aabbcd'))
    In [5]: s2 = pd.Series(list('aabb')).astype('category')
    In [6]: pd.concat([s,s2])
    Out[6]:
    0      a
    1      a
    2      b
    3      b
    4    NaN
    5    NaN
    0      a
    1      a
    2      b
    3      b
    dtype: category
    Categories (2, object): [a, b]

However, that was run on current master and is a separate issue from the performance of concatenation.

I think I can fix this by simply not passing the second argument to the else branch
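The mixed object/categorical case raised in this thread can be checked with public pandas API alone. The sketch below is an illustration and not part of the PR: the helper name `values_survive_concat` is hypothetical, and the result depends on the installed pandas version (the `NaN` output quoted above was observed on master at the time of the review).

```python
import pandas as pd


def values_survive_concat(left, right):
    """Return True if pd.concat([left, right]) keeps every input value.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    combined = pd.concat([left, right])
    expected = list(left) + list(right)
    # Compare as plain Python objects so the check does not depend on dtype.
    actual = combined.astype(object).tolist()
    return actual == expected


obj = pd.Series(list("aabbcd"))                   # plain string values
cat = pd.Series(list("aabb")).astype("category")  # categories: a, b

print(values_survive_concat(obj, cat))
```

If the values `c` and `d` were replaced by `NaN`, as in the session above, the helper would report `False`.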
||
# convert to object type and perform a regular concat | ||
from pandas.core.common import _concat_compat | ||
return _concat_compat([ np.array(x,copy=False).astype('object') for x in to_concat ],axis=0) | ||
return _concat_compat([np.array(x, copy=False, dtype=object) | ||
for x in to_concat], axis=0) | ||
|
||
# we could have object blocks and categorical's here | ||
# if we only have a single cateogoricals then combine everything | ||
# we could have object blocks and categoricals here | ||
# if we only have a single categoricals then combine everything | ||
# else its a non-compat categorical | ||
categoricals = [ x for x in to_concat if is_categorical_dtype(x.dtype) ] | ||
objects = [ x for x in to_concat if is_object_dtype(x.dtype) ] | ||
categoricals = [x for x in to_concat if is_categorical_dtype(x.dtype)] | ||
|
||
# validate the categories | ||
categories = None | ||
for x in categoricals: | ||
if categories is None: | ||
categories = x.categories | ||
if not categories.equals(x.categories): | ||
categories = categoricals[0] | ||
rawcats = categories.categories | ||
for x in categoricals[1:]: | ||
if not categories.is_dtype_equal(x): | ||
raise ValueError("incompatible categories in categorical concat") | ||
|
||
# concat them | ||
return Categorical(np.concatenate([ convert_categorical(x) for x in to_concat ],axis=0), categories=categories) | ||
# we've already checked that all categoricals are the same, so if their | ||
# length is equal to the input then we have all the same categories | ||
if len(categoricals) == len(to_concat): | ||
# concating numeric types is much faster than concating object types | ||
I don't think you can ever hit the else. The cats have to be exactly the same by definition (otherwise you are raising). So the path where you cast everything to object (then regular concat) should be there.

you can definitely hit the […]

that would be hit above (2 different dtypes)

how so? if […]

that logic only handles the case of having other dtypes besides object and category

the check on len bothers me. that makes sense if everything is a categorical and identical, but by definition that would be true as you just checked dtypes

not if you're concatenating an object dtype with a categorical dtype
||
# and fastpath takes a shorter path through the constructor | ||
return Categorical(np.concatenate([x.codes for x in to_concat], axis=0), | ||
rawcats, | ||
ordered=categoricals[0].ordered, | ||
fastpath=True) | ||
else: | ||
concatted = np.concatenate(list(map(convert_categorical, to_concat)), | ||
axis=0) | ||
return Categorical(concatted, rawcats) |
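The fast path in the new code relies on one observation: once every input is known to share the same categories, only the integer codes need to be concatenated, and the categories can be reused as they are. A minimal NumPy-only sketch of that idea follows. The names `concat_codes` and `decode` are hypothetical, and this is not the pandas implementation.

```python
import numpy as np


def concat_codes(code_arrays):
    """Concatenate integer code arrays that index into shared categories."""
    return np.concatenate(code_arrays, axis=0)


def decode(codes, categories):
    """Map integer codes back to category values.

    Assumes every code is a valid index (no -1 "missing" codes).
    """
    return np.asarray(categories, dtype=object)[codes]


categories = ["a", "b", "c", "d"]
codes = np.array([0, 0, 1, 1, 2, 3], dtype=np.int8)

# Fast route: join the small integer arrays, decode once at the end.
fast = decode(concat_codes([codes, codes]), categories)

# Slow route: decode each input to objects first, then join object arrays.
slow = np.concatenate([decode(codes, categories), decode(codes, categories)])

print(fast.tolist() == slow.tolist())
```

Both routes produce the same values; the difference is that the first one only moves small integers around until the final decoding step.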
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
from vbench.benchmark import Benchmark | ||
from datetime import datetime | ||
|
||
common_setup = """from pandas_vb_common import * | ||
""" | ||
|
||
#---------------------------------------------------------------------- | ||
# Series constructors | ||
|
||
setup = common_setup + """ | ||
s = pd.Series(list('aabbcd') * 1000000).astype('category') | ||
""" | ||
|
||
concat_categorical = \ | ||
Benchmark("concat([s, s])", setup=setup, name='concat_categorical', | ||
start_date=datetime(year=2015, month=7, day=15)) |
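The vbench entry above times `concat([s, s])` on a categorical Series of six million rows. For a quick local check without vbench, a rough version using the standard-library `timeit` module might look like the following. The reduced input size and the helper name `time_concat` are assumptions made for illustration, and no particular speedup is implied.

```python
import timeit

import pandas as pd


def time_concat(series, repeat=3):
    """Return the best wall-clock time in seconds of pd.concat([series, series])."""
    return min(timeit.repeat(lambda: pd.concat([series, series]),
                             number=1, repeat=repeat))


# Far smaller than the benchmark's 1,000,000 repetitions so the sketch runs quickly.
s_obj = pd.Series(list("aabbcd") * 10000)
s_cat = s_obj.astype("category")

print("object:      %.6f s" % time_concat(s_obj))
print("categorical: %.6f s" % time_concat(s_cat))
```

Timings vary by machine and pandas version, so the numbers are only meaningful relative to each other on the same setup.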
maybe add
assert axis == 0
?