
Categorical Column GroupBy agg with as_index=False produces NaN rows 7.5X Slower with unexpected extra Cardinality #15217

Closed
@dragoljub


Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10,100, size=(200,6)), columns=['C'+str(i) for i in range(6)])
df['C0'] = ['A','B','C','D']*50
df['C1'] = ['E','F']*100
df['C2'] = ['H','I','J','K', 'L']*40

for col in df.columns[:3]:
    df[col] = df[col].astype('category')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
C0    200 non-null category
C1    200 non-null category
C2    200 non-null category
C3    200 non-null int32
C4    200 non-null int32
C5    200 non-null int32
dtypes: category(3), int32(3)
memory usage: 3.1 KB

%time ix_true = df.groupby(df.columns.tolist()[:3], as_index=True)['C5'].max()
Wall time: 2 ms

ix_true.shape
(20,)

%time ix_false = df.groupby(df.columns.tolist()[:3], as_index=False)['C5'].max()
Wall time: 15 ms

ix_false.shape
(40, 4)

ix_true
C0  C1  C2
A   E   H     93
        I     99
        J     88
        K     91
        L     94
B   F   H     98
        I     89
        J     94
        K     92
        L     96
C   E   H     96
        I     96
        J     85
        K     88
        L     98
D   F   H     96
        I     84
        J     71
        K     96
        L     94
Name: C5, dtype: int32

ix_false
	C0	C1	C2	C5
0	A	E	H	93.0
1	A	E	I	99.0
2	A	E	J	88.0
3	A	E	K	91.0
4	A	E	L	94.0
5	A	F	H	NaN
6	A	F	I	NaN
7	A	F	J	NaN
8	A	F	K	NaN
9	A	F	L	NaN
10	B	E	H	NaN
11	B	E	I	NaN
12	B	E	J	NaN
13	B	E	K	NaN
14	B	E	L	NaN
15	B	F	H	98.0
16	B	F	I	89.0
17	B	F	J	94.0
18	B	F	K	92.0
19	B	F	L	96.0
20	C	E	H	96.0
21	C	E	I	96.0
22	C	E	J	85.0
23	C	E	K	88.0
24	C	E	L	98.0
25	C	F	H	NaN
26	C	F	I	NaN
27	C	F	J	NaN
28	C	F	K	NaN
29	C	F	L	NaN
30	D	E	H	NaN
31	D	E	I	NaN
32	D	E	J	NaN
33	D	E	K	NaN
34	D	E	L	NaN
35	D	F	H	96.0
36	D	F	I	84.0
37	D	F	J	71.0
38	D	F	K	96.0
39	D	F	L	94.0

Problem description

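Calling df.groupby(df.columns.tolist()[:3], as_index=False)['C5'].max() on categorical key columns returns 40 rows instead of 20. The 20 extra rows are key combinations that never occur in the data; their C5 value is NaN, which also turns the int32 column into floats. The call is also about 7.5x slower (15 ms vs. 2 ms) than the as_index=True version.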
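The row counts can be checked independently of groupby. The following is a minimal sketch, not part of the original report: it rebuilds only the three key columns and compares the Cartesian product of their categories with the combinations that actually occur. The names `all_combos`, `seen_combos` and `unobserved` are made up for illustration.

```python
import itertools

import pandas as pd

# Rebuild only the three categorical key columns from the report.
df = pd.DataFrame({
    'C0': ['A', 'B', 'C', 'D'] * 50,
    'C1': ['E', 'F'] * 100,
    'C2': ['H', 'I', 'J', 'K', 'L'] * 40,
})
keys = ['C0', 'C1', 'C2']
for col in keys:
    df[col] = df[col].astype('category')

# Every combination of categories, whether or not it occurs in the data.
all_combos = set(itertools.product(*(df[k].cat.categories for k in keys)))

# Combinations that actually appear in at least one row.
seen_combos = set(zip(*(df[k] for k in keys)))

# Combinations that exist only in the Cartesian product.
unobserved = sorted(all_combos - seen_combos)

print(len(all_combos), len(seen_combos), len(unobserved))
print(unobserved[:5])
```

If this reasoning is right, the 40 rows of ix_false are the full 4 x 2 x 5 product, and the NaN rows (for example A, F, H at index 5) are exactly the combinations listed in `unobserved`.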

Expected Output

I expect the as_index=False output to contain no extra key combinations and to have the same number of rows (20) as the as_index=True case.
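For readers on a newer pandas than the 0.19.2 used in this report, a possible way to get the expected shape is the `observed` keyword of `groupby`, which to my knowledge was added in pandas 0.23. The sketch below assumes such a version is installed. It uses a fixed seed (an addition for reproducibility, not in the original sample), so the values differ from the output shown above, but the shapes do not depend on the random data.

```python
import numpy as np
import pandas as pd

# Same construction as the code sample, with a fixed seed (assumption:
# the seed is only here so the sketch is reproducible).
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(10, 100, size=(200, 6)),
                  columns=['C' + str(i) for i in range(6)])
df['C0'] = ['A', 'B', 'C', 'D'] * 50
df['C1'] = ['E', 'F'] * 100
df['C2'] = ['H', 'I', 'J', 'K', 'L'] * 40

keys = df.columns.tolist()[:3]
for col in keys:
    df[col] = df[col].astype('category')

# observed=True restricts the result to category combinations that
# actually occur in the data (requires a pandas version that has it).
obs_true = df.groupby(keys, as_index=True, observed=True)['C5'].max()
obs_false = df.groupby(keys, as_index=False, observed=True)['C5'].max()

print(obs_true.shape)   # one value per observed key combination
print(obs_false.shape)  # same rows, with the keys as regular columns
print(obs_false['C5'].isna().sum())
```

With `observed=True`, both calls should return one row per observed combination, so the as_index=False result has no NaN rows to filter out afterwards.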

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.1
setuptools: 20.10.1
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.7.1
IPython: 4.2.0
sphinx: 1.3.6
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.2.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
