Description
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10,100, size=(200,6)), columns=['C'+str(i) for i in range(6)])
df['C0'] = ['A','B','C','D']*50
df['C1'] = ['E','F']*100
df['C2'] = ['H','I','J','K', 'L']*40
for col in df.columns[:3]:
    df[col] = df[col].astype('category')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
C0 200 non-null category
C1 200 non-null category
C2 200 non-null category
C3 200 non-null int32
C4 200 non-null int32
C5 200 non-null int32
dtypes: category(3), int32(3)
memory usage: 3.1 KB
%time ix_true = df.groupby(df.columns.tolist()[:3], as_index=True)['C5'].max()
Wall time: 2 ms
ix_true.shape
(20,)
%time ix_false = df.groupby(df.columns.tolist()[:3], as_index=False)['C5'].max()
Wall time: 15 ms
ix_false.shape
(40, 4)
ix_true
C0  C1  C2
A   E   H     93
        I     99
        J     88
        K     91
        L     94
B   F   H     98
        I     89
        J     94
        K     92
        L     96
C   E   H     96
        I     96
        J     85
        K     88
        L     98
D   F   H     96
        I     84
        J     71
        K     96
        L     94
Name: C5, dtype: int32
ix_false
C0 C1 C2 C5
0 A E H 93.0
1 A E I 99.0
2 A E J 88.0
3 A E K 91.0
4 A E L 94.0
5 A F H NaN
6 A F I NaN
7 A F J NaN
8 A F K NaN
9 A F L NaN
10 B E H NaN
11 B E I NaN
12 B E J NaN
13 B E K NaN
14 B E L NaN
15 B F H 98.0
16 B F I 89.0
17 B F J 94.0
18 B F K 92.0
19 B F L 96.0
20 C E H 96.0
21 C E I 96.0
22 C E J 85.0
23 C E K 88.0
24 C E L 98.0
25 C F H NaN
26 C F I NaN
27 C F J NaN
28 C F K NaN
29 C F L NaN
30 D E H NaN
31 D E I NaN
32 D E J NaN
33 D E K NaN
34 D E L NaN
35 D F H 96.0
36 D F I 84.0
37 D F J 71.0
38 D F K 96.0
39 D F L 94.0
Problem description
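Using as_index=False in df.groupby(df.columns.tolist()[:3], as_index=False)['C5'].max() with categorical grouping columns produces extra output rows filled with NaN. The result has one row for every combination of the categories (4 × 2 × 5 = 40), including the 20 combinations that never occur in the data, whereas as_index=True returns only the 20 observed groups.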
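To make the row counts concrete, here is a minimal sketch that counts the possible category combinations against the ones that actually occur. It rebuilds only the three key columns from the sample above; the names c0, c1, c2, possible and observed are introduced here for illustration and are not part of the original report.

```python
import pandas as pd

# The three grouping columns from the sample, built as standalone categoricals.
c0 = pd.Categorical(['A', 'B', 'C', 'D'] * 50)
c1 = pd.Categorical(['E', 'F'] * 100)
c2 = pd.Categorical(['H', 'I', 'J', 'K', 'L'] * 40)

# Size of the full Cartesian product of the category sets.
possible = len(c0.categories) * len(c1.categories) * len(c2.categories)

# Number of distinct (C0, C1, C2) combinations present in the 200 rows.
observed = len(set(zip(c0, c1, c2)))

print(possible, observed)  # 40 20
```

The 40 matches the row count of ix_false and the 20 matches ix_true, which suggests the as_index=False path is expanding to the full product of categories.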
Expected Output
I expect the output to contain no extra rows from the cardinality explosion and to have the same number of rows as the as_index=True case.
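As a hedged sketch of how the expected result can be obtained, the snippet below assumes a newer pandas release than the 0.19.2 reported here: the observed keyword of groupby was added in pandas 0.23 and does not exist in 0.19.2. The fixed seed and the names rng, keys, flat and indexed are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Same construction as the sample in this report, with a fixed seed.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(10, 100, size=(200, 6)),
                  columns=['C' + str(i) for i in range(6)])
df['C0'] = ['A', 'B', 'C', 'D'] * 50
df['C1'] = ['E', 'F'] * 100
df['C2'] = ['H', 'I', 'J', 'K', 'L'] * 40

keys = df.columns.tolist()[:3]
for col in keys:
    df[col] = df[col].astype('category')

# observed=True keeps only category combinations that occur in the data
# (requires pandas >= 0.23).
flat = df.groupby(keys, as_index=False, observed=True)['C5'].max()

# Alternative: group with the keys as the index, then move them back to columns.
indexed = df.groupby(keys, observed=True)['C5'].max().reset_index()

print(flat.shape)                # (20, 4)
print(flat['C5'].isna().sum())   # 0
```

On the version reported here, where as_index=True already returns only the 20 observed groups, calling reset_index() on that result would be the analogous workaround.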
Output of pd.show_versions()
pandas: 0.19.2
nose: 1.3.7
pip: 8.1.1
setuptools: 20.10.1
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.7.1
IPython: 4.2.0
sphinx: 1.3.6
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.2.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None