Description
When doing a groupby on more than one column, the resulting MultiIndex does not seem to preserve the original column dtypes. I noticed it when working with Categorical columns: I expected a CategoricalIndex when grouping on them, but this is only the case when grouping on just one column.
I did see that the behaviour was discussed in a PR, but it ultimately was not addressed.
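To look at the index itself rather than at the dtypes after reset_index, something along these lines may help. It is only a sketch: the names `single` and `multi` are mine, the frame is built with `pd.Categorical(...)` instead of the `astype('category', categories=...)` form used in the sample (an assumption so that it also runs on releases that may not accept that keyword), and only column 'c' is aggregated so that the categorical column 'b' is never summed.

```python
import pandas as pd

cats = list('xyz')
df = pd.DataFrame({
    'a': pd.Categorical(list('xyxxyz'), categories=cats),
    'b': pd.Categorical(list('yzzyxz'), categories=cats),
    'c': [1, 2, 3, 4, 5, 6],
})

# Index produced by grouping on a single categorical column.
single = df.groupby('a')['c'].sum().index

# Index produced by grouping on two categorical columns.
multi = df.groupby(['a', 'b'])['c'].sum().index

print(type(single).__name__)
print(type(multi).__name__)

# Each level of the MultiIndex carries its own dtype; this is where
# the loss of the categorical dtype would show up.
for level in multi.levels:
    print(level.name, level.dtype)
```

What the loop prints depends on the pandas version, so no expected output is given here.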
Code Sample, a copy-pastable example if possible
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({
...: 'a': pd.Series(list('xyxxyz')).astype('category', categories=list('xyz')),
...: 'b': pd.Series(list('yzzyxz')).astype('category', categories=list('xyz')),
...: 'c': [1,2,3,4,5,6]
...: })
In [3]: df.groupby('a').sum().reset_index().dtypes
Out[3]:
a category
c int64
dtype: object
In [4]: df.groupby(['a', 'b']).sum().reset_index().dtypes
Out[4]:
a object
b object
c float64
dtype: object
Expected Output
In [4]: df.groupby(['a', 'b']).sum().reset_index().dtypes
Out[4]:
a category
b category
c int64
dtype: object
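In the meantime, a possible workaround (a sketch only, not a fix for the underlying behaviour) is to re-cast the key columns after reset_index using the categories of the original columns. The names `keys` and `result` are mine, and the frame is again built with `pd.Categorical(...)` as an assumption for portability across versions. Note that this does not restore the int64 dtype of 'c' when unobserved combinations introduce NaN.

```python
import pandas as pd

cats = list('xyz')
df = pd.DataFrame({
    'a': pd.Categorical(list('xyxxyz'), categories=cats),
    'b': pd.Categorical(list('yzzyxz'), categories=cats),
    'c': [1, 2, 3, 4, 5, 6],
})

keys = ['a', 'b']
result = df.groupby(keys).sum().reset_index()

# Re-apply the categorical dtype, reusing the categories (and their order)
# from the original columns rather than inferring them from the result.
for key in keys:
    result[key] = pd.Categorical(result[key],
                                 categories=df[key].cat.categories)

print(result.dtypes)
```

Reusing `df[key].cat.categories` keeps categories that do not occur in the grouped result, which inferring them from the result values would drop.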
output of pd.show_versions()
In [5]: pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.13
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.18.1+240.gbb6b5e5
nose: None
pip: 8.1.2
setuptools: 19.4
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 5.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.3
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.14
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None