Description
While preparing a commit for another issue in .describe(), I encountered this puzzling bug, which is surprisingly easy to trigger.
Symptoms
df = pd.DataFrame({'A': list("BCDE"), 0: [1,2,3,4]})
df.describe()
# Long traceback listing formatting and internal functions...
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
However:
df.describe(include='all')
0 A
count 4.000000 4
unique NaN 4
top NaN D
freq NaN 1
mean 2.500000 NaN
std 1.290994 NaN
min 1.000000 NaN
25% 1.750000 NaN
50% 2.500000 NaN
75% 3.250000 NaN
max 4.000000 NaN
# It's OK if we don't print on screen:
x = df.describe()
x.columns
Out[8]: Index([0], dtype='int64')
# Fixing this suspicious index (int works too):
x.columns = x.columns.astype(object)
x
Out[10]:
0
count 4.000000
mean 2.500000
std 1.290994
min 1.000000
25% 1.750000
50% 2.500000
75% 3.250000
max 4.000000
The same issue occurs with a simpler data frame:
df0 = pd.DataFrame([1,2,3,4])
# It's OK now
df0.describe()
Out[28]:
0
count 4.000000
mean 2.500000
std 1.290994
min 1.000000
25% 1.750000
50% 2.500000
75% 3.250000
max 4.000000
# Modify column index:
df0.columns = pd.Index([0], dtype=object)
df0.describe()
# ...
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
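To check a given installation for the failure without reading a traceback, the two cases above can be wrapped in a small helper. This is only a sketch: describe_repr_ok is a name made up for this report, and it assumes the failure shows up as a ValueError while the result is being rendered, as in the tracebacks above.

```python
import pandas as pd


def describe_repr_ok(frame):
    """Return True if the describe() result of `frame` can be rendered.

    Hypothetical helper for this report, not part of pandas.
    """
    try:
        # The error in this report is raised while the result is
        # formatted for display, so force a repr() of it.
        repr(frame.describe())
    except ValueError:
        return False
    return True


# Default integer column index: reported as working.
plain = pd.DataFrame([1, 2, 3, 4])

# Same data, but the column index is forced to object dtype:
# this is the case reported as failing.
obj_cols = pd.DataFrame([1, 2, 3, 4])
obj_cols.columns = pd.Index([0], dtype=object)

print("default columns:", describe_repr_ok(plain))
print("object columns: ", describe_repr_ok(obj_cols))
```

On an unaffected build both lines should print True.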
Current version (but the bug is also present in pandas release 0.18.1):
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.20-1
machine: x86_64
processor: Intel(R)_Core(TM)_i5-2520M_CPU_@_2.50GHz
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.1+64.g7ed22fe.dirty
nose: 1.3.7
pip: 8.1.2
setuptools: 21.0.0
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.0.dev0+3f3c371
IPython: 4.0.1
...
Reason
Some internal function seems to get confused by the dtype of the column index, but the faulty index itself is created in .describe().
# Output from %debug df.describe()
# NDFrame.describe() in pandas/core/generic.py:
#
   4943             data = self
   4944         else:
   4945             data = self.select_dtypes(include=include, exclude=exclude)
   4946
   4947         ldesc = [describe_1d(s, percentiles) for _, s in data.iteritems()]
   4948         # set a convenient order for rows
   4949         names = []
   4950         ldesc_indexes = sorted([x.index for x in ldesc], key=len)
   4951         for idxnames in ldesc_indexes:
   4952             for name in idxnames:
   4953                 if name not in names:
   4954                     names.append(name)
   4955
   4956         d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
1> 4957         d.columns = self.columns._shallow_copy(values=d.columns.values)
   4958         d.columns.names = data.columns.names
   4959         return d
_shallow_copy() in the marked line changes d.columns from an Int64Index into a plain Index that still reports dtype='int64':
ipdb> p d.columns
Int64Index([0], dtype='int64')
ipdb> n
> /home/users/piotr/workspace/pandas-pijucha/pandas/core/generic.py(4958)describe()
1 4957 d.columns = self.columns._shallow_copy(values=d.columns.values)
-> 4958 d.columns.names = data.columns.names
4959 return d
ipdb> p d.columns
Index([0], dtype='int64')
Possible solutions
Lines 4957-4958 are actually there to fix issues that pd.concat brings about. They try to pass the column structure from self to d.
I think a simpler solution is replacing these lines with:
d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
d.columns = data.columns
return d
or
d = pd.DataFrame(pd.concat(ldesc, axis=1), index=pd.Index(names), columns=data.columns)
return d
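For experimenting outside the pandas source tree, the first variant can be sketched as a standalone function. Several things here are assumptions made for illustration and are not the code in generic.py: the name describe_numeric_sketch is made up, only numeric columns are handled, Series.describe() stands in for the internal describe_1d helper, and the rows are ordered with reindex() so the sketch does not depend on the join_axes parameter.

```python
import pandas as pd


def describe_numeric_sketch(frame):
    """Standalone illustration of the proposed column handling.

    Hypothetical helper: numeric columns only, not the pandas code.
    """
    # Sub-frame with the columns to describe; it keeps the labels
    # and names of the original column index.
    data = frame.select_dtypes(include="number")

    # One summary Series per column (stand-in for describe_1d).
    ldesc = [s.describe() for _, s in data.items()]

    # Set a convenient order for rows, as in the original code.
    names = []
    for idxnames in sorted((x.index for x in ldesc), key=len):
        for name in idxnames:
            if name not in names:
                names.append(name)

    d = pd.concat(ldesc, axis=1).reindex(names)
    # Proposed fix: take the column index straight from `data`
    # instead of rebuilding it with _shallow_copy().
    d.columns = data.columns
    return d


df = pd.DataFrame({"A": list("BCDE"), 0: [1, 2, 3, 4]})
result = describe_numeric_sketch(df)
print(result)
print(result.columns)
```

The only line that matters for this report is the assignment d.columns = data.columns; the rest mirrors the existing row-ordering logic.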
data is a subframe of self and retains the same column structure.
pd.concat has some parameters that help pass a hierarchical index, but it can't do anything on its own with a categorical one.
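The categorical case can be illustrated with a small frame. This is a sketch with made-up column labels; it only shows that the sub-frame returned by select_dtypes() keeps a CategoricalIndex on its columns, that pd.concat() rebuilds the columns from the Series names, and that assigning data.columns restores the original index type.

```python
import pandas as pd

# Frame whose column index is categorical (labels are illustrative).
df = pd.DataFrame({"x": [1, 2], "y": [3.0, 4.0], "z": ["p", "q"]})
df.columns = pd.CategoricalIndex(["x", "y", "z"])

# The numeric sub-frame keeps the categorical column index.
data = df.select_dtypes(include="number")
print(type(data.columns).__name__)

# Concatenating per-column summaries builds the columns from the
# Series names, so the categorical index is not carried over.
ldesc = [s.describe() for _, s in data.items()]
d = pd.concat(ldesc, axis=1)
columns_after_concat = d.columns
print(type(columns_after_concat).__name__)

# Taking the columns from `data` restores the original structure.
d.columns = data.columns
print(type(d.columns).__name__)
```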
I'm going to submit a pull request with this fix, together with some others related to describe(). I hope I haven't overlooked anything obvious; if I have, any comments are very welcome.