BUG: DataFrame.describe() breaks with a column index of object type and numeric entries #13288

Closed
@pijucha

While preparing a commit for another issue in .describe(), I ran into this puzzling bug, which is surprisingly easy to trigger.

Symptoms

import pandas as pd

df = pd.DataFrame({'A': list("BCDE"), 0: [1, 2, 3, 4]})
df.describe()
# Long traceback listing formatting and internal functions...
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

However:

df.describe(include='all')
               0    A
count   4.000000    4
unique       NaN    4
top          NaN    D
freq         NaN    1
mean    2.500000  NaN
std     1.290994  NaN
min     1.000000  NaN
25%     1.750000  NaN
50%     2.500000  NaN
75%     3.250000  NaN
max     4.000000  NaN

# describe() itself succeeds; only displaying the result fails:
x = df.describe()
x.columns
Out[8]: Index([0], dtype='int64')

# Casting this suspicious index to object fixes the display (casting to int works too):
x.columns = x.columns.astype(object)
x
Out[10]: 
              0
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000

The same issue occurs with a simpler DataFrame:

df0 = pd.DataFrame([1, 2, 3, 4])
# With the default column index, describe() displays fine:
df0.describe()
Out[28]: 
              0
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000

# Replace the column index with an object-dtype index holding the same numeric label:
df0.columns = pd.Index([0], dtype=object)
df0.describe()
# ...
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
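
To make the difference between the two column indexes explicit, here is a minimal sketch. `column_index_summary` is a hypothetical helper written for this illustration, not part of pandas, and the exact class reported for the default index may vary between pandas versions.

```python
import pandas as pd


def column_index_summary(frame):
    """Hypothetical helper: report class name, dtype and labels of frame.columns."""
    cols = frame.columns
    return type(cols).__name__, str(cols.dtype), list(cols)


df0 = pd.DataFrame([1, 2, 3, 4])
default_summary = column_index_summary(df0)

# Same single label 0, but now stored in an object-dtype Index.
df0.columns = pd.Index([0], dtype=object)
object_summary = column_index_summary(df0)

print(default_summary)
print(object_summary)
```

Both frames hold the label 0; only the dtype of the index that carries it differs, and that appears to be enough to trigger the failure in the affected versions.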

Current version (the bug is also present in the pandas 0.18.1 release):

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.20-1
machine: x86_64
processor: Intel(R)_Core(TM)_i5-2520M_CPU_@_2.50GHz
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1+64.g7ed22fe.dirty
nose: 1.3.7
pip: 8.1.2
setuptools: 21.0.0
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.0.dev0+3f3c371
IPython: 4.0.1
...

Reason

My guess is that some internal formatting function gets confused by a column index whose declared dtype does not match the values it holds. The faulty index itself, however, is created in .describe().

# Output from %debug df.describe()
# NDFrame.describe() in pandas/core/generic.py:
#
   4943             data = self
   4944         else:
   4945             data = self.select_dtypes(include=include, exclude=exclude)
   4946 
   4947         ldesc = [describe_1d(s, percentiles) for _, s in data.iteritems()]
   4948         # set a convenient order for rows
   4949         names = []
   4950         ldesc_indexes = sorted([x.index for x in ldesc], key=len)
   4951         for idxnames in ldesc_indexes:
   4952             for name in idxnames:
   4953                 if name not in names:
   4954                     names.append(name)
   4955 
   4956         d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
1> 4957         d.columns = self.columns._shallow_copy(values=d.columns.values)
   4958         d.columns.names = data.columns.names
   4959         return d

The call to _shallow_copy() on the marked line turns d.columns from an Int64Index into a plain Index that still reports dtype int64:

ipdb> p d.columns
Int64Index([0], dtype='int64')
ipdb> n
> /home/users/piotr/workspace/pandas-pijucha/pandas/core/generic.py(4958)describe()
1  4957         d.columns = self.columns._shallow_copy(values=d.columns.values)
-> 4958         d.columns.names = data.columns.names
   4959         return d
ipdb> p d.columns
Index([0], dtype='int64')

Possible solutions

Lines 4957-4958 exist to repair what pd.concat does to the columns: they try to carry the column structure over from self to d.
I think a simpler solution is to replace these lines with:

d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
d.columns = data.columns
return d

or

d = pd.DataFrame(pd.concat(ldesc, axis=1), index=pd.Index(names), columns=data.columns)
return d

Here data is a sub-frame of self, so it retains the same column structure.
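
The first variant can be sketched as a standalone function that runs on current pandas. This is an illustration under assumptions, not the actual patch: `describe_numeric_sketch` is a hypothetical name, `Series.describe()` stands in for the internal `describe_1d()`, and `reindex` replaces `join_axes`, which newer pandas releases no longer accept in `pd.concat`.

```python
import numpy as np
import pandas as pd


def describe_numeric_sketch(frame):
    """Illustrative stand-in for the numeric path of describe().

    Assumes the frame has at least one numeric column.
    """
    data = frame.select_dtypes(include=[np.number])
    # Series.describe() plays the role of describe_1d() here.
    ldesc = [s.describe() for _, s in data.items()]

    # Same row-ordering idea as in the listing: shorter summaries first.
    names = []
    for idxnames in sorted((x.index for x in ldesc), key=len):
        for name in idxnames:
            if name not in names:
                names.append(name)

    d = pd.concat([x.reindex(names) for x in ldesc], axis=1)
    # Proposed fix: take the column index straight from data.
    d.columns = data.columns
    return data, d


df = pd.DataFrame({'A': list("BCDE"), 0: [1, 2, 3, 4]})
data, d = describe_numeric_sketch(df)
print(d.columns.dtype == data.columns.dtype)
```

Because the result reuses data.columns directly, its dtype and any level names match the selected sub-frame by construction, with no need to rebuild the index from raw values.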

pd.concat has some parameters that help pass a hierarchical index through, but on its own it cannot preserve a categorical one.
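
A small sketch of the categorical case, assuming a recent pandas; the frame and variable names are made up for illustration. Concatenating the column Series rebuilds the column labels from the Series names, and reassigning the original index brings the categorical structure back.

```python
import pandas as pd

df_cat = pd.DataFrame([[1, 2], [3, 4]],
                      columns=pd.CategoricalIndex(['a', 'b']))

# Concatenating the column Series derives the new column labels from the
# Series names, so the categorical index is not carried over.
pieces = [s for _, s in df_cat.items()]
d = pd.concat(pieces, axis=1)
lost_categorical = not isinstance(d.columns, pd.CategoricalIndex)

# Assigning the original column index restores it.
d.columns = df_cat.columns
restored_categorical = isinstance(d.columns, pd.CategoricalIndex)

print(lost_categorical, restored_categorical)
```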

I'm going to submit a pull request with this fix, together with some others related to describe(). I hope I haven't overlooked anything obvious; if I have, comments are very welcome.
