DataFrame.delevel infer dtypes better #440

Closed
@wesm

cc @lodagro


MultiIndex seems to always store its level data as dtype('object').
When using DataFrame.delevel(), the columns added from the index also have dtype('object').
This prevents using DataFrame.delevel().corr() to look at the correlation between the original DataFrame columns and the index level values. Does anyone have an idea for a workaround?

See example below:

In [1]: import pandas

In [2]: import numpy as np

In [3]: import itertools

In [4]: tuples = list(itertools.product(['foo', 'bar'], [10, 20], [1.0, 1.1]))

In [5]: index = pandas.MultiIndex.from_tuples(tuples, names=['prm0', 'prm1', 'prm2'])

In [6]: df = pandas.DataFrame(np.random.randn(8,3), columns=['A', 'B', 'C'], index=index)

In [7]: df
Out[7]:
                A       B       C
prm0 prm1 prm2
foo  10   1.0   0.2074  0.3425 -1.295
          1.1   0.3194  0.8114  2.133
foo  20   1.0  -0.1798 -1.162   0.5774
          1.1  -0.4635  1.436   1.419
bar  10   1.0  -1.013   0.7605 -1.184
          1.1  -0.4716  0.6983  0.5209
bar  20   1.0  -0.87   -0.3788  0.272
          1.1   1.018  -0.4496  1.132

In [8]: df.corr()
Out[8]:
   A       B        C
A  1      -0.2445   0.3852
B -0.2445  1        0.08211
C  0.3852  0.08211  1

In [9]: df.delevel().corr()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
   2535         cols = self.columns
   2536         mat = self.as_matrix(cols).T
-> 2537         baseCov = np.cov(mat)
   2538
   2539         sigma = np.sqrt(np.diag(baseCov))

.../python2.7/site-packages/numpy/lib/function_base.pyc in cov(m, y, rowvar, bias, ddof)
   1920         raise ValueError("ddof must be integer")
   1921
-> 1922     X = array(m, ndmin=2, dtype=float)
   1923     if X.shape[0] == 1:
   1924         rowvar = 1

ValueError: setting an array element with a sequence.

My guess is that this exception occurs because corr cannot work with strings.
So let's try it without the string column.

In [10]: df.delevel()[['prm1', 'prm2', 'A', 'B', 'C']]
Out[10]:
   prm1  prm2  A       B       C
0  10    1     0.2074  0.3425 -1.295
1  10    1.1   0.3194  0.8114  2.133
2  20    1    -0.1798 -1.162   0.5774
3  20    1.1  -0.4635  1.436   1.419
4  10    1    -1.013   0.7605 -1.184
5  10    1.1  -0.4716  0.6983  0.5209
6  20    1    -0.87   -0.3788  0.272
7  20    1.1   1.018  -0.4496  1.132

In [11]: df.delevel()[['prm1', 'prm2', 'A', 'B', 'C']].corr()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[...]
TypeError: function not supported for these types, and can't coerce safely to supported types

In [12]: df.delevel()['prm1'].values.dtype
Out[12]: dtype('object')

In [13]: df.delevel()['prm1']
Out[13]:
0    10
1    10
2    20
3    20
4    10
5    10
6    20
7    20
Name: prm1

In [14]: index.levels
Out[14]:
[Index([bar, foo], dtype=object),
 Index([10, 20], dtype=object),
 Index([1.0, 1.1], dtype=object)]
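
As a possible workaround for the question above, the object columns can be coerced back to numeric dtypes before calling corr(). The following is only a sketch under some assumptions: it uses reset_index (the name delevel goes by in later pandas versions) and pandas.to_numeric, neither of which appears in the session above, and the helper name coerce_numeric_columns is hypothetical. Because newer pandas versions may already infer the level dtypes, the example casts prm1 and prm2 to object explicitly to mimic the reported behaviour.

```python
import itertools

import numpy as np
import pandas as pd


def coerce_numeric_columns(frame):
    """Return a copy in which object columns holding only numbers become numeric.

    Hypothetical helper, not part of pandas.
    """
    out = frame.copy()
    for name in out.columns:
        if out[name].dtype == object:
            try:
                out[name] = pd.to_numeric(out[name])
            except (ValueError, TypeError):
                # Leave genuinely non-numeric columns (e.g. 'foo'/'bar') untouched.
                pass
    return out


tuples = list(itertools.product(['foo', 'bar'], [10, 20], [1.0, 1.1]))
index = pd.MultiIndex.from_tuples(tuples, names=['prm0', 'prm1', 'prm2'])
df = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'], index=index)

flat = df.reset_index()
# Assumption: force object dtype to reproduce the situation described in this issue.
flat['prm1'] = flat['prm1'].astype(object)
flat['prm2'] = flat['prm2'].astype(object)

fixed = coerce_numeric_columns(flat)
# Drop the string level and correlate the numeric levels with A, B and C.
corr = fixed[['prm1', 'prm2', 'A', 'B', 'C']].corr()
print(fixed.dtypes)
print(corr)
```

The string level prm0 is left out of the correlation on purpose, since it has no numeric interpretation.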
