cc @lodagro
MultiIndex seems to always store the level data as dtype('object').
When using DataFrame.delevel(), the columns added from the index also have dtype('object').
This prevents using DataFrame.delevel().corr() to look at the correlation between the original DataFrame columns and the index level values. Does anyone have an idea for working around this?
See example below:
In [1]: import pandas
In [2]: import numpy as np
In [3]: import itertools
In [4]: tuples = list(itertools.product(['foo', 'bar'], [10, 20], [1.0, 1.1]))
In [5]: index = pandas.MultiIndex.from_tuples(tuples, names=['prm0', 'prm1', 'prm2'])
In [6]: df = pandas.DataFrame(np.random.randn(8,3), columns=['A', 'B', 'C'], index=index)
In [7]: df
Out[7]:
                      A        B       C
prm0 prm1 prm2
foo  10   1.0    0.2074   0.3425  -1.295
          1.1    0.3194   0.8114   2.133
foo  20   1.0   -0.1798  -1.162    0.5774
          1.1   -0.4635   1.436    1.419
bar  10   1.0   -1.013    0.7605  -1.184
          1.1   -0.4716   0.6983   0.5209
bar  20   1.0   -0.87    -0.3788   0.272
          1.1    1.018   -0.4496   1.132
In [8]: df.corr()
Out[8]:
        A        B        C
A       1  -0.2445   0.3852
B -0.2445        1  0.08211
C  0.3852  0.08211        1
In [9]: df.delevel().corr()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
2535 cols = self.columns
2536 mat = self.as_matrix(cols).T
-> 2537 baseCov = np.cov(mat)
2538
2539 sigma = np.sqrt(np.diag(baseCov))
.../python2.7/site-packages/numpy/lib/function_base.pyc in cov(m, y, rowvar, bias, ddof)
1920 raise ValueError("ddof must be integer")
1921
-> 1922 X = array(m, ndmin=2, dtype=float)
1923 if X.shape[0] == 1:
1924 rowvar = 1
ValueError: setting an array element with a sequence.
My guess is that this exception is related to the fact that corr cannot work with strings.
So let's try it without the strings.
In [10]: df.delevel()[['prm1', 'prm2', 'A', 'B', 'C']]
Out[10]:
   prm1  prm2        A        B       C
0    10   1     0.2074   0.3425  -1.295
1    10   1.1   0.3194   0.8114   2.133
2    20   1    -0.1798  -1.162    0.5774
3    20   1.1  -0.4635   1.436    1.419
4    10   1    -1.013    0.7605  -1.184
5    10   1.1  -0.4716   0.6983   0.5209
6    20   1    -0.87    -0.3788   0.272
7    20   1.1   1.018   -0.4496   1.132
In [11]: df.delevel()[['prm1', 'prm2', 'A', 'B', 'C']].corr()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
[...]
TypeError: function not supported for these types, and can't coerce safely to supported types
In [12]: df.delevel()['prm1'].values.dtype
Out[12]: dtype('object')
In [13]: df.delevel()['prm1']
Out[13]:
0 10
1 10
2 20
3 20
4 10
5 10
6 20
7 20
Name: prm1
In [14]: index.levels
Out[14]:
[Index([bar, foo], dtype=object),
Index([10, 20], dtype=object),
Index([1.0, 1.1], dtype=object)]
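One possible workaround, sketched here under a few assumptions, is to cast the object-dtype level columns back to numeric types before calling corr(). The sketch assumes a current pandas release, where reset_index() does the job that delevel() does in the session above and where pandas.to_numeric is available. Recent versions may already infer numeric dtypes for the levels, so the object dtype is forced by hand below to mimic the behaviour reported here. The names flat and numeric are only illustrative.

```python
import itertools

import numpy as np
import pandas as pd

# Rebuild the frame from the report.
tuples = list(itertools.product(["foo", "bar"], [10, 20], [1.0, 1.1]))
index = pd.MultiIndex.from_tuples(tuples, names=["prm0", "prm1", "prm2"])
df = pd.DataFrame(np.random.randn(8, 3), columns=["A", "B", "C"], index=index)

# Assumption: reset_index() stands in for the delevel() call used in the report.
flat = df.reset_index()

# Assumption: force object dtype on the level columns to mimic the reported
# behaviour, since a recent pandas may already give them numeric dtypes.
for name in ["prm1", "prm2"]:
    flat[name] = flat[name].astype(object)

# Drop the string level, then cast each remaining column to a numeric dtype.
numeric = flat.drop(columns=["prm0"]).apply(pd.to_numeric)

# corr() now sees only numeric columns.
corr = numeric.corr()
print(numeric.dtypes)
print(corr)
```

The string level prm0 is dropped rather than converted, since a correlation with arbitrary labels has no obvious meaning; encoding it as integer codes would be a separate modelling decision.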