Skip to content

Partial Selection on MultiIndex: The need for empty slice support & dict indexing #4036

Closed
@dragoljub

Description

@dragoljub

related #4036, #4116
from SO with 0.14.0 multi-index slicers: http://stackoverflow.com/questions/24126542/pandas-multi-index-slices-for-level-names/24126676#24126676

Here's the example from there:

In [11]: midx = pd.MultiIndex.from_product([list(range(3)),['a','b','c'],pd.date_range('20130101',periods=3)],names=['numbers','letters','dates'])

In [12]: midx.names.index('letters')
Out[12]: 1

In [13]: midx.names.index('dates')
Out[13]: 2

Here's a complete example

In [18]: df = DataFrame(np.random.randn(len(midx),1),index=midx)

In [19]: df
Out[19]: 
                                   0
numbers letters dates               
0       a       2013-01-01  0.261092
                2013-01-02 -1.267770
                2013-01-03  0.008230
        b       2013-01-01 -1.515866
                2013-01-02  0.351942
                2013-01-03 -0.245463
        c       2013-01-01 -0.253103
                2013-01-02 -0.385411
                2013-01-03 -1.740821
1       a       2013-01-01 -0.108325
                2013-01-02 -0.212350
                2013-01-03  0.021097
        b       2013-01-01 -1.922214
                2013-01-02 -1.769003
                2013-01-03 -0.594216
        c       2013-01-01 -0.419775
                2013-01-02  1.511700
                2013-01-03  0.994332
2       a       2013-01-01 -0.020299
                2013-01-02 -0.749474
                2013-01-03 -1.478558
        b       2013-01-01 -1.357671
                2013-01-02  0.161185
                2013-01-03 -0.658246
        c       2013-01-01 -0.564796
                2013-01-02 -0.333106
                2013-01-03 -2.814611

This is your dict of level names -> slices

In [20]: slicers = { 'numbers' : slice(0,1), 'dates' : slice('20130102','20130103') }

This creates an indexer that is empty (selects everything)

In [21]: indexer = [ slice(None) ] * len(df.index.levels)

Add in your slicers

In [22]: for n, idx in slicers.items():
              indexer[df.index.names.index(n)] = idx

And select (this has to be a tuple, but was a list to start as we had to modify it)

In [23]: df.loc[tuple(indexer),:]
Out[23]: 
                                   0
numbers letters dates               
0       a       2013-01-02 -1.267770
                2013-01-03  0.008230
        b       2013-01-02  0.351942
                2013-01-03 -0.245463
        c       2013-01-02 -0.385411
                2013-01-03 -1.740821
1       a       2013-01-02 -0.212350
                2013-01-03  0.021097
        b       2013-01-02 -1.769003
                2013-01-03 -0.594216
        c       2013-01-02  1.511700
                2013-01-03  0.994332

I use hierarchical indices regularly with pandas DataFrames and Series objects. It is invaluable to be able to partially select subsets of rows based on a set of arbitrary index values, and retain the index information for subsequent groupby operations etc.

I am looking for an elegant way to pass an ordered tuple (with possibly empty slices) or an arbitrary dict of {index_level_name:value,...} pairs to select rows matching the passed index:value pairs. Note: I am aware that with Boolean indexing on data columns and nested np.logical_and() statements you can construct such a Boolean select index. I'm looking for an elegant solution using indexes & levels to avoid repeatedly using df.reset_index and building Boolean arrays. Also, df.xs() does not work in every situation (see below) and does not exist for Series with MultiIndex.

To explain this lets create a DataFrame with 5 index levels:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: print pd.__version__
0.10.0

In [4]: # Generate Test DataFrame
   ...: NUM_ROWS = 100000
   ...: 

In [5]: NUM_COLS = 10

In [6]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]

In [7]: index_cols = col_names[:5]

In [8]: # Set DataFrame to have 5 level Hierarchical Index.
   ...: # The dtype does not matter try str or np.int64 same results.
   ...: # Sort the index!
   ...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
   ...: 

In [9]: df = df.set_index(index_cols).sort_index()

In [10]: df
Out[10]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns:
A5    100000  non-null values
A6    100000  non-null values
A7    100000  non-null values
A8    100000  non-null values
A9    100000  non-null values
dtypes: int64(5)

In [11]: df.index.names
Out[11]: ['A0', 'A1', 'A2', 'A3', 'A4']

Now index on every level and we get back the rows we want :) I love that I get back the complete index too because it may be useful later.

In [12]: df.ix[(0,1,2,3,4)]
Out[12]:                 A5  A6  A7  A8  A9
A0 A1 A2 A3 A4                    
0  1  2  3  4    1   3   1   1   3
            4    4   3   4   2   4
            4    0   2   3   1   3
      ...
            4    1   1   3   4   3
            4    0   1   2   4   1

Now if we index on the first 4 levels we get back something different, a data frame with the first 4 index levels dropped. It would be nice to have the option to keep all index levels even though they are repetitive (like above).

In [13]: df.ix[(0,1,2,3)]
Out[13]: <class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 4
Data columns:
A5    144  non-null values
A6    144  non-null values
A7    144  non-null values
A8    144  non-null values
A9    144  non-null values
dtypes: int64(5)

Now comes the tricky part. What if I only want to index on the first and last 2 index levels, and want everything from the 3rd level? Empty slicing is not supported.

In [14]: df.ix[(0,1,:,3,4)]
  File "<ipython-input-14-7e44e59fa5b1>", line 1
    df.ix[(0,1,:,3,4)]
               ^
SyntaxError: invalid syntax

In [15]: df.ix[(0,1,slice(None),3,4)]
---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
<ipython-input-15-0e4517ae7fc5> in <module>()
----> 1 df.ix[(0,1,slice(None),3,4)]

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
     30                 pass
     31 
---> 32             return self._getitem_tuple(key)
     33         else:
     34             return self._getitem_axis(key, axis=0)

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_tuple(self, tup)
    212         for i, key in enumerate(tup):
    213             if i >= self.obj.ndim:
--> 214                 raise IndexingError('Too many indexers')
    215 
    216             if _is_null_slice(key):

IndexingError: Too many indexers

df.xs can somewhat help here but its useless for MultiIndex on a series. And it drops the indexed levels leaving you unsure to what fixed index levels you have drilled to. :(

In [16]: df.xs((0,2,3),level=df.index.names[::2])
Out[16]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 805 entries, (0, 0) to (4, 4)
Data columns:
A5    805  non-null values
A6    805  non-null values
A7    805  non-null values
A8    805  non-null values
A9    805  non-null values
dtypes: int64(5)

Interestingly df.xs() is not consistant, because you cannot explicitly index on every level giving it the list of all level names:

In [17]: df.xs((0,1,2,3,4), level=df.index.names)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-d0e373dfcd5f> in <module>()
----> 1 df.xs((0,1,2,3,4), level=df.index.names)

C:\Python27\lib\site-packages\pandas\core\frame.pyc in xs(self, key, axis, level, copy)
   2233         labels = self._get_axis(axis)
   2234         if level is not None:
-> 2235             loc, new_ax = labels.get_loc_level(key, level=level)
   2236 
   2237             if not copy and not isinstance(loc, slice):

C:\Python27\lib\site-packages\pandas\core\index.pyc in get_loc_level(self, key, level)
   2193 
   2194                 result = loc if result is None else result & loc
-> 2195             return result, _drop_levels(result, level)
   2196 
   2197         level = self._get_level_number(level)

C:\Python27\lib\site-packages\pandas\core\index.pyc in _drop_levels(indexer, levels)
   2177             levels = [self._get_level_number(i) for i in levels]
   2178             for i in sorted(levels, reverse=True):
-> 2179                 new_index = new_index.droplevel(i)
   2180             return new_index
   2181 

AttributeError: 'Int64Index' object has no attribute 'droplevel'

However df.xs without the level attribute on all index levels works as expected...

In [18]: df.xs((0,1,2,3,4))
Out[18]:                 A5  A6  A7  A8  A9
A0 A1 A2 A3 A4                    
0  1  2  3  4    1   3   1   1   3
            4    4   3   4   2   4
            4    0   2   3   1   3
           ...
            4    1   1   3   4   3
            4    0   1   2   4   1

Thoughts:
One (somewhat limiting) solution could be allowing df.ix[(0,1,3,:,4)] to take an empty slice for an index level and return the data frame indexed on only the the passed index levels that are known. Today this capability does not exist, although an ordered partial list of index levels works.

The next and more general approach could be to pass a dict of df.ix[{index_level:value}] pairs and return the rows where the specified index levels equal the passed values. Unspecified levels are not filtered down and we have the option to return all index levels.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions