from mailing list
I have a sort of philosophical question about the use of
indexes (especially MultiIndexes) versus just keeping data in
columns. When using groupby, you tend to get a lot of results
with MultiIndexes, and the indexes are convenient for simple
accessing of items. However, I've found that index objects lack
key features of ordinary columns. I often find myself swapping a
particular dimension back and forth from index to column, either
because I need one or the other, or because pandas gives me one when
I want the other. What I'm wondering is if I'm using pandas in a
nonidiomatic way, or if there's some way to get around these
difficulties I'm having, or what.
The three main things I currently find irritating about Index
objects are:
A) Extracting and using the level values is awkward. When I have
a column, I can get the values just with df.SomeCol or
df['SomeCol']. For indexes, I have to do
df.index.get_level_values('IndexLevel'), and even then I just get
another Index instance, which I may have to convert to a series
for other things, because...
B) Indexes do not support the convenient operations available on
Series, in particular Series.map. This means that, although I
can easily do df1.ix[df.SomeCol.map(someThingElse)], I cannot do
this when SomeCol is an index instead of just a column in the
data. I have to extract the index level values as above and then
convert to a series before I can map them.
C) There doesn't appear to be a way to group a DataFrame by a
combination of columns and index levels. groupby allows a "by"
argument for columns and a "level" argument for index levels, but
using both gives an error. Even if I could do this, it's not
clear how I would specify the order of the grouping.
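To make A) and B) concrete, here is a minimal sketch of the workaround I mean. The data and the lookup dict are made up for illustration, the names IndexLevel, Other and SomeCol are placeholders, and I use .loc where I would otherwise have written .ix:

```python
import pandas as pd

# Made-up example data; "IndexLevel", "Other" and "SomeCol" are
# placeholder names.
df = pd.DataFrame(
    {"SomeCol": [10, 20, 30, 40]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
        names=["IndexLevel", "Other"],
    ),
)

# A) get_level_values returns another Index, not a Series.
level_values = df.index.get_level_values("IndexLevel")

# Wrap the raw values in a Series that shares df's index, so that
# Series methods such as map can be used.
level_series = pd.Series(level_values.to_numpy(), index=df.index)

# B) Map the level values through a lookup (hypothetical), then
# select rows based on the mapped result.
lookup = {"a": "group1", "b": "group2"}
mapped = level_series.map(lookup)
selected = df.loc[(mapped == "group1").to_numpy()]
```

That is three steps (extract, wrap, map) for something that is a single .map call when SomeCol is an ordinary column.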
The solutions that come to mind for these problems are:
A) Give MultiIndex objects a simple means of accessing the level
values as a Series. Something like df.index.levels.Level or
df.index.levels['Level']. Basically make MultiIndexes indexable in
somewhat the same way that DataFrames already are.
B) Give Indexes a map-like operator, and maybe some of the other
useful stuff from Series.
C) Provide some way of grouping using both columns and index
levels. Maybe some sort of "IndexGroup" class that would wrap a
level name, so you could do groupby(["Column", IG("IndexLevel"),
"OtherColumn"]) to insert an index level in the grouping order.
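For comparison, this is roughly the back-and-forth that C) currently costs me. It is only a sketch with made-up data: the index level is first moved into a column with reset_index, and then everything is grouped by column name, with the list order giving the grouping order. The IG wrapper above is purely hypothetical and is not used here.

```python
import pandas as pd

# Made-up example data; "IndexLevel", "Column", "OtherColumn" and
# "Value" are placeholder names.
df = pd.DataFrame(
    {
        "Column": ["x", "x", "y", "y"],
        "OtherColumn": ["p", "p", "q", "q"],
        "Value": [1, 2, 3, 4],
    },
    index=pd.Index(["a", "a", "a", "b"], name="IndexLevel"),
)

# Turn the index level into an ordinary column so it can be named
# in the "by" list alongside the real columns.
flat = df.reset_index(level="IndexLevel")

# The position of "IndexLevel" in the list decides where it falls
# in the grouping order.
totals = flat.groupby(["Column", "IndexLevel", "OtherColumn"])["Value"].sum()
```

The result is indexed by all three keys in the order given, but IndexLevel had to stop being an index level to get there.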
Pandas provides a lot of functionality for slicing and dicing
data in different ways, but I feel like sometimes I'm forced
to slice it and dice it back and forth by converting indexes to
columns and vice versa instead of being able to directly access
what I want. I'd be interested to hear how/whether other people
deal with these issues. Are there ways of doing these things
that I'm missing?