
.loc on Hierarchical Index with single-valued index level can drop that index level in place #13842

Closed
@mborysow


Small Example

In [13]: import pandas as pd
    ...:
    ...: df1 = pd.DataFrame(data=dict(A=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    ...:                              B=[1, 1, 2, 2, 2, 3, 1, 1, 1, 2, 3, 4],
    ...:                              C=[1, 2, 1, 2, 3, 2, 1, 2, 3, 2, 1, 4],
    ...:                              X=[1, 5, 2, 3, 8, 3, 3, 3, 1, 2, 1, 4],
    ...:                              Y=[7, 3, 4, 1, 3, 9, 9, 3, 1, 9, 3, 7]))
    ...: df1 = df1.set_index(['A', 'B', 'C'])
    ...:

In [14]: df1.loc[pd.IndexSlice[1, :, :]]
Out[14]:
     X  Y
B C
1 1  1  7
  2  5  3
2 1  2  4
  2  3  1
  3  8  3
..  .. ..
1 2  3  3
  3  1  1
2 2  2  9
3 1  1  3
4 4  4  7

[12 rows x 2 columns]

In [15]: df1
Out[15]:
     X  Y
B C
1 1  1  7
  2  5  3
2 1  2  4
  2  3  1
  3  8  3
..  .. ..
1 2  3  3
  3  1  1
2 2  2  9
3 1  1  3
4 4  4  7

[12 rows x 2 columns]

Expected Output

The output of the slice, Out[14], is correct, but df1 should not be modified in place. So the expected Out[15] is the original df1:

In [17]: df1
Out[17]:
       X  Y
A B C
1 1 1  1  7
    2  5  3
  2 1  2  4
    2  3  1
    3  8  3
...   .. ..
  1 2  3  3
    3  1  1
  2 2  2  9
  3 1  1  3
  4 4  4  7

[12 rows x 2 columns]

I'm still not good at submitting issues here with code and print out, so I appreciate your patience. Also, thank you guys for making pandas as amazing as it is!!

Anyhow...

I have dataframes that sometimes have up to 5 levels on their MultiIndex. It's not uncommon for me to want to grab a subset containing only one value on a certain level. If one level of that index has only one value, then .loc can drop that level in place. I'd say this is highly undesirable.

First the normal behavior. Here's my input:

       X  Y
A B C      
1 1 1  1  7
    2  5  3
  2 1  2  4
    2  3  1
    3  8  3
  3 2  3  9
2 1 1  3  9
    2  3  3
    3  1  1
  2 2  2  9
  3 1  1  3
  4 4  4  7

When I have a multi-indexed dataframe, and I do:
df.loc[1]
I get:

     X  Y
B C      
1 1  1  7
  2  5  3
2 1  2  4
  2  3  1
  3  8  3
3 2  3  9

I personally expect it to return the original multi-index, with the first level restricted to that one value. Sadly, it drops the level entirely. (I think this is terrible: if you plan on resetting the index or concatenating later, you've just unwittingly lost information.)
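
For readers who want the level kept without spelling out every level, one option is `DataFrame.xs` with `drop_level=False`. The snippet below is a minimal sketch, not part of the original report: it rebuilds a frame with the same data as the two-valued example above and compares `df.loc[1]` with `xs`. The variable names `dropped` and `kept` are only illustrative.

```python
import pandas as pd

# Same data as the two-valued example in this report: level A takes the values 1 and 2.
df = pd.DataFrame(
    dict(
        A=[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
        B=[1, 1, 2, 2, 2, 3, 1, 1, 1, 2, 3, 4],
        C=[1, 2, 1, 2, 3, 2, 1, 2, 3, 2, 1, 4],
        X=[1, 5, 2, 3, 8, 3, 3, 3, 1, 2, 1, 4],
        Y=[7, 3, 4, 1, 3, 9, 9, 3, 1, 9, 3, 7],
    )
).set_index(["A", "B", "C"]).sort_index()

# Selecting by a scalar on the first level drops that level from the result.
dropped = df.loc[1]

# xs with drop_level=False selects the same rows but keeps level A.
kept = df.xs(1, level="A", drop_level=False)

print(list(dropped.index.names))
print(list(kept.index.names))
```

Because `kept` still carries level A, a later `reset_index()` or `pd.concat` does not lose the information about which A value the rows came from.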

Anyhow, I now recognize that you need to provide an indexer for every level. For a three-level index, the behavior I expected can be achieved with:
df.loc[pd.IndexSlice[1, :, :]]

       X  Y
A B C      
1 1 1  1  7
    2  5  3
  2 1  2  4
    2  3  1
    3  8  3
  3 2  3  9

Here's the rub... If the level that I indexed above has more than one unique value, this works fine. If it has only one, then once again that level gets dropped, but worse, the index is modified in place during the .loc operation.
Here's the dataframe showing the bad behavior:

       X  Y
A B C      
1 1 1  1  7
    1  3  9
    2  5  3
    2  3  3
    3  1  1
  2 1  2  4
    2  3  1
    2  2  9
    3  8  3
  3 1  1  3
    2  3  9
  4 4  4  7

df.loc[pd.IndexSlice[1, :, :]] gives:

     X  Y
B C      
1 1  1  7
  1  3  9
  2  5  3
  2  3  3
  3  1  1
2 1  2  4
  2  3  1
  2  2  9
  3  8  3
3 1  1  3
  2  3  9
4 4  4  7

Same syntax as the other case, but it dropped index level A. Worse, df itself has been changed to match:
print(df)

     X  Y
B C      
1 1  1  7
  1  3  9
  2  5  3
  2  3  3
  3  1  1
2 1  2  4
  2  3  1
  2  2  9
  3  8  3
3 1  1  3
  2  3  9
4 4  4  7

If I modify the syntax slightly, i.e., df.loc[pd.IndexSlice[1, :, :], :] (on the original, unmodified frame), I get the expected result:

      X  Y
A B C      
1 1 1  1  7
    1  3  9
    2  5  3
    2  3  3
    3  1  1
  2 1  2  4
    2  3  1
    2  2  9
    3  8  3
  3 1  1  3
    2  3  9
  4 4  4  7
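
The form with an explicit column indexer can be wrapped in a small check that the source frame is left alone. This is a sketch under assumptions: it rebuilds the single-valued frame from this report, and `index_before` and `subset` are illustrative names rather than anything from pandas. It only exercises the `df.loc[rows, :]` form, which is the one reported to behave correctly.

```python
import pandas as pd

# Same data as the problematic frame in this report: level A holds only the value 1.
df1 = pd.DataFrame(
    dict(
        A=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        B=[1, 1, 2, 2, 2, 3, 1, 1, 1, 2, 3, 4],
        C=[1, 2, 1, 2, 3, 2, 1, 2, 3, 2, 1, 4],
        X=[1, 5, 2, 3, 8, 3, 3, 3, 1, 2, 1, 4],
        Y=[7, 3, 4, 1, 3, 9, 9, 3, 1, 9, 3, 7],
    )
).set_index(["A", "B", "C"]).sort_index()

# Snapshot the index so the frame can be compared after the selection.
index_before = df1.index.copy()

# Index rows and columns explicitly: a row tuple for the MultiIndex, then all columns.
subset = df1.loc[pd.IndexSlice[1, :, :], :]

# The result should keep all three levels, and df1 should be unchanged.
print(list(subset.index.names))
print(df1.index.equals(index_before))
```

Spelling out both axes removes the ambiguity about whether the tuple addresses index levels or the row/column axes.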

I've tried to provide a code sample with comments that demonstrates the problem.

Code Sample, a copy-pastable example if possible

import pandas as pd

df1 = pd.DataFrame(data=dict(A=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                             B=[1, 1, 2, 2, 2, 3, 1, 1, 1, 2, 3, 4],
                             C=[1, 2, 1, 2, 3, 2, 1, 2, 3, 2, 1, 4],
                             X=[1, 5, 2, 3, 8, 3, 3, 3, 1, 2, 1, 4],
                             Y=[7, 3, 4, 1, 3, 9, 9, 3, 1, 9, 3, 7]))
df2 = df1.copy(deep=True)
df2['A'] = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]

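# sort_index() sorts the MultiIndex lexicographically; it stands in for the deprecated sortlevel()
df1 = df1.set_index(['A', 'B', 'C']).sort_index()
df2 = df2.set_index(['A', 'B', 'C']).sort_index()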
df1_copy = df1.copy()

print("Here's df2, with more than 1 unique value for the index A:")
print(df2)

# already annoyed by this, I don't think this is how it should work, but I understand it
print("\nHere's what df2.loc[1] returns")
print(df2.loc[1])

# understand how to get around it at least
print("\nCan get around this annoyance by df2.loc[pd.IndexSlice[1, :, :]]")
print(df2.loc[pd.IndexSlice[1, :, :]])

# BUT!  If it's the only one...
print("\nHere's df1, with only a single value for the index A")
print(df1)

print("\nNow let's do the same thing we did for df2, namely display df1.loc[pidx[1, :, :]]")
print(df1.loc[pd.IndexSlice[1, :, :]])

# and holy crap it's an inplace operation!
print("\nDamnit.. it dropped by index again! And... ruh roh!  It has a side effect!  Here's df1 again:")
print(df1)

print("\nDoing df1.loc[pidx[1, :, :], :] (using the original df1) works as expected.")
print(df1_copy.loc[pd.IndexSlice[1, :, :], :])

Here's what I get from running the code

Here's df2, with more than 1 unique value for the index A:
       X  Y
A B C      
1 1 1  1  7
    2  5  3
  2 1  2  4
    2  3  1
    3  8  3
  3 2  3  9
2 1 1  3  9
    2  3  3
    3  1  1
  2 2  2  9
  3 1  1  3
  4 4  4  7

Here's what df2.loc[1] returns
     X  Y
B C      
1 1  1  7
  2  5  3
2 1  2  4
  2  3  1
  3  8  3
3 2  3  9

Can get around this annoyance by df2.loc[pd.IndexSlice[1, :, :]]
       X  Y
A B C      
1 1 1  1  7
    2  5  3
  2 1  2  4
    2  3  1
    3  8  3
  3 2  3  9

Here's df1, with only a single value for the index A
       X  Y
A B C      
1 1 1  1  7
    1  3  9
    2  5  3
    2  3  3
    3  1  1
  2 1  2  4
    2  3  1
    2  2  9
    3  8  3
  3 1  1  3
    2  3  9
  4 4  4  7

Now let's do the same thing we did for df2, namely display df1.loc[pidx[1, :, :]]
     X  Y
B C      
1 1  1  7
  1  3  9
  2  5  3
  2  3  3
  3  1  1
2 1  2  4
  2  3  1
  2  2  9
  3  8  3
3 1  1  3
  2  3  9
4 4  4  7

Damnit.. it dropped by index again! And... ruh roh!  It has a side effect!  Here's df1 again:
     X  Y
B C      
1 1  1  7
  1  3  9
  2  5  3
  2  3  3
  3  1  1
2 1  2  4
  2  3  1
  2  2  9
  3  8  3
3 1  1  3
  2  3  9
4 4  4  7

Doing df1.loc[pidx[1, :, :], :] (using the original df1) works as expected.
       X  Y
A B C      
1 1 1  1  7
    1  3  9
    2  5  3
    2  3  3
    3  1  1
  2 1  2  4
    2  3  1
    2  2  9
    3  8  3
  3 1  1  3
    2  3  9
  4 4  4  7

Expected Output

What I expect from all of the examples above is:

       X  Y
A B C      
1 1 1  1  7
    2  5  3
  2 1  2  4
    2  3  1
    3  8  3
  3 2  3  9

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 4.6.3-300.fc24.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 8.0.3
setuptools: 20.1.1
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.5.1
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.10
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Labels: Bug, Indexing (related to indexing on series/frames, not to indexes themselves), MultiIndex
