Skip to content

BUG: New feature allowing merging on combination of columns and index levels drops levels of index #20452

Closed
@Dr-Irv

Description

@Dr-Irv

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.23.0.dev0+657.g01882ba5b'

In [3]: df1 =  pd.DataFrame({'v1' : range(12)}, index=pd.MultiIndex.from_product([list('abc'),list('xy'),[1,2]], names=['abc','xy','num']))
   ...: df1
   ...:
Out[3]:
            v1
abc xy num
a   x  1     0
       2     1
    y  1     2
       2     3
b   x  1     4
       2     5
    y  1     6
       2     7
c   x  1     8
       2     9
    y  1    10
       2    11

In [4]: df2 = pd.DataFrame({'v2': [100*i for i in range(1,7)]}, index=pd.MultiIndex.from_product([list('abc'), list('xy')],names=['abc','xy']))

In [5]: df2
Out[5]:
         v2
abc xy
a   x   100
    y   200
b   x   300
    y   400
c   x   500
    y   600

In [6]: df1.merge(df2, on=['abc','xy'])  # 'num' disappears
Out[6]:
        v1   v2
abc xy
a   x    0  100
    x    1  100
    y    2  200
    y    3  200
b   x    4  300
    x    5  300
    y    6  400
    y    7  400
c   x    8  500
    x    9  500
    y   10  600
    y   11  600

In [7]: df1.reset_index().merge(df2, on=['abc','xy']) # This preserves 'num'
Out[7]:
   abc xy  num  v1   v2
0    a  x    1   0  100
1    a  x    2   1  100
2    a  y    1   2  200
3    a  y    2   3  200
4    b  x    1   4  300
5    b  x    2   5  300
6    b  y    1   6  400
7    b  y    2   7  400
8    c  x    1   8  500
9    c  x    2   9  500
10   c  y    1  10  600
11   c  y    2  11  600

In [8]: df1.merge(df2, on='xy')  # 'abc' and 'num' disappear
Out[8]:
    v1   v2
xy
x    0  100
x    0  300
x    0  500
x    1  100
x    1  300
x    1  500
x    4  100
x    4  300
x    4  500
x    5  100
x    5  300
x    5  500
x    8  100
x    8  300
x    8  500
x    9  100
x    9  300
x    9  500
y    2  200
y    2  400
y    2  600
y    3  200
y    3  400
y    3  600
y    6  200
y    6  400
y    6  600
y    7  200
y    7  400
y    7  600
y   10  200
y   10  400
y   10  600
y   11  200
y   11  400
y   11  600

Problem description

It seems that the new feature implemented in #17484 that allows merging on a combination of columns and index levels can drop index levels, which is really non-intuitive. In the first example, the index level named "num" gets dropped, while in the last example, both "abc" and "xy" are dropped.

If this is the desired behavior, then it needs to be carefully documented.

N.B. There is also an error in the docs of merging.rst that says this feature was introduced in v.0.22, but it will be introduced in v0.23

I'm guessing @jmmease will need to look at this.

Expected Output

In [6]: df1.merge(df2, on=['abc','xy'])
Out[6]:
            v1   v2
abc xy num
a   x  1     0  100
       2     1  100
    y  1     2  200
       2     3  200
b   x  1     4  300
       2     5  300
    y  1     6  400
       2     7  400
c   x  1     8  500
       2     9  500
    y  1    10  600
       2    11  600

In [8]: df1.merge(df2, on='xy')
Out[8]:
   abc_x  num  v1 abc_y   v2
xy
x      a    1   0     a  100
x      a    1   0     b  300
x      a    1   0     c  500
x      a    2   1     a  100
x      a    2   1     b  300
x      a    2   1     c  500
x      b    1   4     a  100
x      b    1   4     b  300
x      b    1   4     c  500
x      b    2   5     a  100
x      b    2   5     b  300
x      b    2   5     c  500
x      c    1   8     a  100
x      c    1   8     b  300
x      c    1   8     c  500
x      c    2   9     a  100
x      c    2   9     b  300
x      c    2   9     c  500
y      a    1   2     a  200
y      a    1   2     b  400
y      a    1   2     c  600
y      a    2   3     a  200
y      a    2   3     b  400
y      a    2   3     c  600
y      b    1   6     a  200
y      b    1   6     b  400
y      b    1   6     c  600
y      b    2   7     a  200
y      b    2   7     b  400
y      b    2   7     c  600
y      c    1  10     a  200
y      c    1  10     b  400
y      c    1  10     c  600
y      c    2  11     a  200
y      c    2  11     b  400
y      c    2  11     c  600

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0.dev0+657.g01882ba5b
pytest: 3.4.0
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.25.1
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    API DesignReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions