Skip to content

merge on index vs merge on column - different NaN handling #13371

Open
@ialong

Description

@ialong

In the following examples there are three different merge behaviours when it comes to handling NaNs.
They are all based on pd.merge(..., how="left", ...)

The difference depends on:

1) Whether we are merging using an index or a column.
2) Whether the column keys we are merging on are the same value or not (i.e. if left_on = right_on).

Arguably, if we specify "left" as the merging criterion, the desired behaviour is to have NaNs in the columns coming from the right dataframe where there is no match between the left and right dataframes' key columns (see first merge in example below, 'd' and 'e' columns).
The problem is, if we are merging on left's index, the NaNs get filled with the index values from the left dataframe even if the names of the two columns don't match ('c' and 'd' in the example). We are thus led to believe there was a perfect match between the index of the left dataframe and the "key" column of the right dataframe ('d' here).

Gotchas:

-There is something puzzling going on with the new indices of the resulting dataframe (when merging on index).
-Type casting occurs when merging on index, perhaps suggesting NaNs are explicitly filled in a second step.

Proposed behaviour:

Maybe it is simply a matter of removing this NaN filling step.

Better yet, the "key" column in the merged dataframe should perhaps bear the name of left's index not of the "right_on" key (provided we used left's index to merge). I.e. in the second merge of the example, the 'd' column should be called 'c'.

This is really the source of the confusion when the two names are different. When they are the same the "no NaN" behaviour is arguably legitimate.

Also it might be worthwhile to cast the final column back to the original dtype if there are no NaNs.

Maybe this is not really an issue though, more something to be aware of. I would be interested in hearing any motivation behind this behaviour.

Looking forward to reading your thoughts!

Code Sample:

df1 = pd.DataFrame(columns=['a','b','c'], data=[[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
df1_c_index = df1.set_index('c')
df2 = pd.DataFrame(columns=['d','e'], data=[[3,14],[6,15],[13,16]])

print 'df1', '\n', df1, '\n'
print 'df1_c_index', '\n', df1_c_index, '\n'
print 'df2', '\n', df2, '\n'

print "pd.merge(df1, df2, how='left', left_on='c', right_on='d')", '\n'
print pd.merge(df1, df2, how='left', left_on='c', right_on='d'), '\n'

print "pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='d')", '\n'
print pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='d'), '\n'

df2.rename(columns={'d':'c'}, inplace=True)

print 'df1', '\n', df1, '\n'
print 'df1_c_index', '\n', df1_c_index, '\n'
print 'df2', '\n', df2, '\n'

print "pd.merge(df1, df2, how='left', left_on='c', right_on='c')", '\n'
print pd.merge(df1, df2, how='left', left_on='c', right_on='c'), '\n'

print "pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='c')", '\n'
print pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='c'), '\n'

Output:

screen shot 2016-06-05 at 21 06 48

screen shot 2016-06-05 at 21 07 05

output of pd.show_versions():

INSTALLED VERSIONS


commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 21.0.0
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: None
xarray: 0.7.2
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    EnhancementNeeds DiscussionRequires discussion from core team before further actionReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions