Description
In the following examples there are three different merge behaviours when it comes to handling NaNs.
They are all based on pd.merge(..., how="left", ...)
The difference depends on:
1) Whether we are merging using an index or a column.
2) Whether the column keys we are merging on are the same value or not (i.e. if left_on = right_on).
Arguably, if we specify "left" as the merging criterion, the desired behaviour is to have NaNs in the columns coming from the right dataframe where there is no match between the left and right dataframes' key columns (see first merge in example below, 'd' and 'e' columns).
The problem is, if we are merging on left's index, the NaNs get filled with the index values from the left dataframe even if the names of the two columns don't match ('c' and 'd' in the example). We are thus led to believe there was a perfect match between the index of the left dataframe and the "key" column of the right dataframe ('d' here).
Gotchas:
-There is something puzzling going on with the new indices of the resulting dataframe (when merging on index).
-Type casting occurs when merging on index, perhaps suggesting NaNs are explicitly filled in a second step.
Proposed behaviour:
Maybe it is simply a matter of removing this NaN filling step.
Better yet, the "key" column in the merged dataframe should perhaps bear the name of left's index not of the "right_on" key (provided we used left's index to merge). I.e. in the second merge of the example, the 'd' column should be called 'c'.
This is really the source of the confusion when the two names are different. When they are the same the "no NaN" behaviour is arguably legitimate.
Also it might be worthwhile to cast the final column back to the original dtype if there are no NaNs.
Maybe this is not really an issue though, more something to be aware of. I would be interested in hearing any motivation behind this behaviour.
Looking forward to reading your thoughts!
Code Sample:
df1 = pd.DataFrame(columns=['a','b','c'], data=[[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
df1_c_index = df1.set_index('c')
df2 = pd.DataFrame(columns=['d','e'], data=[[3,14],[6,15],[13,16]])
print 'df1', '\n', df1, '\n'
print 'df1_c_index', '\n', df1_c_index, '\n'
print 'df2', '\n', df2, '\n'
print "pd.merge(df1, df2, how='left', left_on='c', right_on='d')", '\n'
print pd.merge(df1, df2, how='left', left_on='c', right_on='d'), '\n'
print "pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='d')", '\n'
print pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='d'), '\n'
df2.rename(columns={'d':'c'}, inplace=True)
print 'df1', '\n', df1, '\n'
print 'df1_c_index', '\n', df1_c_index, '\n'
print 'df2', '\n', df2, '\n'
print "pd.merge(df1, df2, how='left', left_on='c', right_on='c')", '\n'
print pd.merge(df1, df2, how='left', left_on='c', right_on='c'), '\n'
print "pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='c')", '\n'
print pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='c'), '\n'
Output:
output of pd.show_versions()
:
INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 21.0.0
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: None
xarray: 0.7.2
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None