Description
I came across unexpected behavior when performing a left join on a specified column (DataFrame.join with on=) when the join keys are time zone-aware.
import pandas as pd

df1 = pd.DataFrame({
    'date': pd.date_range(start='2018-01-01', periods=5, tz='America/Chicago'),
    'vals': list('abcde')}
)
df2 = pd.DataFrame({
    'date': pd.date_range(start='2018-01-03', periods=5, tz='America/Chicago'),
    'vals_2': list('tuvwx')}
)
# Left join of the 'date' column of df1 against the 'date' index of df2
df1.join(df2.set_index('date'), on='date')
date vals vals_2
0 2018-01-01 00:00:00-06:00 a NaN
1 2018-01-02 00:00:00-06:00 b NaN
2 2018-01-03 00:00:00-06:00 c NaN
3 2018-01-04 00:00:00-06:00 d NaN
4 2018-01-05 00:00:00-06:00 e NaN
I was expecting:
date vals vals_2
0 2018-01-01 00:00:00-06:00 a NaN
1 2018-01-02 00:00:00-06:00 b NaN
2 2018-01-03 00:00:00-06:00 c t
3 2018-01-04 00:00:00-06:00 d u
4 2018-01-05 00:00:00-06:00 e v
In PR #25260, the test case was written to expect all NaN in vals_2. I don't understand why, considering how a merge on two columns or a join on two indices works:
df1.set_index('date').join(df2.set_index('date')).reset_index()
pd.merge(df1, df2, on='date', how='left')
Both yield the expected behavior.
Is there something I'm missing or is this inconsistency a bug?
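To make the comparison easier to rerun, here is a minimal sketch that runs all three variants on the same frames and counts how many rows found a match in df2. The names results and matched are only illustrative helpers, not part of pandas. The count printed for the column-on-index join will depend on the installed pandas version, so I have not annotated an expected output.

```python
import pandas as pd

df1 = pd.DataFrame({
    "date": pd.date_range(start="2018-01-01", periods=5, tz="America/Chicago"),
    "vals": list("abcde"),
})
df2 = pd.DataFrame({
    "date": pd.date_range(start="2018-01-03", periods=5, tz="America/Chicago"),
    "vals_2": list("tuvwx"),
})

# The three left joins compared in this report.
results = {
    "join(on='date')": df1.join(df2.set_index("date"), on="date"),
    "index-on-index join": (
        df1.set_index("date").join(df2.set_index("date")).reset_index()
    ),
    "merge(on='date')": pd.merge(df1, df2, on="date", how="left"),
}

# Count the rows in each result whose vals_2 is not NaN,
# i.e. the rows that found a matching date in df2.
matched = {
    name: int(res["vals_2"].notna().sum()) for name, res in results.items()
}
for name, count in matched.items():
    print(f"{name}: {count} matched rows")
```

If the three variants are consistent, all three counts should be equal; on my install the first one differs from the other two.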
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-48-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: 4.1.1
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: None
IPython: 7.2.0
sphinx: 1.8.4
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.6.1
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: None