DataFrame.merge error with empty frame and multiple datetime64[ns, UTC] columns #25014

Closed
@josham

Code Sample, a copy-pastable example if possible

import pandas as pd

x = pd.DataFrame([
    [pd.Timestamp('2018-01-01', tz='UTC'), 4.0, pd.Timestamp('2019-01-01', tz='UTC')]
], columns=['date', 'value', 'date2'])
y = x[:0]  # empty frame with the same columns and dtypes as x
y.merge(x, on='date')
Traceback (most recent call last):
  File "/scratch.py", line 8, in <module>
    z = y.merge(x, on='date')
  File "/python/lib/python3.6/site-packages/pandas/core/frame.py", line 6877, in merge
    copy=copy, indicator=indicator, validate=validate)
  File "/python/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 48, in merge
    return op.get_result()
  File "/python/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 560, in get_result
    concat_axis=0, copy=self.copy)
  File "/python/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 2061, in concatenate_block_managers
    concatenate_join_units(join_units, concat_axis, copy=copy),
  File "/python/lib/python3.6/site-packages/pandas/core/internals/concat.py", line 240, in concatenate_join_units
    for ju in join_units]
  File "/python/lib/python3.6/site-packages/pandas/core/internals/concat.py", line 240, in <listcomp>
    for ju in join_units]
  File "/python/lib/python3.6/site-packages/pandas/core/internals/concat.py", line 223, in get_reindexed_values
    fill_value=fill_value)
  File "/python/lib/python3.6/site-packages/pandas/core/algorithms.py", line 1579, in take_nd
    return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
  File "/python/lib/python3.6/site-packages/pandas/core/arrays/datetimelike.py", line 589, in take
    fill_value = self._validate_fill_value(fill_value)
  File "/python/lib/python3.6/site-packages/pandas/core/arrays/datetimes.py", line 656, in _validate_fill_value
    "Got '{got}'.".format(got=fill_value))
ValueError: 'fill_value' should be a Timestamp. Got '-9223372036854775808'.

If no timezone is specified, the merge works as expected:

x = pd.DataFrame([
    [pd.Timestamp('2018-01-01'), 4.0, pd.Timestamp('2019-01-01')]
], columns=['date', 'value', 'date2'])
y = x[:0]
y.merge(x, on='date')
Empty DataFrame
Columns: [value_x, date2_x, date, value_y, date2_y]
Index: []

It also works if there is only one date column:

x = pd.DataFrame([
    [pd.Timestamp('2018-01-01', tz='UTC'), 4.0]
], columns=['date', 'value'])
y = x[:0]
y.merge(x, on='date')
Empty DataFrame
Columns: [value_x, date, value_y]
Index: []

Problem description

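Merging an empty frame with a frame that has more than one datetime64[ns, UTC] column raises a ValueError. From the traceback, it seems that iNaT (the integer -9223372036854775808) is being passed as the fill_value to take on the tz-aware datetime array, rather than NaT.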
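To illustrate the distinction, here is a minimal sketch (not the pandas internals themselves). It assumes that the integer in the error message is the int64 value backing NaT, and it shows that take on a tz-aware datetime array accepts pd.NaT as the fill value. The array below is a stand-in for the 'date2' column from the report.

```python
import numpy as np
import pandas as pd

# iNaT is the int64 sentinel that backs NaT; it equals the minimum int64 value.
inat = np.array(["NaT"], dtype="datetime64[ns]").view("i8")[0]
print(inat)  # -9223372036854775808

# A tz-aware datetime array, standing in for the 'date2' column in the report.
arr = pd.DatetimeIndex(["2019-01-01"], tz="UTC").array

# With allow_fill=True, index -1 marks a missing position.
# Passing NaT (rather than the raw integer sentinel) as fill_value is accepted.
filled = arr.take([0, -1], allow_fill=True, fill_value=pd.NaT)
print(filled)
```

The second element of `filled` is NaT, which is what the merge would be expected to produce for missing positions in a tz-aware column.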

Expected Output

Empty DataFrame
Columns: [value_x, date2_x, date, value_y, date2_y]
Index: []
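
The expected behaviour could be captured in a small regression check along these lines. This is only a sketch: `merge_empty_on_date` is a hypothetical helper name, and the check compares column names as a set so it does not depend on column order. On a pandas version where the bug is present (0.24.0 in this report), the tz='UTC' case is expected to raise the ValueError shown above instead of passing.

```python
import pandas as pd


def merge_empty_on_date(tz=None):
    """Hypothetical helper: rebuild the frame from the report and merge an empty slice with it."""
    x = pd.DataFrame(
        [[pd.Timestamp("2018-01-01", tz=tz), 4.0, pd.Timestamp("2019-01-01", tz=tz)]],
        columns=["date", "value", "date2"],
    )
    y = x[:0]  # empty frame with the same columns as x
    return y.merge(x, on="date")


expected_columns = {"value_x", "date2_x", "date", "value_y", "date2_y"}

# Check both the naive case (works in the report) and the tz-aware case (fails in the report).
for tz in (None, "UTC"):
    result = merge_empty_on_date(tz)
    assert len(result) == 0
    assert set(result.columns) == expected_columns
```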

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: 4.1.1
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.0
bs4: None
html5lib: None
sqlalchemy: 1.2.16
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Labels: Bug; Regression (functionality that used to work in a prior pandas version); Timezones (timezone data dtype)
