BUG: Index.get_indexer with mixed-reso datetime64s #50690

Open
@jbrockmendel

import numpy as np
import pandas as pd

ms = np.datetime64(1, "ms")
us = np.datetime64(1000, "us")

left = pd.Index([ms], dtype=object)
right = pd.Index([us], dtype=object)

assert left[0] == right[0]
assert (left == right).all()

>>> left[0] in right  # <- wrong
False
>>> right[0] in left  # <- wrong
False

>>> left.get_loc(right[0])  # <- raises KeyError, incorrectly
>>> right.get_loc(left[0])  # <- raises KeyError, incorrectly

>>> left.get_indexer(right)  # works correctly AFAICT because it doesn't use the hashtable

# But in a non-monotonic case...
sec = np.datetime64("9999-01-01", "s")
day = np.datetime64("2016-01-01", "D")
left2 = pd.Index([ms, sec, day], dtype=object)

>>> left2[:1].get_indexer(right)
array([0])
>>> left2.get_indexer(right)  # <- wrong
array([-1])

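IIUC the issue is in the hashing of the datetime64 objects, which do not satisfy the invariant x == y ⇒ hash(x) == hash(y) (xref numpy/numpy#3836). Hashtable-based lookups assume that invariant, so two equal values with different resolutions can land in different buckets and never be compared.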
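A quick way to check whether the invariant holds in a given environment is sketched below. `hash_invariant_holds` is a hypothetical helper written for this illustration, not a numpy or pandas function, and the printed result may differ between NumPy versions.

```python
import numpy as np


def hash_invariant_holds(x, y):
    """Return True unless x == y while hash(x) != hash(y)."""
    if not bool(x == y):
        # The invariant says nothing about unequal values.
        return True
    return hash(x) == hash(y)


ms = np.datetime64(1, "ms")
us = np.datetime64(1000, "us")

# Both scalars represent the same instant, so they compare equal.
print("equal:", bool(ms == us))
# Whether the hashes also agree depends on the installed NumPy.
print("invariant holds:", hash_invariant_holds(ms, us))
```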

When implementing non-nanosecond support for Timestamp/Timedelta, we implemented __hash__ to retain this invariant (at the cost of performance).

Unless numpy changes its behavior, I think to fix this we need to patch how we treat datetime64 objects in our khash code, likely mirroring Timestamp.__hash__. cc @realead thoughts?
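To make the idea concrete, here is a minimal pure-Python sketch of a resolution-independent key. This is not the khash patch and not how Timestamp.__hash__ is written; `reso_independent_key` and `_NS_PER_UNIT` are hypothetical names, only the linear units from "D" to "ns" are covered, and NaT is not handled. It normalizes each datetime64 scalar to an arbitrary-precision Python int of nanoseconds since the epoch, so equal instants produce equal keys even when the value would overflow int64 nanoseconds.

```python
import numpy as np

# Nanoseconds per unit, for the linear units this sketch supports.
_NS_PER_UNIT = {
    "D": 86_400 * 10**9,
    "h": 3_600 * 10**9,
    "m": 60 * 10**9,
    "s": 10**9,
    "ms": 10**6,
    "us": 10**3,
    "ns": 1,
}


def reso_independent_key(dt):
    """Map a datetime64 scalar to integer nanoseconds since the epoch.

    Hypothetical helper: does not handle NaT or the Y/M/W and sub-ns units.
    """
    unit, count = np.datetime_data(dt.dtype)
    if unit not in _NS_PER_UNIT:
        raise NotImplementedError(f"unit {unit!r} not covered by this sketch")
    # Convert to a Python int first so the multiplication cannot overflow.
    raw = int(dt.astype(np.int64))
    return raw * count * _NS_PER_UNIT[unit]


ms = np.datetime64(1, "ms")
us = np.datetime64(1000, "us")
sec = np.datetime64("9999-01-01", "s")
day = np.datetime64("2016-01-01", "D")

# Emulate a hashtable-backed lookup over [ms, sec, day] using normalized keys.
lookup = {reso_independent_key(v): i for i, v in enumerate([ms, sec, day])}
position = lookup.get(reso_independent_key(us), -1)
print(position)  # 0
```

The per-lookup normalization is the kind of extra work the performance caveat above refers to; a real fix would have to do the equivalent inside the hashtable code rather than in Python.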

Labels: Bug; Indexing (related to indexing on series/frames, not to indexes themselves); Non-Nano (datetime64/timedelta64 with non-nanosecond resolution); Upstream issue (issue related to pandas dependency)