import numpy as np
import pandas as pd
ms = np.datetime64(1, "ms")
us = np.datetime64(1000, "us")
left = pd.Index([ms], dtype=object)
right = pd.Index([us], dtype=object)
assert left[0] == right[0]
assert (left == right).all()
>>> left[0] in right # <- wrong
False
>>> right[0] in left # <- wrong
False
>>> left.get_loc(right[0]) # <- raises, incorrectly
>>> right.get_loc(left[0]) # <- raises, incorrectly
>>> left.get_indexer(right) # works correctly AFAICT because it doesn't use the hashtable
# But in a non-monotonic case...
sec = np.datetime64("9999-01-01", "s")
day = np.datetime64("2016-01-01", "D")
left2 = pd.Index([ms, sec, day], dtype=object)
>>> left2[:1].get_indexer(right)
array([0])
>>> left2.get_indexer(right) # <- wrong
array([-1])
IIUC the issue is in the hashing of the datetime64 objects, which does not satisfy the invariant x == y implies hash(x) == hash(y)
(xref numpy/numpy#3836)
When implementing non-nanosecond support for Timestamp/Timedelta, we implemented __hash__ to retain this invariant (at the cost of performance).
Unless numpy changes its behavior, I think to fix this we need to patch how we treat datetime64 objects in our khash code, likely mirroring Timestamp.__hash__. cc @realead thoughts?
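
For illustration only, here is a rough pure-Python sketch of what a unit-independent hash for datetime64 scalars could look like. It is not the khash patch and it is not how Timestamp.__hash__ is implemented. The names `unit_invariant_hash` and `_ATTOS_PER_UNIT` are hypothetical, and as a simplifying assumption the calendar units (Y, M) just fall back to the default hash.

```python
import numpy as np

# Attoseconds per unit for the fixed-length datetime64 units.
# Calendar units ("Y", "M") have no fixed length and are left out on purpose.
_ATTOS_PER_UNIT = {
    "W": 7 * 86400 * 10**18,
    "D": 86400 * 10**18,
    "h": 3600 * 10**18,
    "m": 60 * 10**18,
    "s": 10**18,
    "ms": 10**15,
    "us": 10**12,
    "ns": 10**9,
    "ps": 10**6,
    "fs": 10**3,
    "as": 1,
}


def unit_invariant_hash(dt):
    """Hash a np.datetime64 scalar by its offset from the epoch in attoseconds.

    Hypothetical helper: equal instants stored in different fixed-length
    units get the same hash.
    """
    if np.isnat(dt):
        # NaT never compares equal to anything, so any constant is acceptable.
        return 0
    unit, count = np.datetime_data(dt.dtype)
    if unit not in _ATTOS_PER_UNIT:
        # Assumption: calendar units keep numpy's default hash.
        return hash(dt)
    # Python ints are arbitrary precision, so far-future values do not overflow.
    value = int(dt.astype(np.int64))
    return hash(value * count * _ATTOS_PER_UNIT[unit])


ms = np.datetime64(1, "ms")
us = np.datetime64(1000, "us")
same_hash = unit_invariant_hash(ms) == unit_invariant_hash(us)
```

Doing the arithmetic in Python ints avoids casting everything to nanoseconds, which would not be able to represent values like the 9999-01-01 example above. A real fix in the hashtable code would need something cheaper than this.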