The table below gives an overview of the result of `missing_value in idx`,
i.e. how `Index.__contains__` handles the various missing-value sentinels as input for the different data types.
dtype | None | nan | <NA> | NaT |
---|---|---|---|---|
object-none | True | False | False | False |
object-nan | False | True | False | False |
object-NA | False | False | True | False |
datetime | True | True | True | True |
period | True | True | True | True |
timedelta | True | True | True | True |
float64 | False | True | False | False |
categorical | True | True | True | True |
interval | True | True | True | False |
nullable_int | False | False | True | False |
nullable_float | False | False | True | False |
string-python | False | False | False | False |
string-pyarrow | False | False | False | False |
str-python | False | False | False | False |
The last three rows, which do not contain a single True, are especially problematic; this looks like a bug in StringDtype.
More generally, the behavior is quite inconsistent:
- For object dtype, we require exact match
- For datetimelike and categorical, we match any missing-like
- For interval, we match any missing-like except NaT (also not in case of datetimelike interval dtype)
- For float we only match NaN
- For nullable dtypes (int/float), we only match NA
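For comparison, a sentinel-independent check already seems possible through `Index.isna()`, which flags None, NaN, NaT and NA alike. Below is a minimal sketch of that idea; the helper name `has_missing` and the `examples` dict are made up for illustration and are not a pandas API, and this is not a proposal for how `__contains__` itself should behave.

```python
import numpy as np
import pandas as pd


def has_missing(idx: pd.Index) -> bool:
    """Return True if idx holds any missing value, whatever the sentinel.

    Hypothetical helper for illustration only (not part of pandas).
    """
    # Index.isna() returns a boolean array marking None, NaN, NaT and pd.NA
    return bool(idx.isna().any())


# A few of the dtypes from the table above, plus one index without missing values
examples = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-NA": pd.Index(["a", pd.NA], dtype=object),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "no-missing": pd.Index(["a", "b"], dtype=object),
}

results = {name: has_missing(idx) for name, idx in examples.items()}
print(results)
```

Unlike `missing_value in idx`, this does not depend on which sentinel the caller passes, only on whether the index stores a missing value at all.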
The code to generate the table above:
import numpy as np
import pandas as pd
# from conftest.py
indices_dict = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-nan": pd.Index(["a", np.nan], dtype=object),
    "object-NA": pd.Index(["a", pd.NA], dtype=object),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "period": pd.PeriodIndex(["2024-01-01", None], freq="D"),
    "timedelta": pd.TimedeltaIndex(["1 days", "NaT"]),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "categorical": pd.CategoricalIndex(["a", None]),
    "interval": pd.IntervalIndex.from_tuples([(1, 2), np.nan]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "nullable_float": pd.Index([2.0, None], dtype="Float32"),
    "string-python": pd.Index(["a", None], dtype="string[python]"),
    "string-pyarrow": pd.Index(["a", None], dtype="string[pyarrow]"),
    "str-python": pd.Index(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan)),
}
# Test each missing-value sentinel against each index with `in`
results = []
for dtype, data in indices_dict.items():
    for val in [None, np.nan, pd.NA, pd.NaT]:
        res = val in data
        results.append((dtype, str(val), res))

# Pivot to one row per dtype and one column per sentinel, keeping insertion order
df = pd.DataFrame(results, columns=["dtype", "val", "result"])
df_overview = df.pivot(columns="val", index="dtype", values="result").reindex(
    columns=df["val"].unique(), index=df["dtype"].unique()
)
print(df_overview.astype(str).to_markdown())
cc @jbrockmendel I would have expected that we already had issues about this, but I didn't find anything directly