API: handling of missing values in Index.__contains__ #59765

Open
@jorisvandenbossche

The table below gives an overview of the result of:

missing_value in idx

i.e. how Index.__contains__ handles each missing-value sentinel, passed as the key, for the different dtypes.

| dtype          | None  | nan   | <NA>  | NaT   |
|:---------------|:------|:------|:------|:------|
| object-none    | True  | False | False | False |
| object-nan     | False | True  | False | False |
| object-NA      | False | False | True  | False |
| datetime       | True  | True  | True  | True  |
| period         | True  | True  | True  | True  |
| timedelta      | True  | True  | True  | True  |
| float64        | False | True  | False | False |
| categorical    | True  | True  | True  | True  |
| interval       | True  | True  | True  | False |
| nullable_int   | False | False | True  | False |
| nullable_float | False | False | True  | False |
| string-python  | False | False | False | False |
| string-pyarrow | False | False | False | False |
| str-python     | False | False | False | False |

The last three rows, which contain not a single True, are especially problematic; this looks like a bug in StringDtype.
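A minimal check for the string case might look like the sketch below. It assumes the python-backed `string[python]` storage so that pyarrow is not needed, and it contrasts `in` with `Index.hasnans`. The membership results depend on the pandas version, so they are printed rather than asserted.

```python
import numpy as np
import pandas as pd

idx = pd.Index(["a", None], dtype="string[python]")

# The index does hold a missing value ...
has_missing = idx.hasnans

# ... and this records what `in` reports for each missing-value sentinel.
membership = {repr(val): (val in idx) for val in (None, np.nan, pd.NA, pd.NaT)}

print(has_missing)
print(membership)
```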

More generally, the behaviour is quite inconsistent:

  • For object dtype, we require exact match
  • For datetimelike and categorical, we match any missing-like
  • For interval, we match any missing-like except NaT (NaT is not matched even for a datetimelike interval dtype)
  • For float we only match NaN
  • For nullable dtypes (int/float), we only match NA
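For comparison, one possible consistent rule ("any missing-like key matches any missing entry") can be sketched in user code. `contains_missing_aware` is a hypothetical helper written for illustration only; it is not part of pandas, nor a proposal for how `Index.__contains__` should be implemented.

```python
import numpy as np
import pandas as pd


def contains_missing_aware(idx: pd.Index, key) -> bool:
    """Hypothetical helper (not pandas API): a missing-like scalar key
    matches whenever the index holds any missing value, for every dtype."""
    if pd.api.types.is_scalar(key) and pd.isna(key):
        return bool(idx.hasnans)
    # Non-missing keys fall back to the regular membership test.
    return key in idx


indices = {
    "object-none": pd.Index(["a", None], dtype=object),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "no-missing": pd.Index([1.0, 2.0], dtype="float64"),
}

for name, idx in indices.items():
    row = [contains_missing_aware(idx, val) for val in (None, np.nan, pd.NA, pd.NaT)]
    print(name, row)
```

Under this rule, every index that holds a missing value answers True for all four sentinels, and an index without missing values answers False for all of them.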

The code to generate the table above:

import numpy as np
import pandas as pd

# from conftest.py
indices_dict = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-nan": pd.Index(["a", np.nan], dtype=object),
    "object-NA": pd.Index(["a", pd.NA], dtype=object),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "period": pd.PeriodIndex(["2024-01-01", None], freq="D"),
    "timedelta": pd.TimedeltaIndex(["1 days", "NaT"]),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "categorical": pd.CategoricalIndex(["a", None]),
    "interval": pd.IntervalIndex.from_tuples([(1, 2), np.nan]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "nullable_float": pd.Index([2.0, None], dtype="Float32"),
    "string-python": pd.Index(["a", None], dtype="string[python]"),
    "string-pyarrow": pd.Index(["a", None], dtype="string[pyarrow]"),
    "str-python": pd.Index(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan))
}

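# Test each missing-value sentinel for membership in each index.
results = []
for dtype, data in indices_dict.items():
    for val in [None, np.nan, pd.NA, pd.NaT]:
        res = val in data
        results.append((dtype, str(val), res))

# Pivot to one row per dtype and one column per sentinel, keeping insertion order.
df = pd.DataFrame(results, columns=["dtype", "val", "result"])
df_overview = df.pivot(columns="val", index="dtype", values="result").reindex(
    columns=df["val"].unique(), index=df["dtype"].unique()
)

# to_markdown() requires the optional "tabulate" dependency.
print(df_overview.astype(str).to_markdown())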

cc @jbrockmendel I would have expected that we already had issues about this, but I couldn't find any directly.

Labels: API - Consistency, Index, Missing-data
