BUG: df.stack() returns wrong data when NaT is in index (regression since 2.1.0, ok in <= 2.0.3) #57152

Open
@behrenhoff

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(
    data=[[1, 2, 3]],
    columns=pd.MultiIndex.from_tuples(
        [
            ("MAT", pd.Timestamp("2021-12-01"), "a"),
            ("ignore", pd.Timestamp("1970-12-01"), "a"),
            ("ignore", pd.NaT, "a"),
        ],
        names=("date_type", "date", "value_type"),
    ),
)

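# Variant 1: mask the "date" level directly with a boolean filter on "date_type".
unique_dates_v1 = df.columns.get_level_values("date")[
    df.columns.get_level_values("date_type") == "MAT"
].unique()

# Variant 2: stack every column level into the index, then take the "MAT" cross section.
unique_dates_via_stack = (
    df.stack(df.columns.names)
    .xs("MAT", level="date_type")
    .index.get_level_values("date")
    .unique()
)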

print(pd.__version__)
print("v1", unique_dates_v1)
print("v2", unique_dates_via_stack)

assert all(unique_dates_v1 == pd.Timestamp("2021-12-01"))
assert all(unique_dates_via_stack == pd.Timestamp("2021-12-01"))
assert unique_dates_v1.equals(unique_dates_via_stack)
print("all ok")

Issue Description

First of all, sorry for the rather complex dataframe. It was already quite challenging to reduce it from the one I was actually using...

Consider a DataFrame with a column MultiIndex where a NaT happens to appear in one of the levels.

Let's try to find the timestamps where date_type == "MAT". This can be done in two ways:
a) unique_dates_v1: a simple boolean mask using get_level_values. This works fine.
b) unique_dates_via_stack: stack all the column levels, producing a Series from which a cross section gives the result. This is the variant that fails with pandas >= 2.1.0.

I know there is future_stack=True in newer pandas versions, and future_stack seems to work fine (it is usually what I prefer). However, I ran into the error above while migrating older code. The legacy stack simply returns wrong data: the input contains no MAT entry with a 1970 date at all. Even if the old stack variant introduces additional NaNs, it should never return wrong data, not even in a deprecated implementation.

Expected Behavior

The same behavior as in pandas 2.0, i.e. stack should not assign wrong data to MAT.
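One possible way to check for the misassignment is to compare the column keys that come out of stack with the keys that went in. This is only a diagnostic sketch. The names stacked_keys, original_keys and spurious_keys are made up for illustration, and the result depends on the installed pandas version, so no output is shown.

```python
import pandas as pd

# Same frame as in the reproducible example.
df = pd.DataFrame(
    data=[[1, 2, 3]],
    columns=pd.MultiIndex.from_tuples(
        [
            ("MAT", pd.Timestamp("2021-12-01"), "a"),
            ("ignore", pd.Timestamp("1970-12-01"), "a"),
            ("ignore", pd.NaT, "a"),
        ],
        names=("date_type", "date", "value_type"),
    ),
)

# Default stack call, as in the failing variant (may emit a FutureWarning on pandas 2.x).
stacked = df.stack(list(df.columns.names))

# Drop the leading row label so each key is (date_type, date, value_type).
stacked_keys = {key[1:] for key in stacked.index}
original_keys = set(df.columns)

# Keys that appear after stacking but never existed as columns.
spurious_keys = stacked_keys - original_keys
print(pd.__version__, spurious_keys)
```

An empty set means stack did not invent any (date_type, date, value_type) combination. On an affected version I would expect a MAT key with the 1970 date to show up here.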

Installed Versions

works fine with pandas <= 2.0.3
fails with pandas >= 2.1.0

Metadata

Labels

Bug, Regression (Functionality that used to work in a prior pandas version), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
