Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2, 3]],
    columns=pd.MultiIndex.from_tuples(
        [
            ("MAT", pd.Timestamp("2021-12-01"), "a"),
            ("ignore", pd.Timestamp("1970-12-01"), "a"),
            ("ignore", pd.NaT, "a"),
        ],
        names=("date_type", "date", "value_type"),
    ),
)

# Variant a): select the dates directly from the column MultiIndex.
unique_dates_v1 = df.columns.get_level_values("date")[
    df.columns.get_level_values("date_type") == "MAT"
].unique()

# Variant b): stack all column levels, then take a cross section.
unique_dates_via_stack = (
    df.stack(df.columns.names)
    .xs("MAT", level="date_type")
    .index.get_level_values("date")
    .unique()
)

print(pd.__version__)
print("v1", unique_dates_v1)
print("v2", unique_dates_via_stack)
assert all(unique_dates_v1 == pd.Timestamp("2021-12-01"))
assert all(unique_dates_via_stack == pd.Timestamp("2021-12-01"))
assert unique_dates_v1.equals(unique_dates_via_stack)
print("all ok")
Issue Description
First of all, sorry for the rather complex dataframe. It was already quite challenging to reduce it from the one I was actually using...
Consider a DataFrame with a column MultiIndex where a NaT happens to appear in one of the levels.
Let's try to find the timestamps where date_type == "MAT". This can be done in two ways:
a) unique_dates_v1: a simple selection using get_level_values - works fine
b) unique_dates_via_stack: stack all the columns into a Series, then take a cross section to get the result. This is the version that fails with pandas >= 2.1.0
I know there is future_stack=True in newer pandas versions, and future_stack seems to work fine (and is usually what I prefer). However, I ran into the problem described above while migrating older code. The legacy stack version simply returns wrong data: there is no MAT entry with a 1970 date at all. Even if the old stack variant introduces additional NaNs, it should never return wrong data, not even in a deprecated stack implementation.
Expected Behavior
Behavior as in pandas 2.0, i.e. no wrong data should be assigned to MAT.
Installed Versions
works fine with pandas <= 2.0.3
fails with pandas >= 2.1.0