
BUG: df.duplicated treats None as np.nan in object columns #21720

Open
@h-vetinari

I found this while writing tests for .duplicated in #21645 (until now, .duplicated was tested almost exclusively implicitly, through .drop_duplicates).

At first I thought this was intended behaviour for DataFrame.duplicated(), but Series.duplicated() does not treat None and np.nan as equal. The Series behaviour makes sense to me, since as objects, None is not np.nan. I have therefore labelled this as a bug.

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3, 3, None, np.nan], dtype=object)
s
# 0     NaN
# 1       3
# 2       3
# 3    None
# 4     NaN
# dtype: object

s.duplicated()
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool

s.to_frame().duplicated()
# 0    False
# 1    False
# 2     True
# 3     True  <- CHANGED: None is flagged as a duplicate of the NaN in row 0
# 4     True
# dtype: bool

Labels: Bug, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), duplicated (duplicated, drop_duplicates)
