ENH: keep='random' option for .duplicated and .drop_duplicates #25838

Closed
@dylanjaide

Description

Enhancement description

The present implementation of df.duplicated() (and hence df.drop_duplicates()) offers only two options for users who wish to keep exactly one row from each set of duplicates: 'first' and 'last'. If the data is already ordered in some way, these options can introduce bias. It would be useful to have an option that selects the kept duplicate randomly, but also deterministically.

Details

In our use case, we are able to produce the desired result by other means: we wish to remove all but one of the events sharing a value in the 'eventNumber' column, so we introduce an additional 'random number' column using pd.util.hash_pandas_object(df, index=False). We then sort the dataframe by this column, apply df.drop_duplicates() with either keep='first' or keep='last', and finally sort by index to restore the original order (thanks to @chrisburr for this solution).
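The workaround above can be sketched as follows; the column names and data are illustrative, not taken from our actual dataset:

```python
import pandas as pd

# Illustrative frame with duplicates in 'eventNumber'.
df = pd.DataFrame({
    "eventNumber": [1, 1, 2, 2, 2, 3],
    "value": list("abcdef"),
})

# Row-wise hash (excluding the index) gives a deterministic,
# repeatable pseudo-random sort key.
df["_rand"] = pd.util.hash_pandas_object(df, index=False)

# Sort by the hash, drop duplicates keeping 'first', then restore
# the original row order and remove the helper column.
deduped = (
    df.sort_values("_rand")
      .drop_duplicates(subset="eventNumber", keep="first")
      .sort_index()
      .drop(columns="_rand")
)
```

Because the sort key is a hash of the row contents rather than a fresh RNG draw, re-running this on the same data always keeps the same rows.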

By using a hash instead of a standard RNG, the numbers used in the sorting are deterministic and repeatable. The result also remains stable when entries are added or removed while the rest are unmodified, which is desirable but not necessarily required. However, this method requires the hash to be computed over a subset of the columns in which there are no duplicates, and so a native implementation would require the underlying functions (the duplicated_{{dtype}} functions in pandas/_libs/hashtable_func_helper.pxi.in) to receive an additional argument.
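The stability property can be illustrated with a small sketch (example data is hypothetical): rows that are unchanged keep the same hash even after new rows are appended, whereas a fresh RNG draw would reshuffle every key:

```python
import pandas as pd

# Two frames: df2 is df1 with one extra row appended.
df1 = pd.DataFrame({"eventNumber": [1, 1, 2], "value": ["a", "b", "c"]})
df2 = pd.DataFrame({"eventNumber": [1, 1, 2, 3], "value": ["a", "b", "c", "d"]})

# Row-wise hashes, independent of the index and of the other rows.
h1 = pd.util.hash_pandas_object(df1, index=False)
h2 = pd.util.hash_pandas_object(df2, index=False)

# The hashes of the three shared rows are identical in both frames,
# so the same rows would survive drop_duplicates in either case.
assert (h1.values == h2.values[:3]).all()
```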
