Description
Enhancement description
The present implementation of `df.duplicated()` (and hence `df.drop_duplicates()`) only has two options for users who wish to keep exactly one row from each set of duplicates: `'first'` and `'last'`. In some use cases, if the data is already ordered in some way, these options can introduce bias. It would be useful to have an option that allows the kept duplicate to be randomly (but also deterministically) chosen.
Details
In our use case, we are able to produce the desired result by other means: we wish to remove all but one of the events sharing a duplicate in the 'eventNumber' column, so we introduce an additional 'random number' column using `pd.util.hash_pandas_object(df, index=False)`. We then sort the dataframe by this column, apply `df.drop_duplicates()` using either `keep='first'` or `keep='last'`, and then sort by index again (thanks to @chrisburr for this solution).
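The workaround above can be sketched roughly as follows (the helper name `drop_duplicates_random` and the `_hash` column name are our own, not pandas API):

```python
import pandas as pd

def drop_duplicates_random(df, subset, keep='first'):
    """Drop duplicates, keeping one deterministically 'random' row per group.

    Sketch of the sort-by-hash workaround; not part of pandas itself.
    """
    # Hash each row's values (index excluded) to get a deterministic,
    # repeatable pseudo-random sort key.
    key = pd.util.hash_pandas_object(df, index=False)
    return (
        df.assign(_hash=key.values)   # attach the sort key
          .sort_values('_hash')       # shuffle deterministically
          .drop_duplicates(subset=subset, keep=keep)
          .drop(columns='_hash')      # remove the helper column
          .sort_index()               # restore the original row order
    )

df = pd.DataFrame({'eventNumber': [1, 1, 2, 2, 2],
                   'value': [10, 11, 20, 21, 22]})
deduped = drop_duplicates_random(df, subset='eventNumber')
```

Because the sort key comes from a hash of the row values rather than an RNG, repeated calls on the same data select the same rows.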
By using a hash instead of a standard RNG, the numbers used in the sorting are deterministic and repeatable. The selection also remains stable when entries are added or removed but the rest are not modified, which is desirable but not necessarily required. However, this method requires the hash to have knowledge of another subset of the columns in which there are no duplicates, and so would require the underlying functions (the `duplicated_{{dtype}}` functions in `pandas/_libs/hashtable_func_helper.pxi.in`) to receive an additional argument.
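A small check illustrating the stability property mentioned above: since `hash_pandas_object(..., index=False)` hashes each row from its own values alone (with a fixed default hash key), removing other rows leaves the surviving rows' hashes unchanged. The example data here is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'eventNumber': [1, 1, 2],
                   'value': [10, 11, 20]})

# Hash all three rows, then again after dropping the last row.
h_full = pd.util.hash_pandas_object(df, index=False)
h_smaller = pd.util.hash_pandas_object(df.iloc[:2], index=False)
```

The hashes of the first two rows are identical in both cases, so the 'random' ordering (and hence which duplicate survives) is stable under additions and removals elsewhere in the frame.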