Description
Enhancement description
The present implementation of `df.duplicated()` (and hence `df.drop_duplicates()`) only has two options for users who wish to keep exactly one row from each set of duplicates: `'first'` and `'last'`. In some use cases, if the data is already ordered in some way, these options can introduce bias. It would be useful to have an option that allows the kept duplicate to be randomly (but also deterministically) chosen.
Details
In our use case, we are able to produce the desired result by other means: we wish to remove all but one of the events sharing a duplicate in the 'eventNumber' column, so we introduce an additional 'random number' column using `pd.util.hash_pandas_object(df, index=False)`. We then sort the dataframe by this column, apply `df.drop_duplicates()` using either `keep='first'` or `keep='last'`, and then sort by index again (thanks to @chrisburr for this solution).
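The workaround above can be sketched roughly as follows (the helper name `drop_duplicates_random` and the `_hash` column name are our own, not pandas API):

```python
import pandas as pd

def drop_duplicates_random(df, subset, keep='first'):
    """Drop duplicates, keeping one deterministically 'random' row per group.

    Sketch of the sort-by-hash workaround; not part of pandas itself.
    """
    # Hash each row's values (index excluded) to get a deterministic,
    # repeatable pseudo-random sort key.
    key = pd.util.hash_pandas_object(df, index=False)
    return (
        df.assign(_hash=key.values)   # attach the sort key
          .sort_values('_hash')       # shuffle deterministically
          .drop_duplicates(subset=subset, keep=keep)
          .drop(columns='_hash')      # remove the helper column
          .sort_index()               # restore the original row order
    )

df = pd.DataFrame({'eventNumber': [1, 1, 2, 2, 2],
                   'value': [10, 11, 20, 21, 22]})
deduped = drop_duplicates_random(df, subset='eventNumber')
```

Because the sort key comes from a hash of the row values rather than an RNG, repeated calls on the same data select the same rows.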
By using a hash instead of a standard RNG, the numbers used in the sorting are deterministic and repeatable. The selection also remains stable when entries are added or removed but the rest are not modified, which is desirable but not necessarily required. However, this method requires the hash to have knowledge of another subset of the columns in which there are no duplicates, and so would require the underlying functions (the `duplicated_{{dtype}}` functions in `pandas/_libs/hashtable_func_helper.pxi.in`) to receive an additional argument.
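A small check illustrating the stability property mentioned above: since `hash_pandas_object(..., index=False)` hashes each row from its own values alone (with a fixed default hash key), removing other rows leaves the surviving rows' hashes unchanged. The example data here is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'eventNumber': [1, 1, 2],
                   'value': [10, 11, 20]})

# Hash all three rows, then again after dropping the last row.
h_full = pd.util.hash_pandas_object(df, index=False)
h_smaller = pd.util.hash_pandas_object(df.iloc[:2], index=False)
```

The hashes of the first two rows are identical in both cases, so the 'random' ordering (and hence which duplicate survives) is stable under additions and removals elsewhere in the frame.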