Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
There is currently no top-level API to easily drop records with duplicate indices in a Series or a DataFrame. Series.drop_duplicates does not work on indices.
Many workarounds have been discussed on Stack Overflow, together with performance comparisons:
- https://stackoverflow.com/questions/13035764/remove-pandas-rows-with-duplicate-indices
- https://stackoverflow.com/questions/52105181/consider-duplicate-index-in-drop-duplicates-method-of-a-pandas-dataframe
- https://stackoverflow.com/questions/70429614/pandas-dataframes-remove-duplicate-index-keep-largest-value-first-depending-on
None of them feels "right": they are not very readable, even though this is a fairly standard operation that would be quite consistent with the existing drop_duplicates method (especially for Series!).
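For concreteness, here is a small illustration of the current situation (values chosen purely for the example):

```python
import pandas as pd

# Series with duplicate index labels
s = pd.Series([1, 2, 3], index=["a", "a", "b"])

# drop_duplicates() looks at the *values*, not the index, so nothing is removed here
s.drop_duplicates()
# a    1
# a    2
# b    3
# dtype: int64

# The usual workaround from the posts above: boolean-mask on Index.duplicated()
s[~s.index.duplicated(keep="first")]
# a    1
# b    3
# dtype: int64
```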
Feature Description
It would be convenient, in my opinion, to:
- add a new method drop_duplicate_indices(keep='first') that would basically perform df[~df.index.duplicated(keep=keep)] (from this answer); see the sketch after this list
- document this method so as to explain that users wishing to be "more intelligent" about which records to keep (for example keeping the record with the max, min, etc.) should consider using groupby or other kinds of aggregations. OR, enhance both drop_duplicates and drop_duplicate_indices with something as suggested in ENH: keep='random' option for .duplicated and .drop_duplicates #25838 (maybe a separate improvement)
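A minimal sketch of what the proposed method could boil down to, along with the groupby-based alternative for "more intelligent" selection (the name drop_duplicate_indices is only a proposal and is not part of pandas today):

```python
import pandas as pd

def drop_duplicate_indices(obj, keep="first"):
    """Hypothetical helper: drop rows whose index label has already been seen.

    `keep` is forwarded to Index.duplicated(), so it accepts
    'first', 'last' or False, just like drop_duplicates().
    """
    return obj[~obj.index.duplicated(keep=keep)]

df = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "a", "b"])

drop_duplicate_indices(df)               # keeps the first row for label 'a'
drop_duplicate_indices(df, keep="last")  # keeps the last row for label 'a'

# Users wanting to pick a specific record per label (e.g. the largest x)
# can instead aggregate over the index with groupby:
df.groupby(level=0).max()
```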
Alternative Solutions
See the Stack Overflow posts linked above.
Additional Context
No response