
DOC: Add example of drop_duplicates dropping a first-level #47813

Open
@smarie

Description


Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

There is currently no top-level API to easily drop records with duplicate index labels in a Series or a DataFrame. Series.drop_duplicates compares values only and does not look at the index.
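To make the gap concrete, here is a minimal sketch with made-up data. It shows that Series.drop_duplicates leaves repeated index labels untouched, and that the usual workaround goes through Index.duplicated. The variable names are illustrative only.

```python
import pandas as pd

# Made-up data: the label "a" appears twice, but all values are distinct.
s = pd.Series([1, 2, 3], index=["a", "a", "b"])

# drop_duplicates compares values, so nothing is removed here.
by_value = s.drop_duplicates()

# Index.duplicated flags repeated labels; inverting the mask keeps
# the first record seen for each label.
mask = s.index.duplicated(keep="first")
by_index = s[~mask]

print(by_value)
print(by_index)
```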

Many workarounds have been discussed on Stack Overflow, together with performance comparisons:

None of them feels "right", in the sense that none is very readable, even though this is a fairly standard operation that would fit naturally alongside the drop_duplicates method (especially for Series!).

Feature Description

In my opinion, it would be convenient to:

  • add a new method drop_duplicate_indices(keep='first') that would basically perform df[~df.index.duplicated(keep=keep)] (from this answer)
  • document this method, explaining that users who want to be "more intelligent" about which record to keep (for example, the one with the max or min value) should consider using groupby or another kind of aggregation. Alternatively, enhance both drop_duplicates and drop_duplicate_indices along the lines suggested in ENH: keep='random' option for .duplicated and .drop_duplicates #25838 (possibly as a separate improvement)
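The two points above can be sketched as follows. Note that drop_duplicate_indices does not exist in pandas: the standalone function below is a hypothetical stand-in for the proposed method, and the sample data is made up. The groupby call illustrates the aggregation route for keeping, say, the maximum per label.

```python
import pandas as pd


def drop_duplicate_indices(obj, keep="first"):
    """Hypothetical helper (not a pandas API): drop rows whose index
    label is duplicated. `keep` is passed to Index.duplicated and
    accepts 'first', 'last', or False."""
    return obj[~obj.index.duplicated(keep=keep)]


# Made-up data with a repeated label "a".
df = pd.DataFrame({"x": [10, 30, 20]}, index=["a", "a", "b"])

first = drop_duplicate_indices(df)              # keeps the first "a" row
last = drop_duplicate_indices(df, keep="last")  # keeps the last "a" row
none = drop_duplicate_indices(df, keep=False)   # drops every duplicated label

# Aggregation alternative: keep the maximum value per index label.
largest = df.groupby(level=0).max()
```

Because the helper only relies on boolean-mask indexing, the same function works for both a Series and a DataFrame.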

Alternative Solutions

See the Stack Overflow posts mentioned above.

Additional Context

No response
