API/ENH: overhaul/unify/improve .unique #22824

Open
@h-vetinari

The state of the various flavours of .unique as of v0.23:

  • [pd/Series/Index].unique does not have a keep kwarg
  • Series.unique returns an np.ndarray, whereas Series.drop_duplicates returns a Series. Returning a plain np.ndarray is quite unusual for a Series method, and the differences between these closely related methods are confusing from a user perspective, IMO
  • same point for Index
  • DataFrame.unique does not exist, but would be a much more natural candidate (given the behaviour of numpy and of Series/Index) than .drop_duplicates
  • pd.unique chokes on 2-dimensional data
  • no return_inverse kwarg for any of the .unique variants; see #4087 ("API: provide a better way of doing np.unique(return_inverses=True)", milestoned since 0.14) and #21357 ("ENH: adding .unique() to DF (or return_inverse for duplicated)")
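
For illustration, a minimal sketch of two of the points above: the differing return types of Series.unique and Series.drop_duplicates, and what numpy's return_inverse provides. The example data is made up, and the output types noted in the comments assume a plain integer Series on a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# Made-up example data with duplicate values and string labels.
s = pd.Series([3, 1, 3, 2, 1], index=list("abcde"))

# Series.unique drops the index and hands back a plain NumPy array.
uniques = s.unique()
print(type(uniques).__name__, uniques.tolist())

# Series.drop_duplicates keeps the Series type and the labels of the
# first occurrence of each value.
deduped = s.drop_duplicates()
print(type(deduped).__name__, list(deduped.index))

# numpy's return_inverse: indices that rebuild the input from the
# (sorted) unique values.
values, inverse = np.unique(s.to_numpy(), return_inverse=True)
reconstructed = values[inverse]
print(np.array_equal(reconstructed, s.to_numpy()))
```

Note that np.unique sorts its result, while the pandas methods preserve order of appearance, which is one more difference a unified API would have to settle.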

I originally wanted to add df.unique(..., return_inverse=True|False) for #21357, but got directed to add it to duplicated instead. After slow progress over 3 months in #21645 (the PR has been essentially finished for 2 of them), @jorisvandenbossche brought up the following feedback, which is justified IMO:

I think my main worry is that we are adding a return_inverse keyword which actually does not return the inverse for that function (it does return the inverse for another function), and that it is in name similar to numpy's keyword, but in usage also different.

and

[...] it might make sense to add this to pd.unique / Series.unique as well? (not necessarily at the same time; or might actually be an easier starter)

This prompted me to have another look at the situation with .unique, and I found the inconsistencies listed above. To resolve them, I suggest the following:

  • Change the return type of [Series/Index].unique to match the caller (with a deprecation cycle, by introducing raw=None, which at first defaults to True?)
  • Add a keep kwarg to [Series/Index].unique (make .unique a wrapper around .drop_duplicates?)
  • Add df.unique (as a thin wrapper around .drop_duplicates?)
  • Add a keep kwarg to pd.unique and dispatch to DataFrame/Series/Index as necessary
  • Add a return_inverse kwarg to all of them (and to the ExtensionArray (EA) interface), implemented under the hood by exposing the same kwarg on duplicated and drop_duplicates as well
  • (something for later) solve #21720 ("BUG: df.duplicated treats None as np.nan in object columns"): the treatment of np.nan/None in df.duplicated is inconsistent with the Series behaviour
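
To make the proposal more concrete, here is a rough sketch of what a unified .unique could look like when built on top of .drop_duplicates. The helper name unique_sketch is hypothetical and not part of pandas. The positional, numpy-style meaning of the inverse is an assumption, since the exact semantics are still open in this issue. The inverse is only sketched for Series with hashable, non-missing values.

```python
import numpy as np
import pandas as pd


def unique_sketch(obj, keep="first", return_inverse=False):
    """Hypothetical helper (NOT an existing pandas API)."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    # drop_duplicates already returns the same type as the caller
    # (Series, Index or DataFrame) and already understands `keep`.
    uniques = obj.drop_duplicates(keep=keep)
    if not return_inverse:
        return uniques

    if not isinstance(obj, pd.Series):
        raise NotImplementedError("inverse is only sketched for Series")

    # Assumption: the inverse holds, for every original value, its
    # position within `uniques`, so uniques.iloc[inverse] rebuilds
    # the original values (as np.unique's return_inverse does).
    lookup = pd.Index(uniques.to_numpy())
    inverse = lookup.get_indexer(obj.to_numpy())
    return uniques, inverse


s = pd.Series([3, 1, 3, 2, 1], index=list("abcde"))
uniques, inverse = unique_sketch(s, keep="last", return_inverse=True)
rebuilt = uniques.iloc[inverse].to_numpy()
print(np.array_equal(rebuilt, s.to_numpy()))

# For a DataFrame the sketch is just the thin wrapper mentioned above.
df = pd.DataFrame({"x": [1, 1, 2], "y": ["a", "a", "b"]})
df_uniques = unique_sketch(df)
print(len(df_uniques))
```

A real implementation would presumably compute the inverse inside the hashtable code behind duplicated rather than through a second lookup, and would need to decide how missing values are treated.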

Each point is essentially self-contained and independent of the others, but of course they make more sense together.
