API/ENH: overhaul/unify/improve .unique

The state of the various flavours of `.unique` as of `v0.23`:
- `[pd/Series/Index].unique` does not have `keep`-kwarg
- `Series.unique` returns an array, while `Series.drop_duplicates` returns a `Series`. Returning a plain `np.ndarray` is quite unusual for a `Series` method, and the differences between these closely related methods are confusing from a user perspective, IMO
- same point for `Index`
- `DataFrame.unique` does not exist, but would be a much more natural name (going by the behaviour of numpy and of `Series/Index`) than `.drop_duplicates`
- `pd.unique` chokes on 2-dimensional data
- no `return_inverse`-kwarg for any of the `.unique` variants; see #4087 (milestoned since 0.14), #21357
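
To make the return-type point concrete, here is a small sketch of how `Series.unique` and `Series.drop_duplicates` differ. The example data is made up for illustration, and it only uses calls that exist today; the behaviour shown is what I would expect on current pandas versions.

```python
import numpy as np
import pandas as pd

# Made-up example data with a non-default index
s = pd.Series([3, 1, 3, 2, 1], index=list("abcde"))

# .unique() returns a plain array: the index is lost and there is no `keep`
as_array = s.unique()

# .drop_duplicates() returns a Series and keeps the index of the surviving rows
as_series = s.drop_duplicates()

# `keep` only exists on drop_duplicates / duplicated
last_kept = s.drop_duplicates(keep="last")

print(type(as_array).__name__, as_array.tolist())
print(as_series.index.tolist())
print(last_kept.index.tolist())
```

Both calls answer the same question ("which values are distinct?"), but only one of them stays in pandas-land and only one of them lets the user choose which occurrence to keep.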

I originally wanted to add `df.unique(..., return_inverse=True|False)` for #21357, but got directed to add it to `duplicated` instead. After 3 months of slow progress in #21645 (with the PR essentially finished for 2 of them), @jorisvandenbossche gave the (IMO justified) feedback that:
> I think my main worry is that we are adding a `return_inverse` keyword which actually does not return the inverse for that function (it does return the inverse for another function), and that it is in name similar to numpy's keyword, but in usage also different.

and
> [...] it might make sense to add this to `pd.unique` / `Series.unique` as well? (not necessarily at the same time; or might actually be an easier starter)
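
For reference, this is roughly what "returning the inverse" means in numpy: the second return value reconstructs the original array from the unique values. The sketch also shows `pd.factorize`, which already gives something similar in pandas (with uniques in order of appearance rather than sorted). The data is made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([3, 1, 3, 2, 1])

# numpy: uniques are sorted, and `inverse` indexes into `uniques`
uniques, inverse = np.unique(values, return_inverse=True)
reconstructed = uniques[inverse]

# pandas: factorize returns (codes, uniques), uniques in order of first appearance
codes, pd_uniques = pd.factorize(values)
pd_reconstructed = pd_uniques[codes]

print(uniques.tolist(), inverse.tolist())
print(pd_uniques.tolist(), codes.tolist())
```

In both cases the inverse belongs to the function that returns the unique values, which is the mismatch pointed out above when the keyword lives on `duplicated`.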

This prompted me to take another look at the situation with `.unique`, and I found the inconsistencies listed above. To resolve them, I suggest:
- [ ] Change the return type of `[Series/Index].unique` to be the same as the caller (deprecation cycle by introducing `raw=None`, which at first defaults to True?)
- [ ] Add `keep`-kwarg to `[Series/Index].unique` (make `.unique` a wrapper around `.drop_duplicates`?)
- [ ] Add `df.unique` (as thin wrapper around `.drop_duplicates`?)
- [ ] Add `keep`-kwarg to `pd.unique` and dispatch to `DataFrame/Series/Index` as necessary
- [ ] Add `return_inverse`-kwarg to all of them (and add to EA interface); under the hood by exposing the same kwarg to `duplicated` and `drop_duplicates` as well
- [ ] (something for later) solve #21720 (treatment of `np.nan/None` in `df.duplicated` inconsistent vs. Series behaviour)
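
A rough sketch of what the combination of `keep` and `return_inverse` could look like for a `Series`, built on top of the existing `drop_duplicates`, `duplicated` and `pd.factorize`. The function name `unique_sketch` and its signature are hypothetical and only illustrate the proposal; nothing like this exists in pandas. It assumes there are no missing values (see #21720 for why that case needs separate care) and only handles `keep="first"` and `keep="last"`.

```python
import numpy as np
import pandas as pd


def unique_sketch(s, keep="first", return_inverse=False):
    """Hypothetical helper: unique values as a Series, optionally with inverse.

    Assumes `s` is a Series without missing values.
    """
    if keep not in ("first", "last"):
        # With keep=False values are dropped entirely, so no inverse exists
        raise ValueError("this sketch only supports keep='first' or keep='last'")

    uniques = s.drop_duplicates(keep=keep)
    if not return_inverse:
        return uniques

    # codes[i] identifies the distinct value found at position i of `s`
    codes, _ = pd.factorize(s)
    # code of each surviving row, in the order the rows appear in `uniques`
    kept_codes = codes[~s.duplicated(keep=keep).to_numpy()]
    # position_of[code] = position of that value inside `uniques`
    position_of = np.empty(len(kept_codes), dtype=np.intp)
    position_of[kept_codes] = np.arange(len(kept_codes))
    inverse = position_of[codes]
    return uniques, inverse


s = pd.Series([3, 1, 3, 2, 1], index=list("abcde"))
uniques_first, inverse_first = unique_sketch(s, keep="first", return_inverse=True)
uniques_last, inverse_last = unique_sketch(s, keep="last", return_inverse=True)

# In both cases the inverse reconstructs the original values
print(uniques_first.to_numpy()[inverse_first].tolist())
print(uniques_last.to_numpy()[inverse_last].tolist())
```

The inverse here is positional (it indexes into the values of the result), which mirrors numpy; whether it should instead be expressed in terms of index labels is one of the design questions this issue would need to settle.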

Each point is essentially self-contained and independent of the others, but of course they make more sense together.


API/ENH: overhaul/unify/improve .unique #22824
