Description
The state of the various flavours of `.unique` as of v0.23:
- `[pd/Series/Index].unique` does not have a `keep` kwarg
- `Series.unique` returns an array, whereas `Series.drop_duplicates` returns a `Series`. Returning a plain `np.ndarray` is quite unusual for a `Series` method, and furthermore the differences between these closely related methods are confusing from a user perspective, IMO
- same point for `Index`
- `DataFrame.unique` does not exist, but is a much more natural candidate (from the behaviour of numpy, resp. `Series`/`Index`) than `.drop_duplicates`
- `pd.unique` chokes on 2-dimensional data
- no `return_inverse` kwarg for any of the `.unique` variants; see API: provide a better way of doing np.unique(return_inverses=True) #4087 (milestoned since 0.14) and ENH: adding .unique() to DF (or return_inverse for duplicated) #21357
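
To make the differences concrete, here is a small sketch of the behaviour described in the list above. The sample data is made up for illustration, and the exact behaviour may vary between pandas versions.

```python
import numpy as np
import pandas as pd

# Illustrative data only; the index labels make it visible what gets dropped.
s = pd.Series([3, 1, 3, 2, 1], index=list("abcde"))

# Series.unique returns a plain NumPy array: the index is lost.
uniques = s.unique()

# Series.drop_duplicates returns a Series and keeps the original labels.
deduped = s.drop_duplicates()

# drop_duplicates accepts keep=..., while unique has no such kwarg.
deduped_last = s.drop_duplicates(keep="last")

# NumPy can hand back the inverse mapping; none of the pandas .unique
# variants expose an equivalent kwarg.
np_uniques, np_inverse = np.unique(s.to_numpy(), return_inverse=True)

# Checked at runtime rather than assumed.
has_df_unique = hasattr(pd.DataFrame, "unique")

print(type(uniques).__name__, uniques.tolist())
print(type(deduped).__name__, deduped.index.tolist())
print(deduped_last.index.tolist())
print(np_uniques[np_inverse].tolist())  # reconstructs the original values
print("DataFrame.unique exists:", has_df_unique)
```

Note that `np.unique` sorts its result, whereas `Series.unique` keeps the order of first appearance, which is one more difference any `return_inverse` design would have to settle.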
I originally wanted to add `df.unique(..., return_inverse=True|False)` for #21357, but got directed to add it to `duplicated` instead. After slow progress over 3 months in #21645 (PR essentially finished since 2), @jorisvandenbossche brought up the (IMO justified) feedback that:
> I think my main worry is that we are adding a `return_inverse` keyword which actually does not return the inverse for that function (it does return the inverse for another function), and that it is in name similar to numpy's keyword, but in usage also different.

and

> [...] it might make sense to add this to `pd.unique`/`Series.unique` as well? (not necessarily at the same time; or might actually be an easier starter)
This prompted me to have another look at the situation with `.unique`, and I found the inconsistencies listed above. To resolve them, I suggest to:
- Change the return type of `[Series/Index].unique` to be the same as the caller (deprecation cycle by introducing `raw=None`, which at first defaults to `True`?)
- Add a `keep` kwarg to `[Series/Index].unique` (make `.unique` a wrapper around `.drop_duplicates`?)
- Add `df.unique` (as a thin wrapper around `.drop_duplicates`?)
- Add a `keep` kwarg to `pd.unique` and dispatch to `DataFrame`/`Series`/`Index` as necessary
- Add a `return_inverse` kwarg to all of them (and add it to the EA interface); under the hood by exposing the same kwarg to `duplicated` and `drop_duplicates` as well
- (something for later) Solve BUG: df.duplicated treats None as np.nan in object columns #21720 (treatment of `np.nan`/`None` in `df.duplicated` is inconsistent with the Series behaviour)
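
A rough sketch of what the first, second, third and fifth points could look like from the user's side. The helper names `series_unique` and `frame_unique` are hypothetical stand-ins and not part of pandas. The `raw` flag mirrors the kwarg floated above, with its default chosen arbitrarily here. The inverse is only worked out for `keep="first"` and leans on `groupby(...).ngroup()`, which is just one possible way to compute it, not the approach taken in #21645.

```python
import numpy as np
import pandas as pd


def series_unique(s, keep="first", raw=False):
    """Hypothetical helper: .unique as a thin wrapper around .drop_duplicates."""
    out = s.drop_duplicates(keep=keep)
    # raw=True mimics the current behaviour of returning a plain array.
    return out.to_numpy() if raw else out


def frame_unique(df, return_inverse=False):
    """Hypothetical helper: unique rows of a DataFrame (keep='first' only)."""
    uniques = df.drop_duplicates(keep="first")
    if not return_inverse:
        return uniques
    # With sort=False, groups are numbered in order of first appearance,
    # which matches the row order of drop_duplicates(keep="first").
    # Assumption: dropna=False (pandas >= 1.1) keeps rows with missing values
    # in their own group; behaviour with NaN/None is not verified here.
    inverse = (
        df.groupby(list(df.columns), sort=False, dropna=False)
        .ngroup()
        .to_numpy()
    )
    return uniques, inverse


# Illustrative data only.
s = pd.Series([3, 1, 3, 2, 1])
df = pd.DataFrame({"a": [1, 2, 1, 3, 2], "b": ["x", "y", "x", "z", "y"]})

s_last = series_unique(s, keep="last")
s_raw = series_unique(s, raw=True)
df_uniques, df_inverse = frame_unique(df, return_inverse=True)

# As with numpy's return_inverse, indexing the uniques with the inverse
# reconstructs the original rows.
reconstructed = df_uniques.iloc[df_inverse].reset_index(drop=True)
print(reconstructed.equals(df))
```

Supporting `keep="last"` together with `return_inverse` would need the inverse to point at the last occurrence of each row instead, which this sketch leaves out.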
Each point is essentially self-contained and independent of the others, but of course they make more sense together.