Description
Currently, Categorical.unique and CategoricalIndex.unique drop unused categories:
>>> categories = ['very good', 'good', 'neutral', 'bad', 'very bad']
>>> cat = pd.Categorical(['good','good', 'bad', 'bad'], categories=categories, ordered=True)
>>> cat
[good, good, bad, bad]
Categories (5, object): [very good < good < neutral < bad < very bad]
>>> cat.unique()
[good, bad]
Categories (2, object): [good < bad] # unused categories dropped
So .unique() both uniquifies the values and drops unused categories, i.e. it does two things in one operation.
Often, even if you want to uniquify the values, you still want to control whether unused categories are dropped. So Categorical.unique and CategoricalIndex.unique should IMO keep all categories, and dropping categories should be a separate action. This would be a better API:
>>> cat.unique()
[good, bad]
Categories (5, object): [very good < good < neutral < bad < very bad] # unused not dropped
If you want to drop unused categories, you should do it explicitly, like so: cat.unique().remove_unused_categories().
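As a minimal sketch of the explicit two-step pattern, reusing the example data from above (the variable names `uniques` and `trimmed` are only illustrative): whether `unique()` itself keeps the unused categories depends on the pandas version, but the chained call should give the same trimmed result either way.

```python
import pandas as pd

categories = ["very good", "good", "neutral", "bad", "very bad"]
cat = pd.Categorical(["good", "good", "bad", "bad"], categories=categories, ordered=True)

# Step 1: deduplicate the values. Whether unused categories survive this
# step depends on the pandas version, so it is not relied on here.
uniques = cat.unique()

# Step 2: explicitly drop categories that do not appear in the values.
trimmed = uniques.remove_unused_categories()

print(list(trimmed))             # ['good', 'bad']
print(list(trimmed.categories))  # ['good', 'bad']
```

Keeping the two steps separate makes the intent visible at the call site, which is the point of the proposal.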
The proposed API is also faster, as dropping unused categories requires recoding the categories/codes, which is potentially expensive.
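To illustrate what recoding means here, the sketch below looks at the integer codes behind the example categorical before and after dropping unused categories. It only shows the remapping; it does not measure performance, and the actual cost will depend on the data size.

```python
import pandas as pd

categories = ["very good", "good", "neutral", "bad", "very bad"]
cat = pd.Categorical(["good", "good", "bad", "bad"], categories=categories, ordered=True)

# Each value is stored as an integer position into `categories`.
print(cat.codes.tolist())      # [1, 1, 3, 3]

# Dropping unused categories shrinks the category list, so every code
# has to be remapped to its position in the new, shorter list.
trimmed = cat.remove_unused_categories()
print(trimmed.codes.tolist())  # [0, 0, 1, 1]
```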