
API: Categorical.unique() should not drop unused categories #21648

Closed
@topper-123

Description

Currently, Categorical.unique and CategoricalIndex.unique drop unused categories:

>>> import pandas as pd
>>> categories = ['very good', 'good', 'neutral', 'bad', 'very bad']
>>> cat = pd.Categorical(['good', 'good', 'bad', 'bad'], categories=categories, ordered=True)
>>> cat
[good, good, bad, bad]
Categories (5, object): [very good < good < neutral < bad < very bad]
>>> cat.unique()
[good, bad]
Categories (2, object): [good < bad]  # unused categories dropped

So .unique() does two things in one operation: it deduplicates the values and it drops unused categories.

Often, even when you want unique values, you still want to control whether unused categories are dropped. So Categorical.unique and CategoricalIndex.unique should IMO keep all categories, and dropping categories should be a separate, explicit step. This would be a better API:

>>> cat.unique()
[good, bad]
Categories (5, object): [very good < good < neutral < bad < very bad]    # unused not dropped

If you want to drop unused categories, you should do it explicitly like so: cat.unique().remove_unused_categories().
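
A minimal sketch of the explicit two-step form, reusing the example data from above. The variable names `uniques` and `trimmed` are only illustrative. The final result should be the same whether or not `unique()` itself keeps unused categories in the installed pandas version, because `remove_unused_categories()` does nothing when there is nothing left to drop.

```python
import pandas as pd

categories = ['very good', 'good', 'neutral', 'bad', 'very bad']
cat = pd.Categorical(['good', 'good', 'bad', 'bad'],
                     categories=categories, ordered=True)

# Step 1: deduplicate the values.
uniques = cat.unique()

# Step 2: drop unused categories as a separate, explicit action.
trimmed = uniques.remove_unused_categories()

print(list(trimmed))             # ['good', 'bad']
print(list(trimmed.categories))  # ['good', 'bad']
print(trimmed.ordered)           # True
```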

The proposed API is also faster, as dropping unused categories requires recoding the categories/codes, which is potentially expensive.
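
To illustrate the recoding mentioned here, a small sketch (same example data, illustrative variable names) comparing the integer codes before and after unused categories are removed. Each code is a position in the categories list, so shrinking that list means every code has to be rewritten.

```python
import pandas as pd

categories = ['very good', 'good', 'neutral', 'bad', 'very bad']
cat = pd.Categorical(['good', 'good', 'bad', 'bad'],
                     categories=categories, ordered=True)

# Codes index into the full five-entry category list.
original_codes = cat.codes.tolist()

# After dropping unused categories, the codes index into ['good', 'bad'].
recoded = cat.remove_unused_categories()
recoded_codes = recoded.codes.tolist()

print(original_codes)  # [1, 1, 3, 3]
print(recoded_codes)   # [0, 0, 1, 1]
```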
