Skip to content

API/ENH: unprotected Categorical #13506

Closed
@adbull

Description

@adbull

xref #8640 #12699 #13361 #13410

There's been discussion of a few overlapping uses of Categorical:

  1. as 'true' categorical data with a known set of values
  2. as 'lazy' categorical data which adds new categories as needed
  3. as an interned string data type, with no particular categorical interpretation

Option 1 is currently well-supported. Option 2 can be achieved explicitly by union_categorical, see #13361 #13410, but will not happen automatically e.g. when setting values or concatenating, see #12699. Option 3 is similar to 2, and has been discussed as a new String type, see #8640, but may have different semantics to 2.

While I completely agree that option 1 should be the default, I'd like to see more support for option 2 if possible; there are cases where I really do want to work with categorical data, but I don't yet know what the categories are, e.g. when concatenating a table together from several different source files.

One option would be to mimic Matlab's 'protected' flag for categorical data. By default, a Categorical would be created protected, and throw errors when performing actions which implicitly change the category set. However, the user could choose to declare a Categorical as unprotected, in which case these operations would be performed as intended.

While this behaviour could be achieved with the proposed String type, it's unclear whether that type would share the Categorical API, or be efficiently convertible to Categorical. Having the behaviour as part of Categorical would allow the user to build up a Categorical iteratively, from multiple sources, and then quickly mark it protected once the category set is known.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions