Description
xref #8640 #12699 #13361 #13410
There's been discussion of a few overlapping uses of Categorical
:
- as 'true' categorical data with a known set of values
- as 'lazy' categorical data which adds new categories as needed
- as an interned string data type, with no particular categorical interpretation
Option 1 is currently well-supported. Option 2 can be achieved explicitly by union_categorical
, see #13361 #13410, but will not happen automatically e.g. when setting values or concatenating, see #12699. Option 3 is similar to 2, and has been discussed as a new String
type, see #8640, but may have different semantics to 2.
While I completely agree that option 1 should be the default, I'd like to see more support for option 2 if possible; there are cases where I really do want to work with categorical data, but I don't yet know what the categories are, e.g. when concatenating a table together from several different source files.
One option would be to mimic Matlab's 'protected' flag for categorical data. By default, a Categorical
would be created protected, and throw errors when performing actions which implicitly change the category set. However, the user could choose to declare a Categorical
as unprotected, in which case these operations would be performed as intended.
While this behaviour could be achieved with the proposed String
type, it's unclear whether that type would share the Categorical
API, or be efficiently convertible to Categorical
. Having the behaviour as part of Categorical
would allow the user to build up a Categorical
iteratively, from multiple sources, and then quickly mark it protected once the category set is known.