Description
Categorical dtypes
xref gh-26 for some discussion on categorical dtypes.
What it looks like in different libraries
Pandas
The dtype is called category there. See the pandas.Categorical docs:
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2, 5, 1]})
>>> df["B"] = df["A"].astype("category")
>>> df.dtypes
A int64
B category
dtype: object
>>> col = df['B']
>>> col.dtype
CategoricalDtype(categories=[1, 2, 5], ordered=False)
>>> col.values.ordered
False
>>> col.values.codes
array([0, 1, 2, 0], dtype=int8)
>>> col.values.categories
Int64Index([1, 2, 5], dtype='int64')
>>> col.values.categories.values
array([1, 2, 5])
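For reference, the codes and categories shown above are enough to reconstruct the column. A minimal sketch using pandas' public Categorical.from_codes constructor, with the values copied from the output above:
>>> list(pd.Categorical.from_codes([0, 1, 2, 0], categories=[1, 2, 5], ordered=False))
[1, 2, 5, 1]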
Apache Arrow
The dtype is called _"dictionary-encoded" in Arrow - so a dataframe with a categorical dtype is called a "dictionary-encoded array" there.
See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions for details.
A practical example (from @kkraus14 in gh-38), for a categorical column of ['gold', 'bronze', 'silver', null, 'bronze', 'silver', 'gold'] with categories ['gold' < 'silver' < 'bronze']:
categorical column: {
    mask_buffer: [119],  # 01110111 in binary
    data_buffer: [0, 2, 1, 127, 2, 1, 0],  # the 127 value in here is undefined since it's null
    children: [
        string column: {
            mask_buffer: None,
            offsets_buffer: [0, 4, 10, 16],
            data_buffer: [103, 111, 108, 100, 115, 105, 108, 118, 101, 114, 98, 114, 111, 110, 122, 101]  # "goldsilverbronze" as UTF-8 bytes
        }
    ]
}
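To make these buffers concrete, here is a small self-contained Python sketch (plain Python, not Arrow code; buffer contents copied from the example above) that decodes the validity bitmap, the codes, and the child string column back into the logical values:
# Buffer contents copied from the example above
mask_buffer = bytes([119])            # validity bitmap: 0b01110111, bit-packed LSB-first
codes = [0, 2, 1, 127, 2, 1, 0]       # indices into the dictionary; slot 3 is undefined (null)
offsets = [0, 4, 10, 16]              # child string column: offsets into the data bytes
data = bytes([103, 111, 108, 100, 115, 105, 108, 118, 101, 114,
              98, 114, 111, 110, 122, 101])  # "goldsilverbronze" as UTF-8

# Decode the dictionary (the categories) from the child string column
categories = [data[offsets[i]:offsets[i + 1]].decode() for i in range(len(offsets) - 1)]
assert categories == ['gold', 'silver', 'bronze']

def is_valid(i):
    # Arrow validity bitmaps store one bit per element, least-significant bit first
    return bool((mask_buffer[i // 8] >> (i % 8)) & 1)

values = [categories[codes[i]] if is_valid(i) else None for i in range(len(codes))]
assert values == ['gold', 'bronze', 'silver', None, 'bronze', 'silver', 'gold']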
The relevant struct definitions from the Arrow C Data Interface (abridged):
struct ArrowSchema {
    // Array type description
    const char* format;
    const char* name;
    const char* metadata;
    int64_t flags;
    int64_t n_children;
    struct ArrowSchema** children;
    struct ArrowSchema* dictionary;  // the categories
    ...
};
struct ArrowArray {
    // Array data description
    int64_t length;
    int64_t null_count;
    int64_t offset;
    int64_t n_buffers;
    int64_t n_children;
    const void** buffers;
    struct ArrowArray** children;
    struct ArrowArray* dictionary;
    ...
};
Also see https://arrow.apache.org/docs/python/data.html#dictionary-arrays for what PyArrow does - it matches the current exchange protocol more closely than the Arrow C Data Interface. E.g., it uses an actual Python dictionary for the mapping of values to categories.
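As a reference point, a minimal PyArrow sketch of the same column; note that dictionary_encode() builds the dictionary in order of first appearance, so the categories come out as ['gold', 'bronze', 'silver'] rather than in the ordering used in the example above:
>>> import pyarrow as pa
>>> arr = pa.array(['gold', 'bronze', 'silver', None, 'bronze', 'silver', 'gold'])
>>> dict_arr = arr.dictionary_encode()   # a pyarrow DictionaryArray
>>> dict_arr.indices.to_pylist()         # the codes; the null stays null
[0, 1, 2, None, 1, 2, 0]
>>> dict_arr.dictionary.to_pylist()      # the categories
['gold', 'bronze', 'silver']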
Vaex
EDIT: Vaex's API shown here predates its Arrow integration; it will change to match Arrow in the future.
>>> import vaex
>>> df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
>>> df = df.categorize('year', min_value=2020, max_value=2019)
>>> df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
>>> df.dtypes
year int64
weekday int64
dtype: object
>>> df.is_category('year')
True
>>> df.is_category('weekday')
True
>>> df._categories
{'year': {'labels': [], 'N': 0, 'min_value': 2020}, 'weekday': {'labels': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], 'N': 7, 'min_value': 0}}
Other libraries
- Modin follows Pandas
- Dask follows Pandas
- Koalas does not support categorical dtypes at all
Exchange protocol
This is the current form in gh-38 for the Pandas implementation of the exchange protocol:
>>> col = df.__dataframe__().get_column_by_name('B')
>>> col
<__main__._PandasColumn object at 0x7f0202973211>
>>> col.dtype # kind, bitwidth, format-string, endianness
(23, 64, '|O08', '=')
>>> col.describe_categorical # is_ordered, is_dictionary, mapping
(False, True, {0: 1, 1: 2, 2: 5})
>>> col.describe_null # kind (2 = sentinel value), value
(2, -1)
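Purely for illustration, a hedged sketch of how a consumer could reassemble the pandas column from these pieces; the codes array is written out by hand here, while in the real protocol it would come from the column's data buffer:
import numpy as np
import pandas as pd

is_ordered, is_dictionary, mapping = False, True, {0: 1, 1: 2, 2: 5}  # describe_categorical
null_kind, sentinel = 2, -1                                           # describe_null (sentinel value)
codes = np.array([0, 1, 2, 0], dtype=np.int8)                         # contents of the data buffer

categories = [mapping[k] for k in sorted(mapping)]
result = pd.Categorical.from_codes(
    np.where(codes == sentinel, -1, codes),  # from_codes treats -1 as missing
    categories=categories,
    ordered=is_ordered,
)
assert list(result) == [1, 2, 5, 1]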
Changes needed & discussion points
What we already determined needs changing:
- Add a get_children() method, and store the mapping that is now in Column.describe_categorical in a child column instead (see the sketch below). Note that child columns are also needed for variable-length strings.
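A purely hypothetical sketch of that proposed shape; get_children() and the child layout are the proposal, not an existing API, and the names are illustrative only:
# Hypothetical sketch only -- get_children() does not exist yet
parent = df.__dataframe__().get_column_by_name('B')
(categories,) = parent.get_children()   # one child column holding the categories [1, 2, 5]
categories.dtype                        # a plain integer dtype, not a categorical one
parent.get_data_buffer()                # unchanged: still yields the buffer of codes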
To discuss:
- If dtype is the logical dtype for the column, where do we store how to interpret the actual data buffer? Right now this is done not in a static attribute, but by returning the dtype along with the buffer when accessing it:
def get_data_buffer(self) -> Tuple[_PandasBuffer, _Dtype]:
    """
    Return the buffer containing the data.
    """
    _k = _DtypeKind
    if self.dtype[0] in (_k.INT, _k.UINT, _k.FLOAT, _k.BOOL):
        buffer = _PandasBuffer(self._col.to_numpy())
        dtype = self.dtype
    elif self.dtype[0] == _k.CATEGORICAL:
        # For categorical data the buffer holds the integer codes; the returned
        # dtype describes the codes' physical type, not the logical dtype
        codes = self._col.values.codes
        buffer = _PandasBuffer(codes)
        dtype = self._dtype_from_pandasdtype(codes.dtype)
    else:
        raise NotImplementedError(f"Data type {self._col.dtype} not handled yet")
    return buffer, dtype
- What goes in the data buffer on the column? The category-encoded data (the codes) makes sense, because the buffer needs to have the same number of elements as the column; otherwise it would be inconsistent with other dtypes.
- What happens when the data is strings?