In #13406 Chris added support for `read_csv(..., dtype={'col': 'category'})` (thanks!). This issue is for expanding that syntax to allow a more complete specification of the resulting categorical.
df = pd.read_csv(path, dtype={'col': pd.Categorical(['a', 'b', 'c'], ordered=True)})
df = pd.read_csv(path, dtype={'col': ['a', 'b', 'c']}) # shorthand, but unordered only
# we would still accept `dtype={'col': 'category'}` as well, to infer categories
Implementation-wise, I think we can keep all the parsing logic as is, and simply loop over `dtype` and call `set_categories` (and maybe `as_ordered`) on all the categoricals just before returning to the user.
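To make the idea concrete, here is a rough sketch of that post-processing step written as a user-level wrapper rather than inside the parser. The helper name `read_csv_with_categories` and its `categories` / `ordered` parameters are made up for illustration; only `read_csv(..., dtype='category')`, `.cat.set_categories` and `.cat.as_ordered` are existing pandas calls.

```python
import io

import pandas as pd


def read_csv_with_categories(source, categories, ordered=(), **kwargs):
    """Hypothetical helper, not a pandas API.

    categories: dict mapping column name -> list of allowed categories
    ordered:    collection of column names that should be ordered
    """
    # Parse as usual, letting read_csv infer the categories per column.
    dtype = {col: "category" for col in categories}
    df = pd.read_csv(source, dtype=dtype, **kwargs)

    # Just before returning, loop over the requested columns and
    # replace the inferred categories with the user-specified ones.
    for col, cats in categories.items():
        df[col] = df[col].cat.set_categories(cats)
        if col in ordered:
            df[col] = df[col].cat.as_ordered()
    return df


# The data only contains 'a' and 'b', but 'c' is still a valid category.
csv_text = "col,x\na,1\nb,2\n"
df = read_csv_with_categories(
    io.StringIO(csv_text),
    categories={"col": ["a", "b", "c"]},
    ordered={"col"},
)
```

One thing to keep in mind with this approach: values in the file that are not in the given category list become missing after `set_categories`, so the real implementation would need to decide whether that should raise instead.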
This would help a bit in dask, where their category type inference can fail if the first partition doesn't contain all the categories (see dask/dask#1705). This is why it'd be preferable to do it as an option to `read_csv`, rather than putting it on the user to follow up with a `set_categories`.
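A small illustration of the partition problem, using plain pandas on two in-memory chunks as a stand-in for dask partitions (this is not dask code, and the data is made up). Each chunk infers its own categories, so the resulting dtypes disagree until the full category list is applied to both.

```python
import io

import pandas as pd

# Two "partitions" of the same logical column, each seeing different values.
part1 = pd.read_csv(io.StringIO("col\na\nb\n"), dtype={"col": "category"})
part2 = pd.read_csv(io.StringIO("col\nb\nc\n"), dtype={"col": "category"})

# Inference is per chunk, so the categorical dtypes do not match.
dtypes_match_before = part1["col"].dtype == part2["col"].dtype

# Applying the known, complete category list makes them consistent.
full_categories = ["a", "b", "c"]
fixed1 = part1["col"].cat.set_categories(full_categories)
fixed2 = part2["col"].cat.set_categories(full_categories)
dtypes_match_after = fixed1.dtype == fixed2.dtype
```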