API: Expand read_csv dtype for categoricals #14503

In https://github.com/pandas-dev/pandas/pull/13406 Chris added support for `read_csv(..., dtype={'col': 'category'})` (thanks!). This issue is for expanding that syntax to allow a more complete specification of the resulting categorical.

``` python
import pandas as pd

# proposed syntax: pass the full categorical specification through `dtype`
df = pd.read_csv(path, dtype={'col': pd.Categorical(['a', 'b', 'c'], ordered=True)})
df = pd.read_csv(path, dtype={'col': ['a', 'b', 'c']})  # shorthand, but unordered only
# we would still accept `dtype={'col': 'category'}` as well, to infer categories
```

Implementation-wise, I think we can keep all the parsing logic as is, and simply loop over `dtype` and call `set_categories` (and maybe `as_ordered`) on all the categoricals just before returning to the user.
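To make that post-processing idea concrete, here is a rough sketch of what the final step could look like. The helper name `apply_categorical_specs` and its `specs` argument are hypothetical and only for illustration; they are not part of pandas. The sketch assumes parsing has already produced inferred categoricals via `dtype='category'`.

``` python
import io

import pandas as pd


def apply_categorical_specs(df, specs):
    """Hypothetical helper: coerce already-parsed categorical columns.

    `specs` maps a column name to either a list of categories (unordered)
    or a pd.Categorical whose categories and orderedness are copied.
    """
    for col, spec in specs.items():
        if isinstance(spec, pd.Categorical):
            categories, ordered = spec.categories, spec.ordered
        else:
            categories, ordered = list(spec), False
        # Values not listed in `categories` become NaN, as with any
        # set_categories call.
        result = df[col].cat.set_categories(categories)
        df[col] = result.cat.as_ordered() if ordered else result
    return df


csv_data = io.StringIO("col\na\nb\na\n")
# Parsing is unchanged: categories are inferred from the data.
df = pd.read_csv(csv_data, dtype={"col": "category"})
df = apply_categorical_specs(
    df, {"col": pd.Categorical(["a", "b", "c"], ordered=True)}
)
print(df["col"].cat.categories.tolist())
print(df["col"].cat.ordered)
```

One open question this sketch glosses over is what should happen to values in the file that are not in the requested categories; here they silently become missing.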

This would help a bit in dask, where their category type inference can fail if the first partition doesn't contain all the categories (see https://github.com/dask/dask/issues/1705). This is why it'd be preferable to do it as an option to `read_csv`, rather than putting it on the user to follow up with a `set_categories` call.
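A small illustration of the partition problem, using plain pandas rather than dask: the two in-memory "partitions" below are made-up data, and the follow-up `set_categories` calls are the manual workaround the proposed option would make unnecessary.

``` python
import io

import pandas as pd

# Two hypothetical partitions of one dataset; the first never sees 'c'.
part1 = pd.read_csv(io.StringIO("col\na\nb\n"), dtype={"col": "category"})
part2 = pd.read_csv(io.StringIO("col\nb\nc\n"), dtype={"col": "category"})

# Categories are inferred per partition, so the resulting dtypes disagree.
same_before = part1["col"].dtype == part2["col"].dtype
print(same_before)

# Manual follow-up: set the full category list on every partition.
known = ["a", "b", "c"]
part1["col"] = part1["col"].cat.set_categories(known)
part2["col"] = part2["col"].cat.set_categories(known)

same_after = part1["col"].dtype == part2["col"].dtype
print(same_after)
```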

