Skip to content

API: Expand read_csv dtype for categoricals #14503

Open
@TomAugspurger

Description

@TomAugspurger

In #13406 Chris added support for read_csv(..., dtype={'col': 'category'}) (thanks!). This issue is for expanding that syntax to allow a more complete specification of the resulting categorical.

# Your code here
df = pd.read_csv(path, dtype={'col': pd.Categorical(['a', 'b', 'c'], ordered=True})
df = pd.read_csv(path, dtype={'col': ['a', 'b', 'c']})  # shorthand, but unordered only
# we would still accept `dtype={'col': 'category'}` as well, to infer categories

Implementation-wise, I think we can keep all the parsing logic as is, and simply loop over dtype and call set_categories (and maybe as_ordered) on all the categoricals just before returning to the user.

This would help a bit in dask, where their category type inference can fail if the first partition doesn't contain all the categories (see dask/dask#1705). This is why it'd be preferable to do it as an option to read_csv, rather than putting in on the user to followup with a set_categories.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions