Skip to content

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Open
@jorisvandenbossche

Description

@jorisvandenbossche

Overview of work for the future string dtype (PDEP-14).

Main implementation:

Testing related:

Open design questions / behaviour changes to implement:

Known bugs that need to be fixed:

Documentation:


[original issue body]

With PDEP-10 (#52711, https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html), we decided to start using pyarrow for the default string data type in pandas 3.0.

For pandas 2.1, an option was added to already enable this future default data type, and then the various ways to construct a DataFrame (type inference in the constructor, IO methods) will use the new string dtype as default:

>>> pd.options.future.infer_string = True

>>> pd.Series(["a", "b", None])
0      a
1      b
2    NaN
dtype: string

>>> pd.Series(["a", "b", None]).dtype
string[pyarrow_numpy]

This is documented at https://pandas.pydata.org/docs/dev/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings

One aspect that was discussed after the PDEP (mostly at the sprint, I think; creating this issue for a better public record of it), is that for a data type that would become the default in pandas 3.0 (which for the rest still uses all numpy dtypes with numpy NaN missing value semantics), should probably also still use the same default semantics and result in numpy data types when doing operations on the string column that result in a boolean or numeric data type (eg .str.startswith(..), .str.len(..), .str.count(), etc, or comparison operators like ==).
(this way, a user only gets an ArrowDtype column when explicitly asking for those, and not by default through using a the default string dtype)

To achieve this, @phofl has done several PRs to refactor the current pyarrow-based string dtypes, to add another variant which uses StringDtype(storage="pyarrow_numpy") instead of ArrowDtype("string"). From the updated whatsnew: "This is a new string dtype implementation that follows NumPy semantics in comparison operations and will return np.nan as the missing value indicator". Main PR:

plus some follow-ups (#54720, #54585, #54591).

cc @pandas-dev/pandas-core

Metadata

Metadata

Assignees

No one assigned

    Labels

    API DesignArrowpyarrow functionalityStringsString extension data type and string data

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions