Change default string storage from "python" to "pyarrow" (if installed) for the NA-variant of StringDtype #60287

Open
@jorisvandenbossche

Historically, the default value for the string storage (globally configurable through pd.options.mode.string_storage) of StringDtype was "python", and users needed to explicitly ask for "pyarrow". For example:

>>> ser = pd.Series(["a", "b"], dtype="string")
>>> ser.dtype
string[python]

and this is still the behaviour on main.

For the new NaN-variant of StringDtype, however, we implemented a default string storage option of "auto", meaning "use pyarrow if installed, otherwise use python". So on a system with pyarrow installed:

>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"], dtype="str")
>>> ser.dtype.storage
'pyarrow'

Essentially we interpret the default string_storage option setting of "auto" differently for the NaN vs NA variant of the string dtype, which you can see in the code here:

if storage is None:
    if na_value is not libmissing.NA:
        storage = get_option("mode.string_storage")
        if storage == "auto":
            if HAS_PYARROW:
                storage = "pyarrow"
            else:
                storage = "python"
    else:
        storage = get_option("mode.string_storage")
        if storage == "auto":
            storage = "python"
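To make the asymmetry easier to see, here is a minimal standalone sketch of the same branching. It does not import pandas: the function name `resolve_storage_current` and its parameters are hypothetical stand-ins for `get_option("mode.string_storage")`, the `na_value` check, and `HAS_PYARROW` in the snippet above.

```python
def resolve_storage_current(option: str, is_nan_variant: bool, has_pyarrow: bool) -> str:
    """Illustrative mirror of the branching quoted above (not pandas code).

    option         -- stand-in for get_option("mode.string_storage")
    is_nan_variant -- stand-in for `na_value is not libmissing.NA`
    has_pyarrow    -- stand-in for HAS_PYARROW
    """
    if option != "auto":
        # An explicit "python" or "pyarrow" setting is used as-is by both variants.
        return option
    if is_nan_variant:
        # NaN-variant: "auto" prefers pyarrow when it is available.
        return "pyarrow" if has_pyarrow else "python"
    # NA-variant: "auto" always falls back to python.
    return "python"


# Enumerate how "auto" resolves for each combination.
for is_nan_variant in (True, False):
    for has_pyarrow in (True, False):
        variant = "NaN" if is_nan_variant else "NA"
        resolved = resolve_storage_current("auto", is_nan_variant, has_pyarrow)
        print(f"variant={variant:<3} pyarrow_installed={has_pyarrow!s:<5} -> {resolved}")
```

The only combination where the two variants disagree is "auto" with pyarrow installed, which is the case this issue is about.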


Proposal: I think it makes sense to also switch to "pyarrow" as the default string storage (if installed) for the nullable (NA-variant) StringDtype. This is somewhat of a breaking change (although mostly for the dtype object itself; behaviour-wise, string operations should show hardly any difference between the two backends), so I would keep this for 3.0 and properly document it in the whatsnew notes.
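As a rough sketch of what the proposal amounts to, "auto" would resolve the same way regardless of the NA/NaN variant. The code below is an assumption-laden illustration, not the actual pandas change: `resolve_storage_proposed` and `pyarrow_installed` are hypothetical names, and the plain import check stands in for pandas' own `HAS_PYARROW` flag, which may apply additional conditions.

```python
import importlib.util


def pyarrow_installed() -> bool:
    # Hypothetical stand-in for pandas' HAS_PYARROW: only checks that the
    # package can be found, without importing it.
    return importlib.util.find_spec("pyarrow") is not None


def resolve_storage_proposed(option: str, has_pyarrow: bool) -> str:
    """Illustrative unified resolution: both dtype variants share one rule."""
    if option == "auto":
        return "pyarrow" if has_pyarrow else "python"
    # Explicit "python" / "pyarrow" settings are still honoured unchanged.
    return option


print(resolve_storage_proposed("auto", pyarrow_installed()))
```

Under this sketch, users who want the old behaviour for the NA-variant would set the storage to "python" explicitly rather than relying on the default.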

Labels: API Design, NA - MaskedArrays (related to pd.NA and nullable extension arrays), Strings (string extension data type and string data)