Description
Historically, the default value for the string storage (globally configurable through pd.options.mode.string_storage) of StringDtype was "python", and users needed to explicitly ask for "pyarrow". For example:
>>> ser = pd.Series(["a", "b"], dtype="string")
>>> ser.dtype
string[python]
and this is still the behaviour on main.
For the new NaN-variant of StringDtype, however, we implemented the default string storage option "auto", meaning "use pyarrow if installed, otherwise use python". So on a system with pyarrow installed:
>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"], dtype="str")
>>> ser.dtype.storage
'pyarrow'
Essentially we interpret the default string_storage option setting of "auto" differently for the NaN vs NA variant of the string dtype, which you can see in the code here:
pandas/pandas/core/arrays/string_.py, lines 152 to 163 in 5f23ace
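The referenced lines are not reproduced here. As a rough model of the behaviour described above (not the actual pandas code), the resolution of the storage option could be sketched like this. The function name, its parameters, and the `proposed` flag are all hypothetical and exist only for illustration:

```python
from importlib.util import find_spec
from typing import Optional


def resolve_string_storage(
    option: str,
    nan_variant: bool,
    pyarrow_installed: Optional[bool] = None,
    proposed: bool = False,
) -> str:
    """Hypothetical model of how the string_storage option is interpreted.

    option: value of the string_storage option ("auto", "python" or "pyarrow").
    nan_variant: True for the NaN variant, False for the NA (nullable) variant.
    pyarrow_installed: override for testing; detected when left as None.
    proposed: if True, apply the proposal (NA variant also prefers pyarrow).
    """
    if option != "auto":
        # An explicit choice is always honoured.
        return option
    if pyarrow_installed is None:
        pyarrow_installed = find_spec("pyarrow") is not None
    if nan_variant or proposed:
        # "use pyarrow if installed, otherwise use python"
        return "pyarrow" if pyarrow_installed else "python"
    # Behaviour on main for the NA variant: "auto" still means python.
    return "python"
```

With `proposed=True` both variants resolve "auto" the same way, which is the change suggested below.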
Proposal: I think it makes sense to also switch to "pyarrow" as the default string storage (if installed) for the nullable StringDtype. This is somewhat of a breaking change (although mostly for the dtype object itself; behaviour-wise, string operations should hardly differ between the two backends), so I would keep this for 3.0 and document it properly in the whatsnew notes.
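If the default does change, code that depends on the python-backed storage could pin it explicitly. A minimal sketch, assuming pandas is installed; the `storage` keyword of `StringDtype` and the `mode.string_storage` option are existing pandas API, while the variable names are only illustrative:

```python
import pandas as pd

# Pin the storage on the dtype itself, independent of the global default.
pinned = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="python"))
pinned_storage = pinned.dtype.storage

# Or pin it for a block of code through the global option.
with pd.option_context("mode.string_storage", "python"):
    scoped = pd.Series(["a", "b"], dtype="string")
scoped_storage = scoped.dtype.storage
```

Either form states the storage explicitly, so the result should not depend on what "auto" resolves to.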