Description
In #39908 (comment), @jorisvandenbossche wrote:
Bringing the discussion about the "deferred" storage mode lookup for StringDtype (originally here: #39908 (comment)) into the main thread, and trying to recap.
Currently, doing pd.StringDtype() (without specifying the storage) will already look up the option. In the default case, you get:
>>> pd.StringDtype().storage
'python'
which also means that pandas_dtype() already "fully initializes" the string dtype:
>>> pd.api.types.pandas_dtype("string")
string[python]
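To make the lookup concrete, here is a minimal sketch. It assumes the option referred to above is `mode.string_storage`, and it pins that option to "python" so the result does not depend on whether pyarrow is installed:

```python
import pandas as pd

# Assumption: "the option" in the text is mode.string_storage.
# Pin it explicitly so the example behaves the same on any setup.
with pd.option_context("mode.string_storage", "python"):
    # No storage given: the constructor resolves it from the option right away.
    dtype_from_constructor = pd.StringDtype()
    # The "string" alias goes through the same constructor, so it is
    # resolved to a concrete storage at this point as well.
    dtype_from_alias = pd.api.types.pandas_dtype("string")

print(dtype_from_constructor.storage)
print(dtype_from_alias.storage)
```

Both dtypes carry a concrete storage as soon as they are created; nothing is deferred until the dtype is actually used.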
As a consequence, doing astype("string") will actually convert the values if your string dtype doesn't match the global setting:
>>> s = pd.Series(['a', 'b'], dtype=pd.StringDtype(storage="pyarrow"))
>>> s.dtype
string[pyarrow]
>>> s.astype("string").dtype
string[python]
However, I think it could make sense for .astype("string") to mean "ensure I have a string dtype", and thus not convert to another storage backend if I already had a string dtype to start with.
We do something similar for CategoricalDtype ("category" means a categorical dtype with no categories, but astype("category") does not remove your categories; it preserves any existing categorical dtype as is).
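The categorical precedent can be checked with a short example. The category values here are made up for illustration:

```python
import pandas as pd

# A categorical Series with an explicit set of categories, one of them unused.
cat_dtype = pd.CategoricalDtype(categories=["a", "b", "c"])
s_cat = pd.Series(["a", "b"], dtype=cat_dtype)

# The bare "category" alias does not reset the dtype: the existing
# categories (including the unused "c") are kept as they are.
preserved = s_cat.astype("category")

print(preserved.dtype == s_cat.dtype)
print(list(preserved.cat.categories))
```

The proposal above is for astype("string") to treat an existing string dtype the same way.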
We could still have astype("string") behave in the way I suggest by special-casing this in the astype implementations (as suggested in #39908 (comment)), but I think that's something we would ideally avoid: would any astype implementation that accepts string dtype values as input need to handle this case?
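For illustration only, the special-casing approach might look roughly like the sketch below. The helper `astype_keep_string_storage` is hypothetical and not part of pandas; it only shows the kind of check each astype implementation would have to repeat:

```python
import pandas as pd


def astype_keep_string_storage(values: pd.Series, dtype) -> pd.Series:
    """Hypothetical helper (not a pandas API) sketching the special case."""
    # Only the bare "string" alias is special-cased; an explicit
    # StringDtype(storage=...) still requests a specific storage.
    is_bare_alias = isinstance(dtype, str) and dtype == "string"
    if is_bare_alias and isinstance(values.dtype, pd.StringDtype):
        # Already some string dtype: keep its storage, just return a copy.
        return values.copy()
    # Everything else goes through the regular conversion path.
    return values.astype(dtype)


# An existing string-dtype Series keeps its dtype untouched. With pyarrow
# installed, storage="pyarrow" could be used here instead.
s_str = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="python"))
kept = astype_keep_string_storage(s_str, "string")

# A non-string Series is still converted as usual.
s_obj = pd.Series(["a", "b"], dtype=object)
converted = astype_keep_string_storage(s_obj, "string")

print(kept.dtype)
print(converted.dtype)
```

The check itself is small, but it would need to live in every code path that can receive string dtype values, which is the maintenance concern raised above.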