Description
In context of the new default string dtype in 3.0 (#54792 / PDEP-14), currently enabled with pd.options.future.infer_string = True, there are a bunch of breaking changes that we will have to document.
In preparation of documenting, I want to use this issue to list all the behaviour changes that we are aware of (or run into) / potentially need to discuss if we actually want those changes.
First, there are a few obvious breaking changes that are also mentioned in the PDEP (and that are the main goals of the change):
- Constructors and IO methods will now infer string data as a str dtype, instead of using object dtype.
- Code checking for the dtype (e.g. ser.dtype == object) assuming object dtype, will break
- The missing value sentinel is now always NaN, and for example no longer None (we still accept None as input, but it will be converted to NaN)
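A minimal sketch of checks that do not depend on which dtype was inferred, assuming pd.api.types.is_string_dtype and Series.isna give the same answers for object-dtype strings and for the new str dtype. The helper name holds_strings is hypothetical and only used for illustration.

```python
import pandas as pd


def holds_strings(ser: pd.Series) -> bool:
    """Return True for string data, whether stored as object or str dtype."""
    # Replaces a check such as `ser.dtype == object`.
    return pd.api.types.is_string_dtype(ser)


strings = pd.Series(["a", "b"])   # object dtype today, str dtype with infer_string
numbers = pd.Series([1, 2])
print(holds_strings(strings), holds_strings(numbers))

# isna() treats None and NaN alike, so the check does not depend on
# which missing value sentinel the dtype stores.
with_missing = pd.Series(["a", None])
missing_mask = with_missing.isna().tolist()
print(missing_mask)
```

Code written this way should not need to change when the option is switched on, since it never compares against object or None directly.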
But additionally, there are some other less obvious changes or secondary consequences (or changes we already had a long time with the existing opt-in string dtype but will now be relevant for all).
Starting to list some of them here (and please add comments with other examples if you think of more).
astype(str) preserving missing values (no longer converting NaN to a string "nan")
This is a long standing "bug" (or at least generally agreed undesirable behaviour), as discussed in #25353.
Currently something like pd.Series(["foo", np.nan]).astype(str) would essentially convert every element to a string, including the missing values:
>>> ser = pd.Series(["foo", np.nan], dtype=object)
>>> ser
0 foo
1 NaN
dtype: object
>>> ser.astype(str)
0 foo
1 nan
dtype: object
>>> ser.astype(str).values
array(['foo', 'nan'], dtype=object)
Generally we expect missing values to propagate in astype(). And as a result of making str an alias for the new default string dtype (#59685), this will now follow a different code path and make use of the general StringDtype construction, which does preserve missing values:
>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["foo", np.nan], dtype=object)
>>> ser.astype(str)
0 foo
1 NaN
dtype: str
>>> ser.astype(str).values
<StringArrayNumpySemantics>
['foo', nan]
Length: 2, dtype: str
Because
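For code that has to behave the same before and after the change, one possible approach (a sketch, not something decided in this issue) is to spell out the intent instead of relying on astype(str). Here Series.map(str) is used to stringify everything, and Series.where to convert only the non-missing elements.

```python
import numpy as np
import pandas as pd

ser = pd.Series(["foo", np.nan], dtype=object)

# Explicitly stringify every element, missing values included ("nan").
stringified = ser.map(str)

# Convert only the non-missing elements: where() keeps the original
# (missing) value wherever isna() is True and takes the converted one elsewhere.
preserved = ser.where(ser.isna(), ser.astype(str))

print(stringified.tolist())
print(preserved.isna().tolist())
```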
Mixed dtype operations
Any working code that previously relied on the object dtype allowing mixed types can break where the initial data is now inferred as string dtype. Because the string dtype is strict about only allowing strings, certain workflows will no longer work (unless users explicitly keep using object dtype).
For example, setitem with a non string:
>>> ser = pd.Series(["a", "b"], dtype="object")
>>> ser[0] = 1
>>> ser
0 1
1 b
dtype: object
>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"])
>>> ser[0] = 1
...
TypeError: Scalar must be NA or str
The same happens if you try to fill a column of strings and missing values with a non-string:
>>> pd.Series(["a", None]).fillna(0)
...
TypeError: Invalid value '0' for dtype string
Update: the above is kept working with upcasting to object dtype (see #60296)
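As a sketch of the explicit opt-out mentioned above: passing dtype=object (or converting with astype(object)) keeps the mixed-type workflows available regardless of the infer_string setting.

```python
import pandas as pd

# Explicit object dtype: the data is never inferred as string dtype.
ser = pd.Series(["a", "b"], dtype=object)
ser[0] = 1                      # mixed int/str values are allowed
print(ser.tolist())

# The same applies when filling missing values with a non-string.
filled = pd.Series(["a", None], dtype=object).fillna(0)
print(filled.tolist())

# Data that was already inferred can be converted back explicitly.
as_object = pd.Series(["a", "b"]).astype(object)
print(as_object.dtype)
```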
Numeric aggregations
With object dtype strings, we do allow sum and prod in certain cases:
>>> pd.Series(["a", "b"], dtype="object").sum()
'ab'
>>> pd.Series(["a", "b"], dtype="string").sum()
...
TypeError: Cannot perform reduction 'sum' with string dtype
# prod only in case of 1 string (can be other missing values)
>>> pd.Series(["a"], dtype="object").prod()
'a'
>>> pd.Series(["a"], dtype="string").prod()
...
TypeError: Cannot perform reduction 'prod' with string dtype
Based on the discussion below, we decided to keep sum() working (#59853 is adding that functionality to string dtype), but prod() is fine to start raising.
Note: due to pyarrow implementation limitation, the sum is limited to 2GB result, see https://github.com/pandas-dev/pandas/pull/59853/files#r1794090618 (given this is about the size of a single Python string, that seems very unlikely to happen)
For any()/all() (which do work for object dtype, but didn't for the already existing StringDtype), we decided to keep this working for now for the new default string dtype, see #51939, #54591
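For code that must run on versions where sum() is not (yet) available for a string dtype, a plain-Python concatenation is one alternative. This is only a sketch using str.join, not the approach taken in #59853.

```python
import pandas as pd

ser = pd.Series(["a", "b", None])

# Drop missing values first, then concatenate the remaining strings.
# This does not depend on which dtype the Series was inferred as.
joined = "".join(ser.dropna())
print(joined)
```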
Invalid unicode input
>>> pd.options.future.infer_string = False
>>> pd.Series(["\ud83d"])
0 \ud83d
dtype: object
>>> pd.options.future.infer_string = True
>>> pd.Series(["\ud83d"])
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Users that want to keep the previous behaviour can explicitly specify dtype=object to keep working with object dtype.
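A sketch of how such input could be detected up front, falling back to object dtype only when needed. The helper is_valid_utf8 is a hypothetical name, and the check relies on str.encode raising UnicodeEncodeError for lone surrogates.

```python
import pandas as pd


def is_valid_utf8(value: str) -> bool:
    """Return False for strings that cannot be encoded as UTF-8."""
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


values = ["ok", "\ud83d"]  # the second entry is a lone surrogate

# Let pandas infer the dtype only when every string is encodable;
# otherwise request object dtype explicitly.
dtype = None if all(is_valid_utf8(v) for v in values) else object
ser = pd.Series(values, dtype=dtype)
print(ser.dtype)
```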