String dtype: overview of breaking behaviour changes #59328

Open
@jorisvandenbossche

In the context of the new default string dtype in 3.0 (#54792 / PDEP-14), currently enabled with pd.options.future.infer_string = True, there are a number of breaking changes that we will have to document.
In preparation for documenting them, I want to use this issue to list all the behaviour changes that we are aware of (or run into), and to flag those where we may need to discuss whether we actually want the change.

First, there are a few obvious breaking changes that are also mentioned in the PDEP (and that are the main goals of the change):

  • Constructors and IO methods will now infer string data as a str dtype, instead of using object dtype.
  • Code that checks the dtype and assumes object dtype (e.g. ser.dtype == object) will break.
  • The missing value sentinel is now always NaN and no longer, for example, None (we still accept None as input, but it is converted to NaN).
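Code that has to run both with and without the new default can avoid hard-coding object dtype or a particular missing value sentinel. The following is a minimal sketch, not an official recommendation: holds_strings and count_missing are hypothetical helpers (not part of pandas), and the sketch assumes the new default dtype is an instance of pd.StringDtype, as PDEP-14 describes.

```python
import pandas as pd


def holds_strings(ser: pd.Series) -> bool:
    """Hypothetical helper: True for object dtype or any string dtype."""
    return ser.dtype == object or isinstance(ser.dtype, pd.StringDtype)


def count_missing(ser: pd.Series) -> int:
    """Count missing values without assuming the sentinel is None or NaN."""
    return int(pd.isna(ser).sum())


as_object = pd.Series(["a", None], dtype=object)
as_string = pd.Series(["a", None], dtype="string")

print(holds_strings(as_object), holds_strings(as_string))
print(count_missing(as_object), count_missing(as_string))
```

Using pd.isna() rather than an identity check such as `value is None` gives the same answer whichever sentinel the column stores.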

But additionally, there are some other less obvious changes or secondary consequences (or changes that have long existed in the opt-in string dtype but will now become relevant for everyone).
I am starting to list some of them here (please add comments with other examples if you think of more).

astype(str) preserving missing values (no longer converting NaN to a string "nan")

This is a long-standing "bug" (or at least generally agreed undesirable behaviour), as discussed in #25353.
Currently something like pd.Series(["foo", np.nan]).astype(str) would essentially convert every element to a string, including the missing values:

>>> ser = pd.Series(["foo", np.nan], dtype=object)
>>> ser
0    foo
1    NaN
dtype: object
>>> ser.astype(str)
0    foo
1    nan
dtype: object
>>> ser.astype(str).values
array(['foo', 'nan'], dtype=object)

Generally we expect missing values to propagate in astype(). As a result of making str an alias for the new default string dtype (#59685), astype(str) will now follow a different code path and make use of the general StringDtype construction, which does preserve missing values:

>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["foo", np.nan], dtype=object)
>>> ser.astype(str)
0    foo
1    NaN
dtype: str
>>> ser.astype(str).values
<StringArrayNumpySemantics>
['foo', nan]
Length: 2, dtype: str
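Code that depends on one specific outcome can spell it out instead of relying on what astype(str) does in a given version. This is a sketch under the assumption that the input is an object-dtype Series; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

ser = pd.Series(["foo", np.nan], dtype=object)

# Stringify every element, including missing values
# (this reproduces the old astype(str) result explicitly).
all_strings = ser.map(str)

# Stringify only the non-missing elements and keep the missing ones missing.
keep_missing = ser.where(ser.isna(), ser.map(str))

print(all_strings.tolist())
print(keep_missing.isna().tolist())
```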

Because

Mixed dtype operations

Any working code that previously relied on object dtype allowing mixed types can break where the initial data is now inferred as string dtype. Because the string dtype is strict about only allowing strings, certain workflows will no longer work (unless users explicitly ensure they keep using object dtype).

For example, setitem with a non-string:

>>> ser = pd.Series(["a", "b"], dtype="object")
>>> ser[0] = 1
>>> ser
0    1
1    b
dtype: object

>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"])
>>> ser[0] = 1
...
TypeError: Scalar must be NA or str

The same happens if you try to fill a column of strings and missing values with a non-string:

>>> pd.Series(["a", None]).fillna(0)
...
TypeError: Invalid value '0' for dtype string

Update: the above is kept working with upcasting to object dtype (see #60296)
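Code that intentionally mixes types in one column can state that intent by requesting object dtype up front. A minimal sketch of that opt-out, which should behave the same whether or not infer_string is enabled:

```python
import pandas as pd

# Explicit object dtype opts out of string inference,
# so non-string values remain allowed.
ser = pd.Series(["a", "b"], dtype=object)
ser[0] = 1

# Filling missing values with a non-string also works on object dtype.
filled = pd.Series(["a", None], dtype=object).fillna(0)

print(ser.tolist())
print(filled.tolist())
```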

Numeric aggregations

With object dtype strings, we do allow sum and prod in certain cases:

>>> pd.Series(["a", "b"], dtype="object").sum()
'ab'
>>> pd.Series(["a", "b"], dtype="string").sum()
...
TypeError: Cannot perform reduction 'sum' with string dtype

# prod only works when there is a single string (the other values may be missing)
>>> pd.Series(["a"], dtype="object").prod()
'a'
>>> pd.Series(["a"], dtype="string").prod()
...
TypeError: Cannot perform reduction 'prod' with string dtype

Based on the discussion below, we decided to keep sum() working (#59853 is adding that functionality to string dtype), but prod() is fine to start raising.

Note: due to a pyarrow implementation limitation, the result of sum() is limited to 2GB, see https://github.com/pandas-dev/pandas/pull/59853/files#r1794090618 (given that this is the size of a single Python string, hitting the limit seems very unlikely).

For any()/all() (which work for object dtype, but did not for the already existing StringDtype), we decided to keep these working for now with the new default string dtype; see #51939, #54591.
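Where sum() was only ever used to concatenate strings, the concatenation can also be written without a reduction, which sidesteps the question of which dtypes support sum(). A sketch with illustrative variable names; both forms skip missing values:

```python
import pandas as pd

ser = pd.Series(["a", None, "b"], dtype=object)

# Plain Python join over the non-missing values.
joined = "".join(ser.dropna())

# Series.str.cat() with no arguments concatenates the whole Series
# and omits missing values when na_rep is not given.
joined_cat = ser.str.cat()

print(joined, joined_cat)
```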

Invalid unicode input

>>> pd.options.future.infer_string = False
>>> pd.Series(["\ud83d"])
0    \ud83d
dtype: object

>>> pd.options.future.infer_string = True
>>> pd.Series(["\ud83d"])
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

Users that want to keep the previous behaviour can explicitly specify dtype=object to keep working with object dtype.
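The error message above indicates that the failure comes from encoding the string as UTF-8, so input can be checked for that before choosing a dtype. A sketch of one possible guard: is_utf8_encodable is a hypothetical helper, not a pandas function.

```python
import pandas as pd


def is_utf8_encodable(value: str) -> bool:
    """Hypothetical helper: False for strings such as lone surrogates."""
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


values = ["ok", "\ud83d"]  # the second item is a lone surrogate

# Fall back to object dtype only when some string cannot be encoded as UTF-8;
# dtype=None leaves the choice to pandas' normal inference.
dtype = None if all(is_utf8_encodable(v) for v in values) else object
ser = pd.Series(values, dtype=dtype)

print(ser.dtype)
```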
