String dtype: overview of breaking behaviour changes

In the context of the new default string dtype in 3.0 (https://github.com/pandas-dev/pandas/issues/54792 / [PDEP-14](https://pandas.pydata.org/pdeps/0014-string-dtype.html)), currently enabled with `pd.options.future.infer_string = True`, there are a number of breaking changes that we will have to document.
In preparation for documenting them, I want to use this issue to list all the behaviour changes that we are aware of (or run into), and that we potentially need to discuss to decide whether we actually want those changes.

First, there are a few obvious breaking changes that are also mentioned in the PDEP (and that are the main _goals_ of the change):

- Constructors and IO methods will now infer string data as a `str` dtype, instead of using `object` dtype.
- Code that checks the dtype and assumes object dtype (e.g. `ser.dtype == object`) will break.
- The missing value sentinel is now always `NaN`, and for example no longer `None` (we still accept `None` as input, but it will be converted to `NaN`).
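
Code that has to run both with and without the new default can avoid hard-coding `object` or `None`. The following is a minimal sketch, not an official recommendation: the helper name `holds_strings` is made up for illustration, and it assumes that `pd.api.types.is_string_dtype` and `isna()` give the same answers for this input under both settings.

```python
import pandas as pd


def holds_strings(ser: pd.Series) -> bool:
    """Hypothetical helper: True if the Series holds string data.

    Meant to cover object-dtype strings as well as the string dtypes,
    instead of comparing against `object` directly.
    """
    return bool(pd.api.types.is_string_dtype(ser))


strings = pd.Series(["a", "b"])  # object dtype today, str dtype with infer_string
numbers = pd.Series([1, 2])

is_strings = holds_strings(strings)
is_numbers = holds_strings(numbers)

# Detect missing values with isna() rather than `value is None`,
# since the sentinel is NaN with the new string dtype.
with_missing = pd.Series(["a", None])
missing_mask = with_missing.isna().tolist()
```

One caveat: object-dtype data that mixes strings with missing values may not be reported as strings by this check in some pandas versions, so it is not a complete replacement for every `dtype == object` test.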

But additionally, there are some other less obvious changes or secondary consequences (or changes that we have had for a long time with the existing opt-in `string` dtype, but that will now be relevant for all users).
I am starting to list some of them here (please add comments with other examples if you think of more).

#### `astype(str)` preserving missing values (no longer converting NaN to a string "nan")

This is a long-standing "bug" (or at least generally agreed to be undesirable behaviour), as discussed in https://github.com/pandas-dev/pandas/issues/25353.
Currently something like `pd.Series(["foo", np.nan]).astype(str)` would essentially convert every element to a string, including the missing values:

```python
>>> ser = pd.Series(["foo", np.nan], dtype=object)
>>> ser
0    foo
1    NaN
dtype: object
>>> ser.astype(str)
0    foo
1    nan
dtype: object
>>> ser.astype(str).values
array(['foo', 'nan'], dtype=object)
```

Generally we expect missing values to propagate in `astype()`. As a result of making `str` an alias for the new default string dtype (https://github.com/pandas-dev/pandas/pull/59685), this will now follow a different code path and make use of the general StringDtype construction, which does preserve missing values:

```python
>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["foo", np.nan], dtype=object)
>>> ser.astype(str)
0    foo
1    NaN
dtype: str
>>> ser.astype(str).values
<StringArrayNumpySemantics>
['foo', nan]
Length: 2, dtype: str
```
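
Code that depends on one of the two outcomes can make that choice explicit. This is a sketch under the assumption that `fillna()` and `where()` behave as usual on both the object-dtype and the string-dtype result; it is meant to give the same values whichever behaviour is active.

```python
import numpy as np
import pandas as pd

ser = pd.Series(["foo", np.nan], dtype=object)

# Explicitly ask for the literal "nan" string.
# Old behaviour: astype(str) already yields "nan", so fillna() is a no-op.
# New behaviour: astype(str) keeps the missing value, and fillna() replaces it.
as_text = ser.astype(str).fillna("nan")
values = as_text.tolist()

# Explicitly keep missing values missing: mask the converted result
# wherever the original value was missing.
preserved = ser.astype(str).where(ser.notna())
```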


#### Mixed dtype operations

Any working code that previously relied on object dtype allowing mixed types is affected when the initial data is now inferred as string dtype. Because the string dtype is strict about only allowing strings, certain workflows will no longer work (unless users explicitly ensure that they keep using object dtype).

For example, setitem with a non-string value:

```python
>>> ser = pd.Series(["a", "b"], dtype="object")
>>> ser[0] = 1
>>> ser
0    1
1    b
dtype: object

>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"])
>>> ser[0] = 1
...
TypeError: Scalar must be NA or str
```

~The same happens if you try to fill a column of strings and missing values with a non-string:~

```python
>>> pd.Series(["a", None]).fillna(0)
...
TypeError: Invalid value '0' for dtype string
```

Update: the above is kept working with upcasting to object dtype (see https://github.com/pandas-dev/pandas/pull/60296)
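
For workflows that need mixed types, keeping object dtype explicitly should continue to work. A minimal sketch of the two obvious routes, either passing `dtype=object` at construction or converting with `astype(object)` before assigning (the variable names are only illustrative):

```python
import pandas as pd

# Explicit object dtype keeps mixed-type assignment working.
ser = pd.Series(["a", "b"], dtype=object)
ser[0] = 1
mixed = ser.tolist()

# Data whose dtype was inferred can be converted before the assignment.
inferred = pd.Series(["a", "b"])  # object or str, depending on the option
as_object = inferred.astype(object)
as_object[1] = 2.5
```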

#### Numeric aggregations

With object dtype strings, we do allow `sum` and `prod` in certain cases:

```python
>>> pd.Series(["a", "b"], dtype="object").sum()
'ab'
>>> pd.Series(["a", "b"], dtype="string").sum()
...
TypeError: Cannot perform reduction 'sum' with string dtype

# prod only works in the case of a single string (other elements can be missing values)
>>> pd.Series(["a"], dtype="object").prod()
'a'
>>> pd.Series(["a"], dtype="string").prod()
...
TypeError: Cannot perform reduction 'prod' with string dtype
```

Based on the discussion below, we decided to keep `sum()` working (https://github.com/pandas-dev/pandas/pull/59853 is adding that functionality to string dtype), but `prod()` is fine to start raising.

Note: due to a limitation in the pyarrow implementation, the result of the sum is limited to 2GB, see https://github.com/pandas-dev/pandas/pull/59853/files#r1794090618 (given that this is the size of a single Python string, that seems very unlikely to happen).

For `any()`/`all()` (which does work for object dtype, but didn't for the already existing StringDtype), we decided to keep this working for now for the new default string dtype, see https://github.com/pandas-dev/pandas/issues/51939, https://github.com/pandas-dev/pandas/pull/54591
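
Code that used `sum()` only to concatenate strings can also spell out the concatenation, which does not depend on how the reduction is implemented for a given dtype. A small sketch using plain `str.join` and `Series.str.cat()`, on the assumption that both skip or omit missing values as shown:

```python
import pandas as pd

ser = pd.Series(["a", None, "b"])

# str.join concatenates explicitly; dropna() removes missing values first.
joined = "".join(ser.dropna())

# Series.str.cat() without arguments concatenates all strings in the Series
# and omits missing values when na_rep is not given.
concatenated = ser.str.cat()
```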


#### Invalid unicode input

```python
>>> pd.options.future.infer_string = False
>>> pd.Series(["\ud83d"])
0    \ud83d
dtype: object

>>> pd.options.future.infer_string = True
>>> pd.Series(["\ud83d"])
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
```

Users that want to keep the previous behaviour can explicitly specify `dtype=object` to keep working with object dtype.
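
When the input may contain lone surrogates, one option is to check whether the strings can be encoded as UTF-8 and only fall back to object dtype if they cannot. The helpers `is_utf8_encodable` and `make_series` below are hypothetical names for illustration, and the sketch assumes that UTF-8 encodability is the relevant condition, as the error message above suggests.

```python
import pandas as pd


def is_utf8_encodable(value: str) -> bool:
    """Hypothetical helper: False for strings containing lone surrogates."""
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def make_series(values):
    """Hypothetical helper: fall back to object dtype for invalid unicode."""
    clean = all(is_utf8_encodable(v) for v in values if isinstance(v, str))
    # dtype=None leaves the dtype inference to pandas.
    return pd.Series(values, dtype=None if clean else object)


bad = make_series(["\ud83d", "ok"])    # forced to object dtype
good = make_series(["foo", "bar"])     # dtype inferred as usual
```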
