API: string dtype propagation of NaNs in predicate methods (e.g. .str.startswith) #54805

From https://github.com/pandas-dev/pandas/pull/54533#discussion_r1299741580 (and relevant for new String dtype https://github.com/pandas-dev/pandas/issues/54792)

Currently, when a string column contains missing values, the string methods that return a boolean series (such as `.str.startswith(..)`) preserve the NaN or None values, and the result is an object-dtype series containing a mix of True/False and NaN/None. This is true for the current default of storing strings in object dtype, but also for the `string[pyarrow_numpy]` variant of StringDtype:

```
>>> pd.Series(["a", "b", None], dtype="object").str.startswith("a")
0     True
1    False
2     None
dtype: object

>>> pd.Series(["a", "b", None], dtype="string[pyarrow_numpy]").str.startswith("a")
0     True
1    False
2      NaN
dtype: object
```

Missing values are also propagated when using the "nullable" version of StringDtype (with string_storage of "python" or "pyarrow") or ArrowDtype("string"), but there the result has a nullable boolean dtype instead of object dtype:

```
>>> pd.Series(["a", "b", None], dtype="string[python]").str.startswith("a")
0     True
1    False
2     <NA>
dtype: boolean
```

Here, propagating the missing value makes sense and doesn't pose any usability problems, because the resulting boolean dtype is also nullable.
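To illustrate why the nullable case is not a problem in practice, here is a minimal sketch (the variable names are just for illustration). It relies on boolean indexing treating `<NA>` in a nullable boolean mask as False:

```python
import pandas as pd

ser = pd.Series(["a", "b", None], dtype="string[python]")

# The predicate returns the nullable "boolean" dtype; the missing
# position is propagated as <NA> rather than forcing object dtype.
mask = ser.str.startswith("a")

# Boolean indexing accepts a nullable boolean mask and treats
# <NA> as False, so no explicit fillna is needed here.
selected = ser[mask]
```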

But in the first two examples, where the result would otherwise have the numpy bool dtype, we fall back to object dtype when missing values are present. This causes usability issues. For example, you can't use that result for boolean indexing:

```
>>> ser = pd.Series(["a", "b", None], dtype="string[pyarrow_numpy]")
>>> ser[ser.str.startswith("a")]
...
ValueError: Cannot mask with non-boolean array containing NA / NaN values
```
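For reference, the existing `na` keyword of `.str.startswith` already lets users opt in to a bool result on a per-call basis. A minimal sketch, using object dtype so that it does not depend on pyarrow being installed:

```python
import pandas as pd

ser = pd.Series(["a", "b", None], dtype="object")

# Passing na=False fills the missing positions with False, so the
# result is a plain numpy bool mask instead of an object-dtype series.
mask = ser.str.startswith("a", na=False)

# A bool mask without missing values can be used for indexing directly.
selected = ser[mask]
```

The question in this issue is whether this should become the default for the new string dtype, rather than something each call has to request.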

I know this has been long-standing behaviour for the object-dtype way of using strings (and I can't remember getting many complaints about it?). But with the switch to the new default string dtype in 3.0, I think we _do_ have a chance to make this easier to work with, and to ensure those methods always return a bool dtype (by using `False` instead of propagating `NaN`).  
On the other hand, for the nullable versions of the string dtype we probably want to keep the propagating behaviour, which would introduce a new inconsistency between the different string storage types (although this kind of inconsistency already exists in other cases, such as comparison operators like `==` propagating NA vs giving False).
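The existing `==` inconsistency mentioned above can be seen in a small sketch like the following (object dtype stands in for the NaN-based variants here, so that pyarrow is not required):

```python
import pandas as pd

obj = pd.Series(["a", "b", None], dtype="object")
nullable = pd.Series(["a", "b", None], dtype="string[python]")

# Object dtype: the missing value simply compares as False,
# and the result is a numpy bool series.
eq_obj = obj == "a"

# Nullable StringDtype: the missing value propagates as <NA>,
# and the result has the nullable "boolean" dtype.
eq_nullable = nullable == "a"
```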