API: string dtype propagation of NaNs in predicate methods (eg .str.startswith) #54805

Closed
@jorisvandenbossche

Description

From #54533 (comment) (and relevant for new String dtype #54792)

Currently, when a string column contains missing values and you call one of the string methods that return a boolean series (such as .str.startswith(..)), the NaN or None values are preserved, and the result is an object-dtype series containing a mix of True/False and NaN/None. This is true for the current default object dtype with strings, but also for the new StringDtype variant "string[pyarrow_numpy]":

>>> import pandas as pd
>>> pd.Series(["a", "b", None], dtype="object").str.startswith("a")
0     True
1    False
2     None
dtype: object

>>> pd.Series(["a", "b", None], dtype="string[pyarrow_numpy]").str.startswith("a")
0     True
1    False
2      NaN
dtype: object

Missing values are also propagated when using the "nullable" version of StringDtype (with string_storage of "python" or "pyarrow") or using the ArrowDtype("string"), but there the result uses a nullable boolean dtype instead of falling back to object dtype:

>>> pd.Series(["a", "b", None], dtype="string[python]").str.startswith("a")
0     True
1    False
2     <NA>
dtype: boolean

Here, this makes sense and doesn't pose any usability problems, because the resulting boolean dtype is also nullable.
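To illustrate why the nullable case is fine in practice, here is a minimal sketch (the variable names are only for illustration). It assumes the "string[python]" dtype from the example above and relies on pandas treating NA as False when indexing with a nullable boolean mask:

```python
import pandas as pd

# Nullable string dtype: the predicate returns the nullable "boolean" dtype.
nullable_ser = pd.Series(["a", "b", None], dtype="string[python]")
nullable_mask = nullable_ser.str.startswith("a")
print(nullable_mask.dtype)

# Boolean indexing accepts this mask; the <NA> entry is treated as False,
# so only the row that actually starts with "a" is selected.
selected = nullable_ser[nullable_mask]
print(selected.tolist())
```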

But in the first two examples, where the resulting boolean dtype would be the numpy bool dtype, we fall back to object-dtype when missing values are present.
And this gives some usability issues for the result. For example, you can't use that result for boolean indexing:

>>> ser = pd.Series(["a", "b", None], dtype="string[pyarrow_numpy]")
>>> ser[ser.str.startswith("a")]
...
ValueError: Cannot mask with non-boolean array containing NA / NaN values
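For reference, the workaround available today is the existing `na` keyword of the predicate methods, which fills the missing positions explicitly. A small sketch for the object-dtype case (variable names are illustrative; I would expect the other string dtypes to behave the same way, but have only shown object dtype here):

```python
import pandas as pd

obj_ser = pd.Series(["a", "b", None], dtype="object")

# na=False replaces the propagated missing value with False,
# so the result is a plain numpy bool series instead of object dtype.
filled_mask = obj_ser.str.startswith("a", na=False)
print(filled_mask.dtype)

# With a proper bool mask, boolean indexing works as expected.
print(obj_ser[filled_mask].tolist())
```

This is essentially the behaviour the proposal would make the default for the non-nullable string dtype.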

I know this has been long-standing behaviour for the object-dtype way of handling strings (and I can't remember getting too many complaints about this?). But when making the change for 3.0 with the new default string dtype, I think we do have a chance to make this easier to work with, and ensure those methods always return a bool dtype (by using False instead of propagating NaN).
On the other hand, for the nullable versions of the string dtype, we probably want to keep the propagating behaviour, and so that would introduce a new inconsistency between the different string storage types (but, this is also an inconsistency that already exists for other cases, such as comparison operators like == propagating NA vs giving False).
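The existing inconsistency for comparison operators mentioned above can be seen with a short sketch like the following (illustrative variable names; it compares object dtype with the nullable "string[python]" dtype):

```python
import pandas as pd

values = ["a", None]

# Object dtype: the missing value compares as False, result is numpy bool.
obj_eq = pd.Series(values, dtype="object") == "a"
print(obj_eq.dtype, obj_eq.tolist())

# Nullable string dtype: the missing value propagates as <NA>,
# and the result uses the nullable "boolean" dtype.
nullable_eq = pd.Series(values, dtype="string[python]") == "a"
print(nullable_eq.dtype, nullable_eq.isna().tolist())
```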

Labels: API Design; Strings (String extension data type and string data)
