TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics)

Overview of work for the future string dtype ([PDEP-14](https://pandas.pydata.org/pdeps/0014-string-dtype.html)).

Main implementation:
- [x] Implement the object-dtype based fallback: 
  - [x] https://github.com/pandas-dev/pandas/pull/58451
- [x] Rename the storage options (from the PDEP: `storage="pyarrow_numpy"` to `storage="pyarrow", na_value=np.nan`)
  - [x] https://github.com/pandas-dev/pandas/pull/59330 
  - [x] Update the tests to stop using "pyarrow_numpy" -> https://github.com/pandas-dev/pandas/pull/59758
  - [x] Deprecate the "pyarrow_numpy" storage option -> https://github.com/pandas-dev/pandas/pull/60152
  - [x] https://github.com/pandas-dev/pandas/pull/59376
- [x] Change the string alias to `"str"` for the NaN-variant of the dtype
  - [x] https://github.com/pandas-dev/pandas/pull/59388 
- [ ] Ensure `dtype=str` / `astype(str)` work as aliases when the future mode is enabled (and any other alias that currently converts the input to strings, like "U"?)
- [x] https://github.com/pandas-dev/pandas/pull/59610

Testing related:
- [x] Set up test build with `future.infer_string` enabled: 
  - [x] https://github.com/pandas-dev/pandas/pull/58459
  - [x] https://github.com/pandas-dev/pandas/pull/59329 and https://github.com/pandas-dev/pandas/pull/59352
  - [x] https://github.com/pandas-dev/pandas/pull/59345
  - [x] https://github.com/pandas-dev/pandas/pull/59368
  - [x] https://github.com/pandas-dev/pandas/pull/59375
  - [x] Add an equivalent build with infer_string but without pyarrow installed: https://github.com/pandas-dev/pandas/pull/59437
- [ ] Get all tests passing with `future.infer_string` (tackle all `xfails` / `TODO(infer_string)` tests)
  - [x] https://github.com/pandas-dev/pandas/pull/59323
  - [x] https://github.com/pandas-dev/pandas/pull/59430
  - [x] https://github.com/pandas-dev/pandas/pull/59433
  - [ ] Fix tests:
    - [x] IO - SQL
    - [ ] IO - pytables
    - [x] IO - parser
    - [x] IO - parquet/feather/orc: https://github.com/pandas-dev/pandas/pull/59478
    - [ ] ...

Open design questions / behaviour changes to implement:

* [x] https://github.com/pandas-dev/pandas/issues/54805
* [x] https://github.com/pandas-dev/pandas/issues/54807
* [x] https://github.com/pandas-dev/pandas/issues/58581

Known bugs that need to be fixed:

- [x] https://github.com/pandas-dev/pandas/issues/55834
- [x] https://github.com/pandas-dev/pandas/issues/59879
- [ ] https://github.com/pandas-dev/pandas/issues/55833
- [ ] https://github.com/pandas-dev/pandas/issues/54798
- [ ] https://github.com/pandas-dev/pandas/issues/54190
- [ ] https://github.com/pandas-dev/pandas/issues/60234
- [ ] https://github.com/pandas-dev/pandas/issues/60228
- [x] https://github.com/pandas-dev/pandas/issues/60229


Documentation:
- [ ] List and document all breaking / behaviour changes in upgrade guide
  - [ ] https://github.com/pandas-dev/pandas/issues/59328

---

*[original issue body]*

With PDEP-10 (https://github.com/pandas-dev/pandas/pull/52711/, https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html), we decided to start using pyarrow for the default string data type in pandas 3.0.

In pandas 2.1, an option was added to opt in to this future default data type. With it enabled, the various ways to construct a DataFrame (type inference in the constructor, IO methods) use the new string dtype by default:

```python
>>> pd.options.future.infer_string = True

>>> pd.Series(["a", "b", None])
0      a
1      b
2    NaN
dtype: string

>>> pd.Series(["a", "b", None]).dtype
string[pyarrow_numpy]
```

This is documented at https://pandas.pydata.org/docs/dev/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings

One aspect that was discussed after the PDEP (mostly at the sprint, I think; I am creating this issue for a better public record of it) is that a data type that will become the default in pandas 3.0 (which otherwise still uses numpy dtypes with numpy NaN missing value semantics everywhere) should probably also use those same default semantics. That means operations on a string column that produce a boolean or numeric result (e.g. `.str.startswith(..)`, `.str.len(..)`, `.str.count()`, etc., or comparison operators like `==`) should return numpy data types.
(This way, a user only gets an ArrowDtype column when explicitly asking for one, and not by default through using the default string dtype.)

To achieve this, @phofl has done several PRs to refactor the current pyarrow-based string dtypes, to add another variant which uses `StringDtype(storage="pyarrow_numpy")` instead of `ArrowDtype("string")`. From the updated whatsnew: *"This is a new string dtype implementation that follows NumPy semantics in comparison operations and will return ``np.nan`` as the missing value indicator"*. Main PR:

* https://github.com/pandas-dev/pandas/pull/54533

plus some follow-ups (https://github.com/pandas-dev/pandas/pull/54720, https://github.com/pandas-dev/pandas/pull/54585, https://github.com/pandas-dev/pandas/pull/54591).

cc @pandas-dev/pandas-core 

