Description
Overview of work for the future string dtype (PDEP-14).
Main implementation:
- Implement the object-dtype based fallback:
- Rename the storage options (from the PDEP: storage="pyarrow_numpy" to storage="pyarrow", na_value=np.nan)
- String dtype: rename the storage options and add na_value keyword in StringDtype() #59330
#59330 - Update the tests to stop using "pyarrow_numpy" -> TST (string dtype): remove usage of 'string[pyarrow_numpy]' alias #59758
- Deprecate the "pyarrow_numpy" storage option -> String dtype: deprecate the pyarrow_numpy storage option #60152
- String dtype: restrict options.mode.string_storage to python|pyarrow (remove pyarrow_numpy) #59376
- Change the string alias to "str" for the NaN-variant of the dtype
- Ensure dtype=str / astype(str) works as an alias when the future mode is enabled (and any other alias which will currently convert the input to strings, like "U"?)
- String dtype: avoid surfacing pyarrow exception in binary operations #59610
Testing related:
- Set up test build with future.infer_string enabled:
- TST / string dtype: add env variable to enable future_string and add test build #58459
- TST (string dtype): xfail all currently failing tests with future.infer_string #59329 and TST (string dtype): follow-up on GH-59329 fixing new xfails #59352
- TST (string dtype): change any_string_dtype fixture to use actual dtype instances #59345
- TST (string dtype): remove usage of arrow_string_storage fixture #59368
- TST (string dtype): replace string_storage fixture with explicit storage/na_value keyword arguments for dtype creation #59375
- Add an equivalent build with infer_string but without pyarrow installed: TST (string dtype): add test build with future strings enabled without pyarrow #59437
- Get all tests passing with future.infer_string (tackle all xfails / TODO(infer_string) tests)
- TST (string dtype): clean-up xpasssing tests with future string dtype #59323
- TST (string dtype): fix groupby xfails with using_infer_string + update error message #59430
- TST (string dtype): un-xfail string tests specific to object dtype #59433
- Fix tests:
- IO - SQL
- IO - pytables
- IO - parser
- IO - parquet/feather/orc: String dtype: fix pyarrow-based IO + update tests #59478
- ...
Open design questions / behaviour changes to implement:
- API: string dtype propagation of NaNs in predicate methods (eg .str.startswith) #54805
- API: behaviour for the "str.center()" string method for the pyarrow-backed string dtype #54807
- Default string dtype should not raise fallback performance warnings #58581
Known bugs that need to be fixed:
- BUG: Index.get_indexer casts values is given as list #55834
- BUG (string dtype): looking up missing value in the Index fails #59879
- BUG: Index.get_indexer will change behaviour for nulls with arrow strings #55833
- BUG: inferring the future string dtype doesn't work for pyarrow's "large_string" type #54798
- BUG: testing.assert_frame_equal unhelpful error message for string[pyarrow] #54190
- BUG (string dtype): logical operation with bool and string failing #60234
- BUG (string dtype): comparison of string column to mixed object column fails #60228
- BUG/API: sum of a string column with all-NaN or empty #60229
Documentation:
- List and document all breaking / behaviour changes in upgrade guide
[original issue body]
With PDEP-10 (#52711, https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html), we decided to start using pyarrow for the default string data type in pandas 3.0.
For pandas 2.1, an option was added to opt in to this future default data type. With it enabled, the various ways to construct a DataFrame (type inference in the constructor, IO methods) use the new string dtype by default:
>>> pd.options.future.infer_string = True
>>> pd.Series(["a", "b", None])
0      a
1      b
2    NaN
dtype: string
>>> pd.Series(["a", "b", None]).dtype
string[pyarrow_numpy]
This is documented at https://pandas.pydata.org/docs/dev/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings
One aspect that was discussed after the PDEP (mostly at the sprint, I think; creating this issue for a better public record of it) is that a data type that would become the default in pandas 3.0 (which otherwise still uses all numpy dtypes with numpy NaN missing value semantics) should probably also use the same default semantics, and result in numpy data types when doing operations on the string column that produce a boolean or numeric result (eg .str.startswith(..), .str.len(..), .str.count(), etc, or comparison operators like ==).
(This way, a user only gets an ArrowDtype column when explicitly asking for one, and not by default through using the default string dtype.)
To achieve this, @phofl has done several PRs to refactor the current pyarrow-based string dtypes, to add another variant which uses StringDtype(storage="pyarrow_numpy") instead of ArrowDtype("string"). From the updated whatsnew: "This is a new string dtype implementation that follows NumPy semantics in comparison operations and will return np.nan as the missing value indicator". Main PR:
plus some follow-ups (#54720, #54585, #54591).
cc @pandas-dev/pandas-core