Description
I've been using the new Arrow backed dtypes, and I'm a bit confused on how it is decided which backend is used. One example:
```python
>>> with pandas.option_context("mode.dtype_backend", "pyarrow"):
...     pandas.Series([1, 2, 3, 4])
...
0    1
1    2
2    3
3    4
dtype: int64
```
Why is setting `dtype_backend` to `"pyarrow"` not enough to use Arrow in the `Series` constructor when no dtype is specified?

Also, when using, for example, `read_csv`:
```python
>>> import pandas
>>> pandas.read_csv('test.csv').dtypes
name    object
age      int64
dtype: object
>>> pandas.read_csv('test.csv', use_nullable_dtypes=True).dtypes
name    string[python]
age              Int64
dtype: object
>>> with pandas.option_context("mode.dtype_backend", "pyarrow"):
...     pandas.read_csv('test.csv').dtypes
...
name    object
age      int64
dtype: object
>>> with pandas.option_context("mode.dtype_backend", "pyarrow"):
...     pandas.read_csv('test.csv', use_nullable_dtypes=True).dtypes
...
name    string[pyarrow]
age     int64[pyarrow]
dtype: object
```
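For anyone reproducing this without my `test.csv`, here is a self-contained version of the default-behaviour case (the inline CSV is my stand-in, assuming the file has one string column and one integer column):

```python
import io

import pandas as pd

# Stand-in for test.csv: one string column, one integer column.
csv = io.StringIO("name,age\nAlice,30\nBob,25\n")

# With no option set and no keyword passed, read_csv infers
# plain NumPy-backed dtypes.
df = pd.read_csv(csv)
print(df.dtypes)  # name: object, age: int64
```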
Why, again, is it not enough that the user set the backend to `pyarrow` to get Arrow dtypes? Why do they also need to pass `use_nullable_dtypes=True`? This is what we currently return, which doesn't make sense to me:
| | `dtype_backend=None` | `dtype_backend=pyarrow` |
|---|---|---|
| `use_nullable_dtypes=False` | NumPy | NumPy ??? |
| `use_nullable_dtypes=True` | Arrow+NumPy nullables | Arrow |
What I would expect:
| | `dtype_backend=None` | `dtype_backend=pyarrow` |
|---|---|---|
| `use_nullable_dtypes=False` | NumPy | Arrow |
| `use_nullable_dtypes=True` | Arrow eventually, Arrow+NumPy nullables for now | Arrow |
Sorry if I missed the discussion; maybe I'm just missing something. But I don't see the use case for a user explicitly saying they want Arrow types via the option and still getting NumPy-backed Series and DataFrames... Is this something that was agreed on, or did we just not make the changes yet to get the more intuitive behavior?
CC: @mroeschke