ENH: Incorporate ArrowDtype into ArrowExtensionArray #47034

Merged
merged 48 commits on Jun 9, 2022

Changes from 13 commits

Commits (48)
9053263
Add other dtype attributes
mroeschke May 16, 2022
088f72e
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 16, 2022
aee3dc8
add pa_type in the constructor and modify methods as needed
mroeschke May 16, 2022
aa13af8
Have ArrowExtensionArray support ArrowDtype
mroeschke May 16, 2022
d521264
Fix tests
mroeschke May 16, 2022
ce05407
add ImportError raise
mroeschke May 17, 2022
bf0365b
Just partial match
mroeschke May 17, 2022
01e4a4b
Address typing
mroeschke May 17, 2022
cc1c687
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 17, 2022
97967a5
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 17, 2022
f2d872d
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 19, 2022
a77ea6b
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 19, 2022
26e8998
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 19, 2022
a157e51
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 20, 2022
baeae04
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 23, 2022
c33c345
Complete more methods of ArrowExtensionArray
mroeschke May 24, 2022
901e9b0
Add types and first test
mroeschke May 24, 2022
b3f6d93
Fix getitem type thing
mroeschke May 24, 2022
80059d5
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 24, 2022
5c873d5
Try import or skip:
mroeschke May 24, 2022
1160bff
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 24, 2022
68bb030
Fix typo
mroeschke May 24, 2022
9fd9161
Fix data size, coercion of pa.NA in lists
mroeschke May 24, 2022
939e751
change pa_dtype to pyarrow dtype
mroeschke May 25, 2022
1a5d3ff
Address more tests
mroeschke May 25, 2022
01ca1c7
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 25, 2022
f2dda8c
Add register_extension_dtype
mroeschke May 25, 2022
26b2f1c
Address Joris' comments
mroeschke May 25, 2022
95bd38f
Revert to self.name, xfail the dtype test due to conflict
mroeschke May 25, 2022
a455b50
Add getitem tests
mroeschke May 25, 2022
8d6ebb5
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 25, 2022
b6972a5
Add conditions when fails for other pyarrow versions
mroeschke May 25, 2022
0024d9e
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 26, 2022
a18fd6f
Fix docstring validate
mroeschke May 26, 2022
f6b779d
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 26, 2022
9edb6a4
Fix typing errors
mroeschke May 26, 2022
d074188
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 27, 2022
f8983ad
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 30, 2022
1b6fe93
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 31, 2022
eedffc2
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke May 31, 2022
c69d70e
Remove incorrectly implemented _from_factorized
mroeschke May 31, 2022
245fbe6
Add notimplementederror for construct_from_string with parameters
mroeschke May 31, 2022
91aaaab
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke Jun 1, 2022
1a44a6d
Address review
mroeschke Jun 1, 2022
86e178c
Add pyarrow_dtype to _metadata
mroeschke Jun 1, 2022
4129e37
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke Jun 1, 2022
c5d029f
Address typing and fix data fixture
mroeschke Jun 1, 2022
4743781
Merge remote-tracking branch 'upstream/main' into enh/arrowdtype_support
mroeschke Jun 5, 2022
20 changes: 16 additions & 4 deletions pandas/core/arrays/arrow/array.py
@@ -39,6 +39,7 @@
import pyarrow.compute as pc

from pandas.core.arrays.arrow._arrow_utils import fallback_performancewarning
from pandas.core.arrays.arrow.dtype import ArrowDtype

if TYPE_CHECKING:
from pandas import Series
@@ -48,16 +49,27 @@

class ArrowExtensionArray(ExtensionArray):
"""
Base class for ExtensionArray backed by Arrow array.
Base class for ExtensionArray backed by Arrow ChunkedArray.
"""

_data: pa.ChunkedArray

def __init__(self, values: pa.ChunkedArray) -> None:
self._data = values
def __init__(self, values: pa.Array | pa.ChunkedArray) -> None:
if pa_version_under1p01:
msg = "pyarrow>=1.0.0 is required for PyArrow backed ArrowExtensionArray."
raise ImportError(msg)
if isinstance(values, pa.Array):
self._data = pa.chunked_array([values])
elif isinstance(values, pa.ChunkedArray):
self._data = values
else:
raise ValueError(
f"Unsupported type '{type(values)}' for ArrowExtensionArray"
)
self._dtype = ArrowDtype(self._data.type)

def __arrow_array__(self, type=None):
"""Convert myself to a pyarrow Array or ChunkedArray."""
"""Convert myself to a pyarrow ChunkedArray."""
return self._data

def equals(self, other) -> bool:
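For illustration (not part of the diff), a minimal sketch of what the reworked constructor does, assuming pyarrow is installed; the private _data attribute is only shown here to mirror the code above:

>>> import pyarrow as pa
>>> from pandas.core.arrays.arrow.array import ArrowExtensionArray
>>> arr = ArrowExtensionArray(pa.array([1, 2, None], type=pa.int64()))
>>> arr._data.num_chunks  # a plain pa.Array is wrapped into a one-chunk ChunkedArray
1
>>> arr._data.type        # the ArrowDtype is derived from this stored pyarrow type
DataType(int64)
>>> ArrowExtensionArray([1, 2, None])
ValueError: Unsupported type '<class 'list'>' for ArrowExtensionArray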
75 changes: 50 additions & 25 deletions pandas/core/arrays/arrow/dtype.py
@@ -8,23 +8,34 @@

from pandas.core.dtypes.base import StorageExtensionDtype

from pandas.core.arrays.arrow import ArrowExtensionArray


class ArrowDtype(StorageExtensionDtype):
"""
Base class for dtypes for BaseArrowArray subclasses.
Base class for dtypes for ArrowExtensionArray.
Modeled after BaseMaskedDtype
"""

name: str
base = None
type: pa.DataType

na_value = pa.NA
Member:

This wasn't discussed in #46774, but I think this should be pd.NA instead of pa.NA. This is the actual object that users get when accessing a scalar missing value (not how it is stored under the hood), and so pa.NA has no useful / proper behaviour to serve as a user-facing object.

For example, StringDtype("pyarrow") also uses pd.NA.

Member Author:

Hmm, so if users actually want to access the underlying data, they need to convert the array to arrow first?

And this also means we need to ensure we are never building a new pa.array/chunked_array with pd.NA?

Member:

> Hmm, so if users actually want to access the underlying data, they need to convert the array to arrow first?

If we want users to be able to access the underlying data, then we should probably make the _data attribute public, or add a to_pyarrow() method in analogy with to_numpy() (users can actually also do pa.array(ext_arr), and that will also give them the underlying arrow array).

> And this also means we need to ensure we are never building a new pa.array/chunked_array with pd.NA?

pyarrow actually handles pd.NA well (although this might require a more recent version than 1.0, didn't check), and not pa.NA (that sounds strange, I know ..):

>>> pa.array([1, pd.NA], from_pandas=True)
<pyarrow.lib.Int64Array object at 0x7fccccb81d00>
[
  1,
  null
]

>>> pa.array([1, pa.NA], from_pandas=True)
...
ArrowInvalid: Could not convert None with type pyarrow.lib.NullScalar: did not recognize Python value type when inferring an Arrow data type

So this might need a from_pandas=True to be added in _from_sequence.

Member:

One additional comment on this:

> Hmm, so if users actually want to access the underlying data, they need to convert the array to arrow first?

This is also the case for non-null values. When you access a scalar value (arr[i]), we convert that to a "python" scalar (the as_py() call in __getitem__), so here too you don't have access to the underlying pyarrow data / scalar.

Member:

Does pa.NA have the same semantics as pd.NA?

Member:

No, and that's what I meant by "pa.NA has no useful / proper behaviour to serve as a user-facing object". Basically, pa.NA has no semantics whatsoever, apart from being a python object (and having an as_py() method that returns None, and being equal to itself).

Note this is the case in general for the pyarrow scalar objects. They are all light wrappers around the C++ data, exposing a method to convert it to a proper python scalar, and otherwise have no functionality to be used as a scalar object in a general context (e.g. the int scalar also can't be used as a number, for example for adding or comparing values).

Member Author:

This extension array defines __arrow_array__, so if users want the underlying data they can call pa.array(...) (which seems reasonable to me)?

Gotcha, I see that __getitem__ already coerces scalars with as_py, so given that I agree it makes more sense to return pd.NA instead of pa.NA.

Member:

Thanks for clarifying, Joris. I don't have an opinion on this per se, just advice that pd.NA should only be used here if you very specifically want pd.NA semantics. Otherwise None or np.nan or just punting should be considered.

Member:

I do think we specifically want pd.NA, as the behaviour of pd.NA generally matches the behaviour of nulls in Arrow arrays (e.g. propagating in comparisons, instead of returning False).
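For illustration, a small sketch of the pd.NA semantics referred to here (missing values propagate, with three-valued logic for booleans), which is what makes it a better user-facing missing value than pa.NA:

>>> import pandas as pd
>>> pd.NA == 1    # comparisons propagate the missing value instead of returning False
<NA>
>>> pd.NA | True  # Kleene ("three-valued") logic for boolean operations
True
>>> pd.NA | False
<NA>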


def __init__(self, storage="pyarrow") -> None:
super().__init__(storage)
def __init__(self, pa_dtype: pa.DataType) -> None:
super().__init__("pyarrow")
if not isinstance(pa_dtype, pa.DataType):
raise ValueError("pa_dtype must be an instance of a pyarrow.DataType")
self.pa_dtype = pa_dtype

@property
def type(self):
"""
The scalar type for the array, e.g. ``int``
"""
return self.pa_dtype

@property
def name(self) -> str: # type: ignore[override]
"""
A string identifying the data type.
"""
return str(self.pa_dtype)

@cache_readonly
def numpy_dtype(self) -> np.dtype:
@@ -49,6 +60,8 @@ def construct_array_type(cls):
-------
type
"""
from pandas.core.arrays.arrow import ArrowExtensionArray

return ArrowExtensionArray

@classmethod
@@ -59,29 +72,44 @@ def construct_from_string(cls, string: str):
Parameters
----------
string : str
string should follow the format f"{pyarrow_type}[pyarrow]"
e.g. int64[pyarrow]
"""
if not isinstance(string, str):
raise TypeError(
f"'construct_from_string' expects a string, got {type(string)}"
)
if string == f"{cls.name}[pyarrow]":
return cls(storage="pyarrow")
raise TypeError(f"Cannot construct a '{cls.__name__}' from '{string}'")
if not string.endswith("[pyarrow]"):
raise TypeError(f"string {string} must end with '[pyarrow]'")
base_type = string.split("[pyarrow]")[0]
pa_dtype = getattr(pa, base_type, None)
if pa_dtype is None:
raise TypeError(f"'{base_type}' is not a valid pyarrow data type.")
return cls(pa_dtype())
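As an illustrative sketch (not taken from this diff's tests), the accepted string form is the pyarrow type name followed by "[pyarrow]", and the lookup is a plain getattr on the pyarrow module:

>>> import pyarrow as pa
>>> from pandas.core.arrays.arrow.dtype import ArrowDtype
>>> dt = ArrowDtype.construct_from_string("int64[pyarrow]")    # equivalent to ArrowDtype(pa.int64())
>>> dt = ArrowDtype.construct_from_string("float64[pyarrow]")  # equivalent to ArrowDtype(pa.float64())
>>> ArrowDtype.construct_from_string("int64")                  # missing the suffix
TypeError: string int64 must end with '[pyarrow]'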

@property
def _is_numeric(self) -> bool:
"""
Whether columns with this dtype should be considered numeric.
"""
# TODO: pa.types.is_boolean?
return (
pa.types.is_integer(self.pa_dtype)
or pa.types.is_floating(self.pa_dtype)
or pa.types.is_decimal(self.pa_dtype)
)

@classmethod
def from_numpy_dtype(cls, dtype: np.dtype) -> ArrowDtype:
@property
def _is_boolean(self) -> bool:
"""
Construct the ArrowDtype corresponding to the given numpy dtype.
Whether this dtype should be considered boolean.
"""
# TODO: This may be incomplete
pa_dtype = pa.from_numpy_dtype(dtype)
if pa_dtype is cls.type:
return cls()
raise NotImplementedError(dtype)
return pa.types.is_boolean(self.pa_dtype)

def _get_common_dtype(self, dtypes: list[DtypeObj]) -> DtypeObj | None:
# We unwrap any masked dtypes, find the common dtype we would use
# for that, then re-mask the result.
# Mirrors BaseMaskedDtype
from pandas.core.dtypes.cast import find_common_type

new_dtype = find_common_type(
@@ -91,12 +119,9 @@ def _get_common_dtype(self, dtypes: list[DtypeObj]) -> DtypeObj | None:
]
Member:

Something to note somewhere (rather for a follow-up): the above will only work for pyarrow types that have a matching numpy dtype, which is not the case for all types (e.g. decimal, dictionary, date, nested types, etc.).

Member Author:

I defaulted the numpy_dtype to np.dtype(object) for those pyarrow types without a corresponding numpy type.

https://github.com/pandas-dev/pandas/pull/47034/files#diff-251a2430af30e9f685a07dbee6d5c024e832fc7e10569c7d43ee46bc115b422dR56

Member:

I am not sure that will help (but without having tests that cover this it is hard to say). If .numpy_dtype returns object dtype, then the common dtype here will also be object dtype, and then pa.from_numpy_dtype below will fail (so this function will return None). That means that there can never be a proper ArrowDtype common dtype (that is not object dtype) for such arrow types.

For example, if you have one array of decimal and one of float, the common dtype could be float (not sure we want to do this, but let's assume for the example). With the current implementation, the numpy_dtype for those extension dtypes will be np.dtype(object) and np.dtype(float). The common dtype for that will always be np.dtype(object), which means that concatting such columns will result in a cast to object dtype, instead of casting to/preserving the float dtype.

So at some point, we should probably include pyarrow-specific logic in here that doesn't rely on converting to a numpy dtype and numpy's notion of a common type.

Member Author:

> So at some point, we should probably include pyarrow-specific logic in here that doesn't rely on converting to a numpy dtype and numpy's notion of a common type.

Agreed, and makes sense that this shouldn't be object dtype in the long term. Would be great if this eventually follows pyarrow's type coercion rules if there is one :)

)
if not isinstance(new_dtype, np.dtype):
# If we ever support e.g. Masked[DatetimeArray] then this will change
return None
try:
return type(self).from_numpy_dtype(new_dtype)
except (KeyError, NotImplementedError):
return None
pa_dtype = pa.from_numpy_dtype(new_dtype)
return type(self)(pa_dtype)
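To make the concern from the discussion above concrete, a small sketch (an assumption about the fallback behaviour, not a test from this PR): once one operand's numpy_dtype falls back to object, numpy's notion of a common type is object as well, and pa.from_numpy_dtype cannot turn that back into a pyarrow type, so no non-object common ArrowDtype is recovered:

>>> import numpy as np
>>> from pandas.core.dtypes.cast import find_common_type
>>> find_common_type([np.dtype("float64"), np.dtype(object)])  # e.g. float vs a decimal-backed ArrowDtype
dtype('O')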

def __from_arrow__(self, array: pa.Array | pa.ChunkedArray):
"""
10 changes: 3 additions & 7 deletions pandas/core/arrays/string_arrow.py
@@ -76,7 +76,7 @@

def _chk_pyarrow_available() -> None:
if pa_version_under1p01:
msg = "pyarrow>=1.0.0 is required for PyArrow backed StringArray."
msg = "pyarrow>=1.0.0 is required for PyArrow backed ArrowExtensionArray."
raise ImportError(msg)


@@ -132,13 +132,9 @@ class ArrowStringArray(
"""

def __init__(self, values) -> None:
super().__init__(values)
# TODO: Migrate to ArrowDtype instead
self._dtype = StringDtype(storage="pyarrow")
if isinstance(values, pa.Array):
self._data = pa.chunked_array([values])
elif isinstance(values, pa.ChunkedArray):
self._data = values
else:
raise ValueError(f"Unsupported type '{type(values)}' for ArrowStringArray")

if not pa.types.is_string(self._data.type):
raise ValueError(
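A hedged sketch of the resulting behaviour (assuming pyarrow is installed): the type check now lives in the shared ArrowExtensionArray.__init__, which is why the expected error message in the tests below changes:

>>> import numpy as np
>>> from pandas.core.arrays.string_arrow import ArrowStringArray
>>> ArrowStringArray(np.array(["a", "b"], dtype="S1"))
ValueError: Unsupported type '<class 'numpy.ndarray'>' for ArrowExtensionArray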
2 changes: 1 addition & 1 deletion pandas/tests/arrays/string_/test_string.py
@@ -260,7 +260,7 @@ def test_constructor_raises(cls):
if cls is pd.arrays.StringArray:
msg = "StringArray requires a sequence of strings or pandas.NA"
else:
msg = "Unsupported type '<class 'numpy.ndarray'>' for ArrowStringArray"
msg = "Unsupported type '<class 'numpy.ndarray'>' for ArrowExtensionArray"

with pytest.raises(ValueError, match=msg):
cls(np.array(["a", "b"], dtype="S1"))
4 changes: 2 additions & 2 deletions pandas/tests/arrays/string_/test_string_arrow.py
@@ -59,7 +59,7 @@ def test_constructor_not_string_type_raises(array, chunked):
pytest.skip("chunked not applicable to numpy array")
arr = pa.chunked_array(arr)
if array is np:
msg = "Unsupported type '<class 'numpy.ndarray'>' for ArrowStringArray"
msg = "Unsupported type '<class 'numpy.ndarray'>' for ArrowExtensionArray"
else:
msg = re.escape(
"ArrowStringArray requires a PyArrow (chunked) array of string type"
@@ -122,7 +122,7 @@ def test_pyarrow_not_installed_raises():
reason="pyarrow is installed",
)
def test_pyarrow_not_installed_raises():
msg = re.escape("pyarrow>=1.0.0 is required for PyArrow backed StringArray")
msg = re.escape("pyarrow>=1.0.0 is required for PyArrow backed")

with pytest.raises(ImportError, match=msg):
StringDtype(storage="pyarrow")
4 changes: 2 additions & 2 deletions pandas/tests/extension/arrow/arrays.py
@@ -185,7 +185,7 @@ def __init__(self, values) -> None:

assert values.type == pa.bool_()
self._data = values
self._dtype = ArrowBoolDtype()
self._dtype = ArrowBoolDtype() # type: ignore[assignment]


class ArrowStringArray(ArrowExtensionArray):
@@ -195,4 +195,4 @@ def __init__(self, values) -> None:

assert values.type == pa.string()
self._data = values
self._dtype = ArrowStringDtype()
self._dtype = ArrowStringDtype() # type: ignore[assignment]
2 changes: 1 addition & 1 deletion pandas/tests/extension/arrow/test_timestamp.py
@@ -46,7 +46,7 @@ def __init__(self, values) -> None:

assert values.type == pa.timestamp("us")
self._data = values
self._dtype = ArrowTimestampUSDtype()
self._dtype = ArrowTimestampUSDtype() # type: ignore[assignment]


def test_constructor_extensionblock():