-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Incorproate ArrowDtype into ArrowExtensionArray #47034
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 13 commits
9053263
088f72e
aee3dc8
aa13af8
d521264
ce05407
bf0365b
01e4a4b
cc1c687
97967a5
f2d872d
a77ea6b
26e8998
a157e51
baeae04
c33c345
901e9b0
b3f6d93
80059d5
5c873d5
1160bff
68bb030
9fd9161
939e751
1a5d3ff
01ca1c7
f2dda8c
26b2f1c
95bd38f
a455b50
8d6ebb5
b6972a5
0024d9e
a18fd6f
f6b779d
9edb6a4
d074188
f8983ad
1b6fe93
eedffc2
c69d70e
245fbe6
91aaaab
1a44a6d
86e178c
4129e37
c5d029f
4743781
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,23 +8,34 @@ | |
|
||
from pandas.core.dtypes.base import StorageExtensionDtype | ||
|
||
from pandas.core.arrays.arrow import ArrowExtensionArray | ||
|
||
|
||
class ArrowDtype(StorageExtensionDtype): | ||
""" | ||
Base class for dtypes for BaseArrowArray subclasses. | ||
Base class for dtypes for ArrowExtensionArray. | ||
Modeled after BaseMaskedDtype | ||
""" | ||
|
||
name: str | ||
base = None | ||
type: pa.DataType | ||
|
||
na_value = pa.NA | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This wasn't discussed in #46774, but so I think this should be For example StringDtype("pyarrow") is also using pd.NA. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm so if users want to actually want to access the underlying data, they need to convert the array to arrow first? And this also means we need to ensure we are never building a new There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
If we want that users can access the underlying data, then we should probably make the
pyarrow actually handles
But so this might need a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. One additional comment on this:
This is also the case for non-null values. When you access a scalar value ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does pa.NA have the same semantics as pd.NA? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, and that's what I meant with " Note this is the case in general for the pyarrow scalar objects. They are all light wrappers around the C++ data, exposing a method to convert it to a proper python scalar, and for the rest having no functionality to be used as a scalar object in general context (eg the int scalar also can't be used as a number, for example adding numbers or comparing values). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This extension array defines Gotcha, I see that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for clarifying joris. I don't have an opinion on this per se, just advice that pd.NA should only be used here if you very specifically want pd.NA semantics. Otherwise None or np.nan or just punting should be considered. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I do think we specifically want |
||
|
||
def __init__(self, storage="pyarrow") -> None: | ||
super().__init__(storage) | ||
def __init__(self, pa_dtype: pa.DataType) -> None: | ||
super().__init__("pyarrow") | ||
if not isinstance(pa_dtype, pa.DataType): | ||
raise ValueError("pa_dtype must be an instance of a pyarrow.DataType") | ||
self.pa_dtype = pa_dtype | ||
mroeschke marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
@property | ||
def type(self): | ||
""" | ||
The scalar type for the array, e.g. ``int`` | ||
""" | ||
return self.pa_dtype | ||
|
||
@property | ||
def name(self) -> str: # type: ignore[override] | ||
""" | ||
A string identifying the data type. | ||
""" | ||
return str(self.pa_dtype) | ||
|
||
@cache_readonly | ||
def numpy_dtype(self) -> np.dtype: | ||
|
@@ -49,6 +60,8 @@ def construct_array_type(cls): | |
------- | ||
type | ||
""" | ||
from pandas.core.arrays.arrow import ArrowExtensionArray | ||
|
||
return ArrowExtensionArray | ||
|
||
@classmethod | ||
|
@@ -59,29 +72,44 @@ def construct_from_string(cls, string: str): | |
Parameters | ||
---------- | ||
string : str | ||
string should follow the format f"{pyarrow_type}[pyarrow]" | ||
e.g. int64[pyarrow] | ||
mroeschke marked this conversation as resolved.
Show resolved
Hide resolved
|
||
""" | ||
if not isinstance(string, str): | ||
raise TypeError( | ||
f"'construct_from_string' expects a string, got {type(string)}" | ||
) | ||
if string == f"{cls.name}[pyarrow]": | ||
return cls(storage="pyarrow") | ||
raise TypeError(f"Cannot construct a '{cls.__name__}' from '{string}'") | ||
if not string.endswith("[pyarrow]"): | ||
raise TypeError(f"string {string} must end with '[pyarrow]'") | ||
base_type = string.split("[pyarrow]")[0] | ||
pa_dtype = getattr(pa, base_type, None) | ||
if pa_dtype is None: | ||
raise TypeError(f"'{base_type}' is not a valid pyarrow data type.") | ||
return cls(pa_dtype()) | ||
|
||
@property | ||
def _is_numeric(self) -> bool: | ||
""" | ||
Whether columns with this dtype should be considered numeric. | ||
""" | ||
# TODO: pa.types.is_boolean? | ||
return ( | ||
pa.types.is_integer(self.pa_dtype) | ||
or pa.types.is_floating(self.pa_dtype) | ||
or pa.types.is_decimal(self.pa_dtype) | ||
) | ||
|
||
@classmethod | ||
def from_numpy_dtype(cls, dtype: np.dtype) -> ArrowDtype: | ||
@property | ||
def _is_boolean(self) -> bool: | ||
""" | ||
Construct the ArrowDtype corresponding to the given numpy dtype. | ||
Whether this dtype should be considered boolean. | ||
""" | ||
# TODO: This may be incomplete | ||
pa_dtype = pa.from_numpy_dtype(dtype) | ||
if pa_dtype is cls.type: | ||
return cls() | ||
raise NotImplementedError(dtype) | ||
return pa.types.is_boolean(self.pa_dtype) | ||
|
||
def _get_common_dtype(self, dtypes: list[DtypeObj]) -> DtypeObj | None: | ||
# We unwrap any masked dtypes, find the common dtype we would use | ||
# for that, then re-mask the result. | ||
# Mirrors BaseMaskedDtype | ||
from pandas.core.dtypes.cast import find_common_type | ||
|
||
new_dtype = find_common_type( | ||
|
@@ -91,12 +119,9 @@ def _get_common_dtype(self, dtypes: list[DtypeObj]) -> DtypeObj | None: | |
] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Something to note somewhere (rather for a follow-up) that the above will only work for pyarrow types that have a matching numpy dtype, which is not the case for all types (eg decimal, dictionary, date, nested types, etc) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I defaulted the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not sure that will help (but without having tests that cover this it is hard to say). If For example, if you have one array of decimal and one of float, the common dtype could be float (not sure we want to do this, but let's assume for the example). With the current implementation, the So at some point, we should probably include pyarrow-specific logic in here that doesn't rely on converting to a numpy dtype and numpy's notion of a common type. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Agreed, and makes sense that this shouldn't be |
||
) | ||
if not isinstance(new_dtype, np.dtype): | ||
# If we ever support e.g. Masked[DatetimeArray] then this will change | ||
return None | ||
try: | ||
return type(self).from_numpy_dtype(new_dtype) | ||
except (KeyError, NotImplementedError): | ||
return None | ||
pa_dtype = pa.from_numpy_dtype(new_dtype) | ||
mroeschke marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return type(self)(pa_dtype) | ||
|
||
def __from_arrow__(self, array: pa.Array | pa.ChunkedArray): | ||
""" | ||
|
Uh oh!
There was an error while loading. Please reload this page.