-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
REF: Implement NDArrayBackedExtensionArray #33660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
85ea4fa
d43950c
aad9970
95db99f
42ca6a7
fc304a0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
from typing import Any, Sequence, TypeVar | ||
|
||
import numpy as np | ||
|
||
from pandas.errors import AbstractMethodError | ||
|
||
from pandas.core.algorithms import take | ||
from pandas.core.arrays.base import ExtensionArray | ||
|
||
_T = TypeVar("_T", bound="NDArrayBackedExtensionArray") | ||
|
||
|
||
class NDArrayBackedExtensionArray(ExtensionArray): | ||
""" | ||
ExtensionArray that is backed by a single NumPy ndarray. | ||
""" | ||
|
||
_ndarray: np.ndarray | ||
|
||
def _from_backing_data(self: _T, arr: np.ndarray) -> _T: | ||
""" | ||
Construct a new ExtensionArray `new_array` with `arr` as its _ndarray. | ||
|
||
This should round-trip: | ||
self == self._from_backing_data(self._ndarray) | ||
""" | ||
raise AbstractMethodError(self) | ||
|
||
# ------------------------------------------------------------------------ | ||
|
||
def take( | ||
self: _T, | ||
indices: Sequence[int], | ||
allow_fill: bool = False, | ||
fill_value: Any = None, | ||
) -> _T: | ||
if allow_fill: | ||
fill_value = self._validate_fill_value(fill_value) | ||
|
||
new_data = take( | ||
self._ndarray, indices, allow_fill=allow_fill, fill_value=fill_value, | ||
) | ||
return self._from_backing_data(new_data) | ||
|
||
def _validate_fill_value(self, fill_value): | ||
""" | ||
If a fill_value is passed to `take` convert it to a representation | ||
suitable for self._ndarray, raising ValueError if this is not possible. | ||
|
||
Parameters | ||
---------- | ||
fill_value : object | ||
|
||
Returns | ||
------- | ||
fill_value : native representation | ||
|
||
Raises | ||
------ | ||
ValueError | ||
""" | ||
raise AbstractMethodError(self) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. BaseMaskedArray is in core/arrays/masked.py. If the purpose of this is to create a common base class for another subset of extensionarrays, can we make the module names for the base classes more consistent. We could also restructure the extension array tests with a similar hierarchy, so that could influence the naming. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm open to other naming ideas; what do you have in mind? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we already have core/arrays/numpy_ for PandasArray, so the simple numpy name is taken. The difference though is a PandasArray can be instantiated (It is also possible to instantiate a BaseMaskedArray directly) whereas NDArrayBackedExtensionArray is abstract. maybe we need a subdirectory of core/arrays for the base classes. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. could reasonably put the mixin in arrays.numpy_ There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sounds good. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -49,9 +49,10 @@ | |
from pandas.core import ops | ||
from pandas.core.accessor import PandasDelegate, delegate_names | ||
import pandas.core.algorithms as algorithms | ||
from pandas.core.algorithms import _get_data_algo, factorize, take, take_1d, unique1d | ||
from pandas.core.algorithms import _get_data_algo, factorize, take_1d, unique1d | ||
from pandas.core.array_algos.transforms import shift | ||
from pandas.core.arrays.base import ExtensionArray, _extension_array_shared_docs | ||
from pandas.core.arrays._mixins import _T, NDArrayBackedExtensionArray | ||
from pandas.core.arrays.base import _extension_array_shared_docs | ||
from pandas.core.base import NoNewAttributesMixin, PandasObject, _shared_docs | ||
import pandas.core.common as com | ||
from pandas.core.construction import array, extract_array, sanitize_array | ||
|
@@ -199,7 +200,7 @@ def contains(cat, key, container): | |
return any(loc_ in container for loc_ in loc) | ||
|
||
|
||
class Categorical(ExtensionArray, PandasObject): | ||
class Categorical(NDArrayBackedExtensionArray, PandasObject): | ||
""" | ||
Represent a categorical variable in classic R / S-plus fashion. | ||
|
||
|
@@ -1238,7 +1239,7 @@ def shift(self, periods, fill_value=None): | |
|
||
def _validate_fill_value(self, fill_value): | ||
""" | ||
Convert a user-facing fill_value to a representation to use with our | ||
Convert a user-facing fill_value to a representation to use with our | ||
underlying ndarray, raising ValueError if this is not possible. | ||
|
||
Parameters | ||
|
@@ -1768,7 +1769,7 @@ def fillna(self, value=None, method=None, limit=None): | |
|
||
return self._constructor(codes, dtype=self.dtype, fastpath=True) | ||
|
||
def take(self, indexer, allow_fill: bool = False, fill_value=None): | ||
def take(self: _T, indexer, allow_fill: bool = False, fill_value=None) -> _T: | ||
""" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Probably want to restore this docstring. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. restored with TypeError->ValueError and whatsnew |
||
Take elements from the Categorical. | ||
|
||
|
@@ -1837,16 +1838,23 @@ def take(self, indexer, allow_fill: bool = False, fill_value=None): | |
Categories (2, object): [a, b] | ||
|
||
Specifying a fill value that's not in ``self.categories`` | ||
will raise a ``TypeError``. | ||
will raise a ``ValueError``. | ||
""" | ||
indexer = np.asarray(indexer, dtype=np.intp) | ||
return NDArrayBackedExtensionArray.take( | ||
self, indexer, allow_fill=allow_fill, fill_value=fill_value | ||
) | ||
|
||
if allow_fill: | ||
# convert user-provided `fill_value` to codes | ||
fill_value = self._validate_fill_value(fill_value) | ||
# ------------------------------------------------------------------ | ||
# NDArrayBackedExtensionArray compat | ||
|
||
codes = take(self._codes, indexer, allow_fill=allow_fill, fill_value=fill_value) | ||
return self._constructor(codes, dtype=self.dtype, fastpath=True) | ||
@property | ||
def _ndarray(self) -> np.ndarray: | ||
return self._codes | ||
|
||
def _from_backing_data(self, arr: np.ndarray) -> "Categorical": | ||
return self._constructor(arr, dtype=self.dtype, fastpath=True) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. see other comment re _constructor. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
in PandasObject _constructor is type(self), in Categorical, _constructor is Categorical (i.e type(self)). so I think it may worth investigating whether we can remove _constructor from Categorical and use type(self) here to be consistent with datetimelike._from_backing_data. can u add return types for _from_backing_data, both here and in datetimelike |
||
|
||
# ------------------------------------------------------------------ | ||
|
||
def take_nd(self, indexer, allow_fill: bool = False, fill_value=None): | ||
# GH#27745 deprecate alias that other EAs dont have | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -39,8 +39,9 @@ | |
from pandas.core.dtypes.missing import is_valid_nat_for_dtype, isna | ||
|
||
from pandas.core import missing, nanops, ops | ||
from pandas.core.algorithms import checked_add_with_arr, take, unique1d, value_counts | ||
from pandas.core.algorithms import checked_add_with_arr, unique1d, value_counts | ||
from pandas.core.array_algos.transforms import shift | ||
from pandas.core.arrays._mixins import _T, NDArrayBackedExtensionArray | ||
from pandas.core.arrays.base import ExtensionArray, ExtensionOpsMixin | ||
import pandas.core.common as com | ||
from pandas.core.construction import array, extract_array | ||
|
@@ -425,7 +426,9 @@ def _with_freq(self, freq): | |
return self | ||
|
||
|
||
class DatetimeLikeArrayMixin(ExtensionOpsMixin, AttributesMixin, ExtensionArray): | ||
class DatetimeLikeArrayMixin( | ||
ExtensionOpsMixin, AttributesMixin, NDArrayBackedExtensionArray | ||
): | ||
""" | ||
Shared Base/Mixin class for DatetimeArray, TimedeltaArray, PeriodArray | ||
|
||
|
@@ -437,6 +440,20 @@ class DatetimeLikeArrayMixin(ExtensionOpsMixin, AttributesMixin, ExtensionArray) | |
_generate_range | ||
""" | ||
|
||
# ------------------------------------------------------------------ | ||
# NDArrayBackedExtensionArray compat | ||
|
||
@property | ||
def _ndarray(self) -> np.ndarray: | ||
# NB: A bunch of Interval tests fail if we use ._data | ||
return self.asi8 | ||
|
||
def _from_backing_data(self: _T, arr: np.ndarray) -> _T: | ||
# Note: we do not retain `freq` | ||
return type(self)(arr, dtype=self.dtype) # type: ignore | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. unlike Frame and Series, EAs do not have the concept of _constructor. could this be useful? the exception is Categorical which uses _constructor. so maybe either extend to all EAs or can we remove from Categorical for consistency. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IIUC Categorical has _constructor because it subclasses PandasObject.
|
||
|
||
# ------------------------------------------------------------------ | ||
|
||
@property | ||
def ndim(self) -> int: | ||
return self._data.ndim | ||
|
@@ -665,16 +682,6 @@ def unique(self): | |
result = unique1d(self.asi8) | ||
return type(self)(result, dtype=self.dtype) | ||
|
||
def take(self, indices, allow_fill=False, fill_value=None): | ||
if allow_fill: | ||
fill_value = self._validate_fill_value(fill_value) | ||
|
||
new_values = take( | ||
self.asi8, indices, allow_fill=allow_fill, fill_value=fill_value | ||
) | ||
|
||
return type(self)(new_values, dtype=self.dtype) | ||
|
||
@classmethod | ||
def _concat_same_type(cls, to_concat, axis: int = 0): | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is typing
self
needed?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my understanding is that is how we indicate that the return type is "same type as self", but id defer to @simonjayhawkins on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. T is a typevar, i.e. can take on different types (could be subtypes of a type or union of types) so adding to self binds the typevar and the return type is the same as self.
This can sometimes cause problems in Mixins, but here the 'Mixin' is IMO an abstract base class.