Public Data Followups #23995

Closed
@TomAugspurger

Leftovers from #23623:

  1. Signature for .to_numpy(): @jorisvandenbossche proposed copy=True, which I think is good. Beyond that, we may want to control the "fidelity" of the conversion. Should Series[datetime64[ns, tz]].to_numpy() be an ndarray of Timestamp objects or an ndarray of datetime64[ns] normalized to UTC (by default, and should we allow that to be controlled)? Can we hope for a set of keywords appropriate for all subtypes, or do we need to allow kwargs? Perhaps to_numpy(copy=True, dtype=None) will suffice?

  2. Make .array always an ExtensionArray (via @shoyer). This gives pandas a bit more freedom going forward, since the type of .array will be stable if / when we flip over to Arrow arrays by default. We'll just swap out the data backing the ExtensionArray. A generic "NumpyBackedExtensionArray" is pretty easy to write (I had one in cyberpandas). My main concern here is that it makes the statement ".array is the actual data stored in the Series / Index" falseish, but that's OK.

  3. Revert the breaking changes to Series.values for period and interval dtype data (cc @jschendel)? I think we should do this.

In [3]: sper = pd.Series(pd.period_range('2000', periods=4))

In [4]: sper.values  # proposed: object ndarray (on master this is currently the PeriodArray)
Out[4]:
array([Period('2000-01-01', 'D'), Period('2000-01-02', 'D'),
       Period('2000-01-03', 'D'), Period('2000-01-04', 'D')], dtype=object)

In [5]: sper.array
Out[5]:
<PeriodArray>
['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04']
Length: 4, dtype: period[D]

In terms of LOC, it's a simple change:

@@ -1984,6 +1984,16 @@ class ExtensionBlock(NonConsolidatableMixIn, Block):
         return blocks, mask


+class ObjectValuesExtensionBlock(ExtensionBlock):
+    """Block for Interval / Period data.
+
+    Only needed for backwards compatibility to ensure that
+    Series[T].values is an ndarray of objects.
+    """
+    def external_values(self, dtype=None):
+        return self.values.astype(object)
+
+
 class NumericBlock(Block):
     __slots__ = ()
     is_numeric = True
@@ -3004,6 +3014,8 @@ def get_block_type(values, dtype=None):

     if is_categorical(values):
         cls = CategoricalBlock
+    elif is_interval_dtype(dtype) or is_period_dtype(dtype):
+        cls = ObjectValuesExtensionBlock

There are a couple of other places (like Series._ndarray_values) that assume "extension dtype means .values is an ExtensionArray", which I've surfaced on my DatetimeArray branch. We'll need to update those to use .array anyway.
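
To make the "fidelity" question in item 1 concrete, here is a minimal sketch of the two candidate results for tz-aware data. It assumes a pandas release in which Series.to_numpy accepts a dtype keyword (the signature was still open when this issue was written, so treat the exact keywords and defaults as an assumption). The last lines also illustrate item 2: that .array can be an ExtensionArray even for plain NumPy-backed data.

```python
import numpy as np
import pandas as pd

# A tz-aware Series, i.e. Series[datetime64[ns, tz]] from item 1.
ser = pd.Series(pd.date_range("2000-01-01", periods=2, tz="US/Central"))

# High fidelity: an object ndarray of tz-aware Timestamp objects.
as_objects = ser.to_numpy(dtype=object)

# Lower fidelity: datetime64[ns] values normalized to UTC; the tz is dropped.
as_utc = ser.to_numpy(dtype="datetime64[ns]")

# Item 2: .array wraps the data in an ExtensionArray, including for a
# Series backed by an ordinary NumPy array.
numeric = pd.Series([1, 2, 3])
is_extension_array = isinstance(numeric.array, pd.api.extensions.ExtensionArray)
```

Passing dtype explicitly sidesteps the question of which result should be the default, which is the part this issue leaves open.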


    Labels

    API Design; Dtype Conversions (Unexpected or buggy dtype conversions); ExtensionArray (Extending pandas with custom dtypes or arrays)
