API: EA interface - strictness of _from_sequence #33254

Closed
@jorisvandenbossche

Description

Copying part of the discussion out of #32586 into a more specific issue.

Problem statement: currently, _from_sequence is not very explicit about what it should accept as scalars. In practice, this means that it is mostly very liberal, as it is also used under the hood when creating an array from any list-like of objects in pd.array(.., dtype).
However, in some cases we need a stricter version that only accepts actual scalars of the array (meaning the type of values you get from array[0], or from array.max() if the array supports that kind of reduction). The lack of such a strict version causes issues like #31108.

So, what should _from_sequence accept? Should it only be sequences that are unambiguously this dtype?

I think it will be useful to have a "strict" version that basically only accepts instances of ExtensionDtype.type or NA values. But we also still need a "liberal" method for the other use cases like pd.array(.., dtype).
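To make the "strict" idea concrete, here is a minimal sketch in plain Python, not pandas code. The helper name `strict_scalars` is hypothetical, `None` stands in for NA values, and `decimal.Decimal` plays the role of `ExtensionDtype.type`.

```python
from decimal import Decimal


def strict_scalars(scalars, scalar_type):
    """Return the scalars as a list, rejecting anything that is not
    an instance of scalar_type or the NA stand-in (None here).

    Hypothetical helper for illustration; not part of pandas.
    """
    checked = []
    for position, value in enumerate(scalars):
        if value is None or isinstance(value, scalar_type):
            checked.append(value)
        else:
            raise TypeError(
                f"element {position} has type {type(value).__name__}, "
                f"expected {scalar_type.__name__} or NA"
            )
    return checked


# Actual scalars and NA pass through unchanged.
accepted = strict_scalars([Decimal("1.5"), None], Decimal)

# A string that a liberal constructor might parse is rejected here.
try:
    strict_scalars(["1.5"], Decimal)
    rejected = False
except TypeError:
    rejected = True
```

A liberal constructor would typically coerce `"1.5"` into a `Decimal`; the strict check refuses to guess and leaves that conversion to an explicit cast.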

The strict version would be used when, for some reason, we go through object dtype (or a list of scalars, or something equivalent). For example in groupby, where we assemble a list of scalars from the reductions into a new column.
From a testing point of view, that would mean we can test that EA._the_strict_method(np.asarray(EA, dtype=object), dtype=EA.dtype) and EA._the_strict_method(list(EA), dtype=EA.dtype) can roundtrip faithfully.
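The roundtrip expectation could be sketched roughly as follows. This uses a toy array class rather than a real ExtensionArray: `ToyDecimalArray` and `assert_roundtrips` are hypothetical names, `_the_strict_method` is the placeholder name from the text above, and `None` stands in for NA. Only the `list(EA)` variant is shown; the object-ndarray variant would presumably follow the same pattern.

```python
from decimal import Decimal


class ToyDecimalArray:
    """Toy stand-in for an ExtensionArray holding Decimal scalars."""

    def __init__(self, values):
        self._values = list(values)

    @classmethod
    def _the_strict_method(cls, scalars, dtype=None):
        # dtype is accepted only to mirror the signature in the text;
        # this toy class has a single dtype and ignores it.
        scalars = list(scalars)
        for value in scalars:
            if value is not None and not isinstance(value, Decimal):
                raise TypeError(f"unexpected scalar: {value!r}")
        return cls(scalars)

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)

    def __eq__(self, other):
        if not isinstance(other, ToyDecimalArray):
            return NotImplemented
        return self._values == other._values


def assert_roundtrips(arr):
    """Check that rebuilding from the array's own scalars is faithful."""
    rebuilt = type(arr)._the_strict_method(list(arr))
    assert rebuilt == arr
    return rebuilt


original = ToyDecimalArray([Decimal("1.0"), None, Decimal("2.5")])
rebuilt = assert_roundtrips(original)
```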


Assuming we agree that we need a strict version for certain use cases, I think there are two main options:

  1. Keep _from_sequence as is, and add a new, stricter _from_scalars method (which in the base class can initially call _from_sequence for backwards compatibility). We can use _from_scalars in those cases where we need the strict version, and keep using _from_sequence elsewhere (e.g. in pd.array(.., dtype=)).

  2. Update the expectation in our spec that _from_sequence should only accept a sequence of scalars of the array's type (so make _from_sequence the strict method), and use the astype machinery for construction. Basically, the current flexible _from_sequence would then be equivalent to casting an object ndarray (or generally any type) to your EA dtype.
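As a rough illustration of option 1, the sketch below uses toy classes in plain Python rather than the real ExtensionArray base class. The class names are hypothetical and the signatures are assumptions. The base class's `_from_scalars` falls back to the liberal `_from_sequence`, so an array that has not been updated keeps working, while an updated array overrides it with a strict check.

```python
from decimal import Decimal


class ToyBaseArray:
    """Toy stand-in for the ExtensionArray base class."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None):
        raise NotImplementedError

    @classmethod
    def _from_scalars(cls, scalars, dtype=None):
        # Backwards-compatible default: defer to the liberal constructor.
        return cls._from_sequence(scalars, dtype=dtype)


class LegacyDecimalArray(ToyBaseArray):
    """Array that only implements the liberal constructor."""

    @classmethod
    def _from_sequence(cls, scalars, dtype=None):
        # Liberal: coerce anything Decimal() can parse; None stands in for NA.
        return cls(None if s is None else Decimal(s) for s in scalars)


class StrictDecimalArray(LegacyDecimalArray):
    """Array that has opted in to the strict constructor."""

    @classmethod
    def _from_scalars(cls, scalars, dtype=None):
        scalars = list(scalars)
        for s in scalars:
            if s is not None and not isinstance(s, Decimal):
                raise TypeError(f"not a Decimal scalar: {s!r}")
        return cls(scalars)


# Not yet updated: the strict entry point still coerces via the fallback.
legacy = LegacyDecimalArray._from_scalars(["1.5", None])

# Updated: the strict entry point rejects strings, the liberal one still coerces.
liberal = StrictDecimalArray._from_sequence(["1.5"])
try:
    StrictDecimalArray._from_scalars(["1.5"])
    strict_rejected = False
except TypeError:
    strict_rejected = True
```

Callers that assemble a list of scalars (such as the groupby case mentioned earlier) would call `_from_scalars`, while `pd.array(.., dtype=)`-style construction would keep calling `_from_sequence`.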

Are there preferences? (or other options?)

From a backwards compatibility point of view, I think both are similar: in both cases you need to update a method (_from_scalars or _from_sequence), and in both cases the flexible version will initially still be used as a fallback until the update is done.

The second option of course requires an update to the astype machinery (#22384) that doesn't exist today. On the other hand, that update is something we need to do eventually anyway, although it is a much bigger topic to solve.

Metadata

Metadata

Assignees

No one assigned

Labels: API Design; Constructors (Series/DataFrame/Index/pd.array Constructors); ExtensionArray (Extending pandas with custom dtypes or arrays.)