API: EA interface - strictness of _from_sequence

Copying part of the discussion out of https://github.com/pandas-dev/pandas/issues/32586 into a more specific issue.

Problem statement: currently, `_from_sequence` is not explicit about what it should accept as scalars. In practice it is very liberal, because it is also used under the hood to create an array from any list-like of objects in `pd.array(.., dtype)`.
However, in some cases we need a stricter version that *only* accepts actual scalars of the array (meaning the type of value you get from `array[0]`, or from `array.max()` if the array supports that kind of reduction). The lack of such a version causes issues like https://github.com/pandas-dev/pandas/issues/31108.

So, what should `_from_sequence` accept? Should it only be sequences that are unambiguously this dtype?

I think it will be useful to have a "strict" version that basically only accepts instances of `ExtensionDtype.type` or NA values. But we also still need a "liberal" method for the other use cases like `pd.array(.., dtype)`.

The strict version would be used when, for some reason, we go through object dtype (or a list of scalars, or something equivalent). One example is groupby, where we assemble a list of scalars from the reductions into a new column.
From a testing point of view, that would mean we can test that `EA._the_strict_method(np.asarray(EA, dtype=object), dtype=EA.dtype)` and `EA._the_strict_method(list(EA), dtype=EA.dtype)` round-trip faithfully.
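
To make the liberal/strict distinction and the round-trip test concrete, here is a minimal pure-Python sketch. Everything in it is hypothetical: `ToyIntArray` is not a pandas ExtensionArray, `NA` is just `None`, and `_from_scalars_strict` stands in for whatever the strict method ends up being called.

```python
from typing import Any, Iterable, List, Optional

NA = None  # assumption: stand-in for the array's missing-value sentinel


class ToyIntArray:
    """Hypothetical toy array, only meant to illustrate the two behaviours."""

    scalar_type = int  # plays the role of ExtensionDtype.type

    def __init__(self, values: List[Optional[int]]) -> None:
        self._values = list(values)

    @classmethod
    def _from_sequence(cls, scalars: Iterable[Any]) -> "ToyIntArray":
        # Liberal: coerce anything int() understands, e.g. "1" or 2.0.
        return cls([NA if s is NA else int(s) for s in scalars])

    @classmethod
    def _from_scalars_strict(cls, scalars: Iterable[Any]) -> "ToyIntArray":
        # Strict: accept only NA or genuine scalars of the array's type.
        values = []
        for s in scalars:
            # bool is a subclass of int in Python, so exclude it explicitly.
            is_scalar = isinstance(s, cls.scalar_type) and not isinstance(s, bool)
            if s is NA or is_scalar:
                values.append(s)
            else:
                raise TypeError(f"not a scalar of this array: {s!r}")
        return cls(values)

    def __iter__(self):
        return iter(self._values)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, ToyIntArray) and self._values == other._values


arr = ToyIntArray([1, NA, 3])

# Round trip through a list of scalars must reproduce the array exactly.
roundtripped = ToyIntArray._from_scalars_strict(list(arr))
assert roundtripped == arr

# The liberal path coerces strings; the strict path refuses them.
coerced = ToyIntArray._from_sequence(["1", "2"])
try:
    ToyIntArray._from_scalars_strict(["1", "2"])
    strict_rejected = False
except TypeError:
    strict_rejected = True
```

The point of the strict path is that a list of strings never silently becomes an integer array just because the values happen to be parseable.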

---

Assuming we agree that we need a strict version for certain use cases, I think there are two main options:

1) Keep `_from_sequence` as is, and add a new, stricter `_from_scalars` method (which in the base class can initially call `_from_sequence` for backwards compatibility). We can use `_from_scalars` in those cases where we need the strict version, and keep using `_from_sequence` elsewhere (e.g. in `pd.array(.., dtype=)`).

2) Update the expectation in our spec that `_from_sequence` should only accept a sequence of scalars of the array's type (so make `_from_sequence` the strict method), and use the `astype` machinery for construction. Basically, the current flexible `_from_sequence` would then be equivalent to casting an object ndarray (or generally any type) to your EA dtype.
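
As a rough illustration of the backwards-compatibility fallback in option 1, the sketch below uses hypothetical toy classes (`BaseArray`, `LegacyArray`, `UpdatedArray`), not the real pandas base class. The base `_from_scalars` simply delegates to `_from_sequence`, so arrays that have not been updated keep their liberal behaviour, while updated arrays override it with a strict check.

```python
class BaseArray:
    """Hypothetical base class; mirrors the shape of option 1 only."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None):
        raise NotImplementedError

    @classmethod
    def _from_scalars(cls, scalars, dtype=None):
        # Default: fall back to the liberal constructor, so existing
        # subclasses keep working until they opt in to strictness.
        return cls._from_sequence(scalars, dtype=dtype)


class LegacyArray(BaseArray):
    """Implements only the liberal constructor (not yet updated)."""

    @classmethod
    def _from_sequence(cls, scalars, dtype=None):
        return cls(int(s) for s in scalars)


class UpdatedArray(LegacyArray):
    """Overrides _from_scalars with a strict version."""

    @classmethod
    def _from_scalars(cls, scalars, dtype=None):
        scalars = list(scalars)
        for s in scalars:
            # Exclude bool, which is a subclass of int in Python.
            if not isinstance(s, int) or isinstance(s, bool):
                raise TypeError(f"not a scalar of this array: {s!r}")
        return cls(scalars)


# Not-yet-updated array: the strict entry point still coerces (fallback).
legacy = LegacyArray._from_scalars(["1", 2])

# Updated array: liberal construction still works via _from_sequence ...
liberal = UpdatedArray._from_sequence(["1", 2])

# ... but the strict entry point now rejects non-scalars.
try:
    UpdatedArray._from_scalars(["1", 2])
    rejected = False
except TypeError:
    rejected = True
```

Under option 2 the roles would be swapped: `_from_sequence` would carry the strict check, and the coercing behaviour would live in the `astype` machinery instead.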

Are there preferences? (or other options?)

From a backwards compatibility point of view, I think both are similar: in both cases you need to update a method (`_from_scalars` or `_from_sequence`), and in both cases the flexible version will initially still be used as a fallback until the update is done.

The second option of course requires an update to the astype machinery (#22384). That update doesn't exist today and is a much bigger topic to solve, but on the other hand it is something we need to do eventually anyway.

API: EA interface - strictness of _from_sequence #33254