Description
Copying part of the discussion out of #32586 into a more specific issue.
Problem statement: currently, _from_sequence is not very explicit about what it should accept as scalars. In practice, this means that it is mostly very liberal, as it is also used under the hood when creating an array from any list-like of objects in pd.array(.., dtype).
However, in some cases we need a stricter version that only accepts actual scalars of the array (meaning the type of values you get from array[0], or from array.max() if the array supports that kind of reduction). The lack of such a version causes issues like #31108.
So, what should _from_sequence accept? Should it only be sequences that are unambiguously of this dtype?
I think it will be useful to have a "strict" version that basically only accepts instances of ExtensionDtype.type or NA values. But we also still need a "liberal" method for the other use cases like pd.array(.., dtype).
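To make the distinction concrete, here is a rough sketch of the two behaviours in plain Python. Everything in it is a stand-in and not pandas API: Decimal plays the role of ExtensionDtype.type, None plays the role of the NA value, and the two function names are made up for illustration.

```python
from decimal import Decimal

NA = None  # assumption: stand-in for a missing-value sentinel such as pd.NA


def from_sequence_liberal(values):
    """Liberal construction: coerce anything that Decimal() can parse."""
    return [NA if v is NA else Decimal(str(v)) for v in values]


def from_scalars_strict(values, scalar_type=Decimal):
    """Strict construction: accept only scalar_type instances or NA."""
    out = []
    for v in values:
        if v is NA or isinstance(v, scalar_type):
            out.append(v)
        else:
            raise TypeError(
                f"expected {scalar_type.__name__} or NA, got {type(v).__name__}"
            )
    return out
```

With these definitions, from_sequence_liberal(["1.5", 2, None]) coerces the string and the int, while from_scalars_strict(["1.5"]) raises a TypeError because a str is not a scalar of the array.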
The strict version would be used when, for some reason, we go through object dtype (or a list of scalars, or something equivalent). For example in groupby, where we assemble a list of scalars from the reductions into a new column.
From a testing point of view, that would mean we can test that EA._the_strict_method(np.asarray(EA, dtype=object), dtype=EA.dtype) and EA._the_strict_method(list(EA), dtype=EA.dtype) can roundtrip faithfully.
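As a sketch of what such a roundtrip check could look like, the toy below uses a hypothetical FractionArray class in place of a real ExtensionArray, with _from_scalars as a placeholder name for the strict method. It only illustrates the list(EA) case and leaves out the dtype argument.

```python
from fractions import Fraction


class FractionArray:
    """Hypothetical toy array, a stand-in for an ExtensionArray subclass."""

    def __init__(self, data):
        self._data = list(data)

    @classmethod
    def _from_scalars(cls, scalars):
        # Placeholder name for the strict constructor discussed above.
        scalars = list(scalars)
        for s in scalars:
            if s is not None and not isinstance(s, Fraction):
                raise TypeError(f"not a scalar of this array: {s!r}")
        return cls(scalars)

    def __iter__(self):
        return iter(self._data)

    def __eq__(self, other):
        return isinstance(other, FractionArray) and self._data == other._data


def roundtrips(arr):
    """Return True if rebuilding from the array's own scalars is lossless."""
    return type(arr)._from_scalars(list(arr)) == arr
```

A base extension test could then assert roundtrips(arr) for each array fixture, since iterating an array should only ever yield its own scalars or NA.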
Assuming we agree that we need a strict version for certain use cases, I think there are two main options:
- Keep _from_sequence as is, and add a new _from_scalars method that is more strict (in the base class it can initially call _from_sequence for backwards compatibility). We can use _from_scalars in those cases where we need the strict version, and keep using _from_sequence elsewhere (e.g. in pd.array(.., dtype=)).
- Update the expectation in our spec that _from_sequence should only accept a sequence of scalars of the array's type (so make _from_sequence the strict method), and use the astype machinery for construction. Basically, the current flexible _from_sequence would then be equivalent to casting an object ndarray (or generally any type) to your EA dtype.
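For the first option, the backwards-compatible fallback could look roughly like the sketch below. The classes are hypothetical toys and not the real ExtensionArray hierarchy; the point is only that an array which has not yet implemented the strict method keeps working through the liberal one.

```python
class BaseArray:
    """Hypothetical base class, standing in for ExtensionArray."""

    def __init__(self, data):
        self._data = list(data)

    @classmethod
    def _from_sequence(cls, scalars):
        raise NotImplementedError

    @classmethod
    def _from_scalars(cls, scalars):
        # Default: fall back to the liberal constructor so that existing
        # subclasses keep working until they opt in to strictness.
        return cls._from_sequence(scalars)


class LegacyIntArray(BaseArray):
    """Only implements the liberal constructor."""

    @classmethod
    def _from_sequence(cls, scalars):
        return cls(int(s) for s in scalars)


class StrictIntArray(LegacyIntArray):
    """Opts in to a strict constructor on top of the liberal one."""

    @classmethod
    def _from_scalars(cls, scalars):
        scalars = list(scalars)
        for s in scalars:
            # bool is a subclass of int, so exclude it explicitly.
            if not isinstance(s, int) or isinstance(s, bool):
                raise TypeError(f"not an int scalar: {s!r}")
        return cls(scalars)
```

Here LegacyIntArray._from_scalars(["1", "2"]) still succeeds through the fallback, while StrictIntArray._from_scalars(["1"]) raises a TypeError and StrictIntArray._from_sequence(["1"]) keeps the liberal behaviour.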
Are there preferences? (or other options?)
From a backwards compatibility point of view, I think both are similar: in both cases you need to update a method (_from_scalars or _from_sequence), and in both cases the flexible version will initially still be used as a fallback until the update is done.
The second option of course requires an update to the astype machinery (#22384), which doesn't exist today. On the other hand, that is something we need to do eventually anyway, although it is a much bigger topic to solve.