Description
In #29964 and #29961 (NA in IntegerArray and BooleanArray), the question comes up how to handle `pd.NA`'s in the conversion to numpy arrays. Such conversion occurs mainly in `__array__` (for `np.(as)array(..)`) and `.astype()`. For example:
```python
In [3]: arr = pd.array([1, 2, pd.NA], dtype="Int64")

In [4]: np.asarray(arr)
Out[4]: array([1, 2, None/pd.NA/..?], dtype=object)

In [5]: arr.astype(float)
Out[5]: array([ 1.,  2., nan])   # <--- allow automatic NA to NaN conversion?
```
Questions that come up here:
- By default, when converting to object dtype, what "NA value" should be used? Before, this was `NaN` or `None`; now it could logically be `pd.NA`.

  A possible reason to choose `None` instead of `pd.NA` is that third-party code that needs a numpy array will typically not be able to handle `pd.NA`, while `None` is much more common. On the other hand, there is also still time for such third-party code to adapt. And it will probably be good to keep `list(arr)` (iteration/getitem) and `np.array(arr, dtype=object)` consistent (see the sketch after this list).

- When converting to a float dtype, are we fine to automatically convert `pd.NA` to `np.nan`? Or do we think the user should explicitly opt in to this?
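To make the trade-off concrete, here is a small sketch contrasting the two options (illustrative only: the object-dtype output shown for `np.asarray` is one of the candidates under discussion, not settled behaviour):

```python
import numpy as np
import pandas as pd

arr = pd.array([1, 2, pd.NA], dtype="Int64")

# Iteration / getitem yield pd.NA for the missing value ...
list(arr)                        # [1, 2, <NA>]

# ... so using pd.NA in the object-dtype conversion keeps both consistent:
np.asarray(arr)                  # e.g. array([1, 2, <NA>], dtype=object)

# whereas None is friendlier to third-party code that does not know about pd.NA:
np.array([v if v is not pd.NA else None for v in arr], dtype=object)
                                 # array([1, 2, None], dtype=object)
```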
We will probably want to add a `to_numpy` method to those Integer/BooleanArray to be able to make those choices explicit, e.g. with the following signature:

```python
def to_numpy(self, dtype=object, na_value=...):
    ...
```
where you can explicitly say which value to use for the NAs in the final numpy array (and `Series.to_numpy` can then forward such a keyword). That way, a user can do `arr.to_numpy(dtype=object, na_value=None)` to get a numpy array with `None` instead of `pd.NA`, or `arr.to_numpy(dtype=float, na_value=np.nan)` to get a float array with NaNs.
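As a rough illustration of what such a method could do internally, here is a standalone sketch (the free function, the `_no_default` sentinel and the chosen defaults are assumptions for the example, not the actual implementation):

```python
import numpy as np
import pandas as pd

_no_default = object()  # sentinel, since None and np.nan are themselves valid na_value choices


def to_numpy(arr, dtype=object, na_value=_no_default):
    """Convert a nullable (Integer/Boolean) array to numpy, filling NAs with na_value."""
    if na_value is _no_default:
        # hypothetical defaults: pd.NA for object dtype, np.nan otherwise
        na_value = pd.NA if np.dtype(dtype) == object else np.nan
    mask = arr.isna()                       # boolean mask of the missing positions
    result = np.asarray(arr, dtype=object)  # go through object so any placeholder fits
    result[mask] = na_value
    return result.astype(dtype)


arr = pd.array([1, 2, pd.NA], dtype="Int64")
to_numpy(arr)                               # object array ending in pd.NA
to_numpy(arr, dtype=object, na_value=None)  # object array ending in None
to_numpy(arr, dtype=float)                  # array([ 1.,  2., nan])
```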
But even if we have that function (which I think we should), the above questions about the defaults still need to be answered (e.g. for `__array__` we cannot have such an `na_value` keyword, so we need to make a default choice).
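For instance, a default could be baked into `__array__` by delegating to such a `to_numpy`. A toy sketch (the class, its `_data`/`_mask` attributes and the chosen default are hypothetical, only meant to show where the default decision ends up):

```python
import numpy as np
import pandas as pd


class DemoMaskedArray:
    """Toy stand-in for IntegerArray/BooleanArray to illustrate the default choice."""

    def __init__(self, values, mask):
        self._data = np.asarray(values)            # the underlying values
        self._mask = np.asarray(mask, dtype=bool)  # True where the value is missing

    def to_numpy(self, dtype=object, na_value=pd.NA):
        result = self._data.astype(dtype)
        result[self._mask] = na_value
        return result

    def __array__(self, dtype=None):
        # np.(as)array(...) ends up here; there is no way to pass na_value,
        # so one default (here: object dtype with pd.NA) has to be baked in.
        return self.to_numpy(dtype=object, na_value=pd.NA)


demo = DemoMaskedArray([1, 2, 0], [False, False, True])
np.asarray(demo)   # array([1, 2, <NA>], dtype=object)
```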