Description
In #29964 and #29961 (NA in IntegerArray and BooleanArray), the question comes up how to handle `pd.NA`'s in the conversion to numpy arrays. Such conversion occurs mainly in `__array__` (for `np.(as)array(..)`) and `.astype()`. For example:
```python
In [3]: arr = pd.array([1, 2, pd.NA], dtype="Int64")

In [4]: np.asarray(arr)
Out[4]: array([1, 2, None/pd.NA/..?], dtype=object)

In [5]: arr.astype(float)
Out[5]: array([ 1.,  2., nan])   # <--- allow automatic NA to NaN conversion?
```
Questions that come up here:
- By default, when converting to object dtype, what "NA value" should be used? Before, this was `NaN` or `None`; now it could logically be `pd.NA`.

  A possible reason to choose `None` instead of `pd.NA` is that third-party code that needs a numpy array will typically not be able to handle `pd.NA`, while `None` is much more common. On the other hand, there is also still time for such third-party code to adapt. And it will probably be good to keep `list(arr)` (iteration/getitem) and `np.array(arr, dtype=object)` consistent (see the sketch after this list).

- When converting to a float dtype, are we fine to automatically convert `pd.NA` to `np.nan`? Or do we think the user should explicitly opt in to this?
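To make the trade-off concrete, here is a small sketch contrasting the two options (illustrative only: the object-dtype output shown for `np.asarray` is one of the candidates under discussion, not settled behaviour):

```python
import numpy as np
import pandas as pd

arr = pd.array([1, 2, pd.NA], dtype="Int64")

# Iteration / getitem yield pd.NA for the missing value ...
list(arr)                        # [1, 2, <NA>]

# ... so using pd.NA in the object-dtype conversion keeps both consistent:
np.asarray(arr)                  # e.g. array([1, 2, <NA>], dtype=object)

# whereas None is friendlier to third-party code that does not know about pd.NA:
np.array([v if v is not pd.NA else None for v in arr], dtype=object)
                                 # array([1, 2, None], dtype=object)
```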
We will probably want to add a `to_numpy` method to those Integer/BooleanArray to be able to make those choices explicit, e.g. with the following signature:

```python
def to_numpy(self, dtype=object, na_value=...):
    ...
```
where you can explicitly say which value to use for the NAs in the final numpy array (and `Series.to_numpy` can then forward such a keyword). That way, a user can do `arr.to_numpy(dtype=object, na_value=None)` to get a numpy array with `None` instead of `pd.NA`, or `arr.to_numpy(dtype=float, na_value=np.nan)` to get a float array with NaNs.
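As a rough illustration of what such a method could do internally, here is a standalone sketch (the free function, the `_no_default` sentinel and the chosen defaults are assumptions for the example, not the actual implementation):

```python
import numpy as np
import pandas as pd

_no_default = object()  # sentinel, since None and np.nan are themselves valid na_value choices


def to_numpy(arr, dtype=object, na_value=_no_default):
    """Convert a nullable (Integer/Boolean) array to numpy, filling NAs with na_value."""
    if na_value is _no_default:
        # hypothetical defaults: pd.NA for object dtype, np.nan otherwise
        na_value = pd.NA if np.dtype(dtype) == object else np.nan
    mask = arr.isna()                       # boolean mask of the missing positions
    result = np.asarray(arr, dtype=object)  # go through object so any placeholder fits
    result[mask] = na_value
    return result.astype(dtype)


arr = pd.array([1, 2, pd.NA], dtype="Int64")
to_numpy(arr)                               # object array ending in pd.NA
to_numpy(arr, dtype=object, na_value=None)  # object array ending in None
to_numpy(arr, dtype=float)                  # array([ 1.,  2., nan])
```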
But even if we have that function (which I think we should), the above questions about the defaults still need to be answered (e.g. for `__array__` we cannot have such an `na_value` keyword, so we need to make a default choice).
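For instance, a default could be baked into `__array__` by delegating to such a `to_numpy`. A toy sketch (the class, its `_data`/`_mask` attributes and the chosen default are hypothetical, only meant to show where the default decision ends up):

```python
import numpy as np
import pandas as pd


class DemoMaskedArray:
    """Toy stand-in for IntegerArray/BooleanArray to illustrate the default choice."""

    def __init__(self, values, mask):
        self._data = np.asarray(values)            # the underlying values
        self._mask = np.asarray(mask, dtype=bool)  # True where the value is missing

    def to_numpy(self, dtype=object, na_value=pd.NA):
        result = self._data.astype(dtype)
        result[self._mask] = na_value
        return result

    def __array__(self, dtype=None):
        # np.(as)array(...) ends up here; there is no way to pass na_value,
        # so one default (here: object dtype with pd.NA) has to be baked in.
        return self.to_numpy(dtype=object, na_value=pd.NA)


demo = DemoMaskedArray([1, 2, 0], [False, False, True])
np.asarray(demo)   # array([1, 2, <NA>], dtype=object)
```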