Description
Updated with to do list:
- Implement `pd.NA` scalar -> ENH: add NA scalar for missing value indicator, use in StringArray. #29597
- Basic BooleanArray -> ENH: add BooleanArray extension array #29555
- Use `pd.NA` in BooleanArray -> Use new NA scalar in BooleanArray #29961
- Implement Kleene logic in logical ops on BooleanArray -> ENH: Implement Kleene logic for BooleanArray #29842
- Update the behaviour of `any`/`all` reductions with `skipna=False` (API: any/all in context of boolean dtype with missing values #29686) -> API: BooleanArray any/all with NA logic #30062
- Use BooleanArray in comparison ops of StringArray -> StringArray comparisions return BooleanArray #30231
- Use `pd.NA` in IntegerArray -> API: Uses pd.NA in IntegerArray #29964
- Use BooleanArray as the return value for logical ops in IntegerArray -> API: Uses pd.NA in IntegerArray #29964
- Enable boolean indexing with BooleanArray (DISCUSS: boolean dtype with missing value support #28778) -> DOC/TST: Indexing with NA raises #30308
- Use BooleanArray as the return value for boolean `.str` methods. -> API: Return BoolArray for string ops when backed by StringArray #30239
- Implement `NA.__array_ufunc__` -> Implement NA.__array_ufunc__ #30245
- Base class for IntegerArray & BooleanArray -> REF: Implement BaseMaskedArray class for integer/boolean ExtensionArrays #30789
- Ensure everything is properly documented
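
To make the Kleene logic and `any`/`all` items above concrete, here is a minimal pure-Python sketch of three-valued logic. It is only an illustration of the semantics as I understand them, not the pandas implementation: the `NA` sentinel and the `kleene_*` helper names are hypothetical and exist only for this example.

```python
class _NAType:
    """Hypothetical stand-in for a missing-value scalar (not pandas' pd.NA)."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def kleene_and(a, b):
    # False dominates: the result is False even if the other operand is missing.
    if a is False or b is False:
        return False
    # Otherwise a missing operand makes the result unknown.
    if a is NA or b is NA:
        return NA
    return True


def kleene_or(a, b):
    # True dominates: the result is True even if the other operand is missing.
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return False


def kleene_any(values, skipna=True):
    # With skipna=True, missing values are dropped before reducing.
    if skipna:
        values = [v for v in values if v is not NA]
    result = False
    for v in values:
        result = kleene_or(result, v)
    return result


def kleene_all(values, skipna=True):
    if skipna:
        values = [v for v in values if v is not NA]
    result = True
    for v in values:
        result = kleene_and(result, v)
    return result


print(kleene_and(True, NA), kleene_or(True, NA))
print(kleene_any([False, NA], skipna=False), kleene_all([False, NA], skipna=False))
```

Under these assumptions, a reduction with `skipna=False` only returns a definite answer when the non-missing values already decide it; otherwise the result stays missing.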
Original issue:
Issue to discuss the implementation strategy for #28095.
Opening a new issue, as the other one already has a lot of discussion going in several directions, and I would propose to keep this one focused on the practical aspects of how to implement this (regardless of certain aspects of the NA proposal such as single NA vs dtype-specific NAs; for that I will post a summary of the discussion on #28095 tomorrow).
I would like to propose the following way forward:
On the short term (ideally for 1.0):
- Already implement and provide the `pd.NA` scalar, and recognize it in the appropriate places as a missing value (e.g. `pd.isna`). This way, it can already be used in external ExtensionArrays.
- Implement a `BooleanArray` with support for missing values and appropriate NA behaviour. To start, we can just use a numpy masked array approach (similar to the existing IntegerArray), not involving any pyarrow memory optimizations.
- Start using this BooleanArray as the boolean result of comparison operations for IntegerArray/StringArray (breaking change for nullable integers)
  - Other arrays will keep using the numpy bool; this means we have two "boolean" dtypes side by side with different behaviour, and which one you get depends on the original data type (potentially confusing for users)
- Start using `pd.NA` as the missing value indicator for Integer/String/BooleanArray (breaking change for nullable integers)
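
As a rough illustration of the masked approach mentioned above, the sketch below stores values and a parallel boolean mask, and shows a comparison that returns a masked boolean result instead of a plain bool sequence. It uses plain Python lists in place of NumPy arrays to stay dependency-free. The `MaskedValues` class, its methods, and the use of `None` as the missing marker are hypothetical and are not the pandas API.

```python
NA = None  # placeholder missing-value marker, for this sketch only


class MaskedValues:
    """Hypothetical container: raw values plus a parallel mask (True = missing)."""

    def __init__(self, data, mask):
        if len(data) != len(mask):
            raise ValueError("data and mask must have the same length")
        self.data = list(data)
        self.mask = list(mask)

    @classmethod
    def from_sequence(cls, values, fill):
        # Missing slots get an arbitrary fill value; the mask records that they are missing.
        mask = [v is NA for v in values]
        data = [fill if m else v for v, m in zip(values, mask)]
        return cls(data, mask)

    def __gt__(self, other):
        # Compare the raw data, then carry the mask over so missing stays missing.
        return MaskedValues([d > other for d in self.data], self.mask)

    def to_list(self):
        # Re-insert the missing marker wherever the mask is set.
        return [NA if m else d for d, m in zip(self.data, self.mask)]


ints = MaskedValues.from_sequence([1, NA, 3], fill=0)
result = ints > 1
print(result.to_list())
```

The point of the design is that the comparison never has to decide what a missing input means: the mask propagates unchanged, so the missing entry in the input is still missing in the boolean result.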
On the intermediate term (after 1.0):
- Investigate if it can be implemented optionally for other data types and "activated" to have users opt-in for existing dtypes (to be further thought out).
I think the main discussion point is if we are OK with such a breaking change for IntegerArray.
I would personally do this: IntegerArray was only introduced recently, is still regarded as experimental, and is the perfect use case for those changes. But it is certainly a clear backwards-incompatible, breaking change.
cc @pandas-dev/pandas-core