
API: any/all in context of boolean dtype with missing values #29686

Closed
@jorisvandenbossche


In the new missing values support, and especially while implementing the BooleanArray (#29555), the question comes up: what should any and all do in presence of missing values?

edit from Tom: Here's a proposed table of behavior

case  input                            output
1.    all([True, NA], skipna=False)    NA
2.    all([False, NA], skipna=False)   False
3.    all([NA], skipna=False)          NA
4.    all([], skipna=False)            True
5.    any([True, NA], skipna=False)    True
6.    any([False, NA], skipna=False)   NA
7.    any([NA], skipna=False)          NA
8.    any([], skipna=False)            False

case  input                            output
9.    all([True, NA], skipna=True)     True
10.   all([False, NA], skipna=True)    False
11.   all([NA], skipna=True)           True
12.   all([], skipna=True)             True
13.   any([True, NA], skipna=True)     True
14.   any([False, NA], skipna=True)    False
15.   any([NA], skipna=True)           False
16.   any([], skipna=True)             False
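
To make the proposed table easier to check, here is a minimal pure-Python sketch of the two reductions. It is not the BooleanArray implementation: `NA` is a hypothetical stand-in (plain `None`), and the names `kleene_any` / `kleene_all` are made up for illustration.

```python
from typing import Iterable, Optional

NA = None  # hypothetical stand-in for the missing-value marker, not pd.NA


def kleene_any(values: Iterable[Optional[bool]], skipna: bool = True) -> Optional[bool]:
    """Sketch of any() following the proposed table."""
    values = list(values)
    if any(v is True for v in values):
        return True  # one known True decides the result, whatever the NAs are
    if not skipna and any(v is NA for v in values):
        return NA  # no True seen, but an NA could still turn out to be True
    return False  # only False values (or nothing) left


def kleene_all(values: Iterable[Optional[bool]], skipna: bool = True) -> Optional[bool]:
    """Sketch of all() following the proposed table."""
    values = list(values)
    if any(v is False for v in values):
        return False  # one known False decides the result
    if not skipna and any(v is NA for v in values):
        return NA  # no False seen, but an NA could still turn out to be False
    return True  # only True values (or nothing) left


print(kleene_all([True, NA], skipna=False))   # case 1: None, i.e. NA
print(kleene_any([False, NA], skipna=True))   # case 14: False
```

With `skipna=True` the NA branch is never taken, so the result is the same as reducing over the non-missing values only.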

Some context:

Currently, if you have bools with NaNs, you end up with an object dtype, and the behaviour of any/all with object dtype has all kinds of corner cases. @xhochy recently opened #27709 for this (I am opening a new issue because I want to focus here on the behaviour in boolean dtype; the behaviour in object dtype might still deviate).

The documentation of any says (https://dev.pandas.io/docs/reference/api/pandas.Series.any.html)

Return whether any element is True, potentially over an axis.

Returns False unless there at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).

...

skipna : bool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

and similar for all (https://dev.pandas.io/docs/reference/api/pandas.Series.all.html).

Default behaviour with skipna=True

In the case of some NAs and some True/False values, I think the behaviour is clear: any/all are reductions, and in pandas we use skipna=True for reductions.

So you get something like this:
(I am still using np.nan here as missing value, since the pd.NA PR is not yet merged / combined with the BooleanArray PR; but let's focus on return value)

In [2]: pd.Series([True, False, np.nan]).any() 
Out[2]: True

In [3]: pd.Series([True, False, np.nan]).all()
Out[3]: False

In [4]: pd.Series([True, True, np.nan]).all() 
Out[4]: True

(although when interpreting NA as "unknown", it might look a bit strange to return True in the last case since the NA might still be True or False)

Behaviour for all-NA in case of skipna=True

This is a case that is described in the current docs: "If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column", and is indeed consistent with skipping all NAs -> any/all of empty set.
And then, we follow numpy's behaviour (False for any, True for all):

In [8]: np.array([], dtype=bool).any() 
Out[8]: False

In [9]: np.array([], dtype=bool).all()
Out[9]: True

(although I don't find this necessarily very intuitive; it seems to be more a consequence of the algorithm starting with a base "identity" value of False/True for any/all)
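
The "identity" argument above can be sketched with plain Python folds. This is only an illustration of the reasoning, not how numpy implements it; `fold_any` and `fold_all` are hypothetical helper names.

```python
from functools import reduce
import operator


def fold_any(values):
    # any() as a left fold of "or", starting from its identity element False
    return reduce(operator.or_, values, False)


def fold_all(values):
    # all() as a left fold of "and", starting from its identity element True
    return reduce(operator.and_, values, True)


print(fold_any([]), fold_all([]))  # False True

# Starting from the identity keeps reductions composable: splitting the
# input anywhere (even into an empty part) does not change the result.
left, right = [True, True], []
assert fold_all(left + right) == (fold_all(left) and fold_all(right))
assert fold_any(left + right) == (fold_any(left) or fold_any(right))
```

Any other choice for the empty case would break the last two assertions for some input.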

Behaviour with skipna=False

Here comes the trickier part. Currently, with object dtype, we have some buggy behaviour (see #27709): the result depends on the order of the values and on which missing value (np.nan or None) is used.

With BooleanArray we won't have this problem (there is only a single NA + we don't need to rely on numpy's buggy object dtype behaviour). But I am not sure we should follow what is currently in the docs:

If skipna is False, then NA are treated as True, because these are not equal to zero.

This follows from numpy's behaviour with floats:

In [10]: np.array([0, np.nan]).any()
Out[10]: True
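
The same effect can be reproduced without numpy, since it comes from how a float NaN is tested for truth: NaN is not equal to zero, so it counts as True. A small sketch using only built-ins:

```python
nan = float("nan")

print(nan != 0)          # True: NaN compares unequal to everything, including 0
print(bool(nan))         # True: any non-zero float is truthy
print(any([0.0, nan]))   # True: the NaN is the "True" element here
print(all([1.0, nan]))   # True: NaN does not make all() fail either
```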

While this might make sense in a float context, I am not sure we should follow this behaviour (and our docs) and do:

>>> pd.Series([False, pd.NA], dtype="boolean").any()
True

I think this should give False or NA rather than True.
While for object dtype it might make sense to align the behaviour with float (as argued in #27709 (comment)), for a boolean dtype we can probably use the behaviour we defined for NA in logical operations (e.g. False | NA = NA, so in that case the above should give NA).
But are we OK with any/all not returning a boolean in this case? (Note that you only get this if someone explicitly sets skipna=False.)
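
As a rough illustration of that last question, the sketch below folds three-valued (Kleene) binary operators over the values, which is one way to read "use the logical-operation behaviour for the reduction". `NA` is again a hypothetical stand-in (`None`), and `kleene_or` / `kleene_and` are made-up names, not pandas functions.

```python
from functools import reduce
from typing import Optional

NA = None  # hypothetical stand-in for the missing-value marker, not pd.NA


def kleene_or(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    if a is True or b is True:
        return True  # True | anything = True
    if a is NA or b is NA:
        return NA  # False | NA = NA
    return False


def kleene_and(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    if a is False or b is False:
        return False  # False & anything = False
    if a is NA or b is NA:
        return NA  # True & NA = NA
    return True


# skipna=False style reduction: fold the binary operator over all values
result = reduce(kleene_or, [False, NA], False)
print(result)  # None, i.e. NA

# The result is three-valued, so a caller cannot treat it as a plain bool
# and has to decide explicitly what an unknown answer means.
if result is True:
    print("at least one True")
elif result is False:
    print("no True values")
else:
    print("unknown")
```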

Labels

ExtensionArray: Extending pandas with custom dtypes or arrays.
Missing-data: np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Needs Discussion: Requires discussion from core team before further action
