Description
In the new missing values support, and especially while implementing the BooleanArray (#29555), the question comes up: what should `any` and `all` do in the presence of missing values?
edit from Tom: Here's a proposed table of behavior
case | input | output |
---|---|---|
1. | all([True, NA], skipna=False) | NA |
2. | all([False, NA], skipna=False) | False |
3. | all([NA], skipna=False) | NA |
4. | all([], skipna=False) | True |
5. | any([True, NA], skipna=False) | True |
6. | any([False, NA], skipna=False) | NA |
7. | any([NA], skipna=False) | NA |
8. | any([], skipna=False) | False |

case | input | output |
---|---|---|
9. | all([True, NA], skipna=True) | True |
10. | all([False, NA], skipna=True) | False |
11. | all([NA], skipna=True) | True |
12. | all([], skipna=True) | True |
13. | any([True, NA], skipna=True) | True |
14. | any([False, NA], skipna=True) | False |
15. | any([NA], skipna=True) | False |
16. | any([], skipna=True) | False |
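
To make the proposed table concrete, here is a minimal pure-Python sketch that reproduces all 16 cases. The names `NA`, `kleene_any` and `kleene_all` are hypothetical helpers introduced only for illustration, and `None` stands in for the missing value; this is not how BooleanArray is (or has to be) implemented.

```python
NA = None  # hypothetical stand-in for the missing-value marker


def kleene_any(values, skipna=True):
    """OR-reduction; with skipna=False an NA counts as 'unknown'."""
    known = [v for v in values if v is not NA]
    if any(known):
        return True  # a known True decides the result regardless of NAs
    if not skipna and len(known) < len(values):
        return NA  # no True seen, but an NA could still be True
    return False


def kleene_all(values, skipna=True):
    """AND-reduction; with skipna=False an NA counts as 'unknown'."""
    known = [v for v in values if v is not NA]
    if not all(known):
        return False  # a known False decides the result regardless of NAs
    if not skipna and len(known) < len(values):
        return NA  # no False seen, but an NA could still be False
    return True
```

With these helpers, `kleene_all([True, NA], skipna=False)` gives NA (case 1) while `kleene_all([False, NA], skipna=False)` gives False (case 2), because a single known False already settles the answer.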
Some context:
Currently, if you have bools mixed with NaNs, you end up with an object dtype, and the behaviour of `any`/`all` with object dtype has all kinds of corner cases. @xhochy recently opened #27709 for this (I am opening a new issue because I want to focus here on the behaviour for the boolean dtype; the behaviour for object dtype might still deviate).
The documentation of `any` says (https://dev.pandas.io/docs/reference/api/pandas.Series.any.html):
Return whether any element is True, potentially over an axis.
Returns False unless there at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).
...
skipna : bool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
and similarly for `all` (https://dev.pandas.io/docs/reference/api/pandas.Series.all.html).
Default behaviour with skipna=True
In case of some NAs and some True/False values, I think the behaviour is clear: `any`/`all` are reductions, and in pandas we use `skipna=True` for reductions.
So you get something like this:
(I am still using `np.nan` here as the missing value, since the pd.NA PR is not yet merged / combined with the BooleanArray PR; but let's focus on the return value)
In [2]: pd.Series([True, False, np.nan]).any()
Out[2]: True
In [3]: pd.Series([True, False, np.nan]).all()
Out[3]: False
In [4]: pd.Series([True, True, np.nan]).all()
Out[4]: True
(although when interpreting NA as "unknown", it might look a bit strange to return True in the last case since the NA might still be True or False)
Behaviour for all-NA in case of skipna=True
This is a case that is described in the current docs: "If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column", and is indeed consistent with skipping all NAs -> any/all of empty set.
And then we follow numpy's behaviour (False for `any`, True for `all`):
In [8]: np.array([], dtype=bool).any()
Out[8]: False
In [9]: np.array([], dtype=bool).all()
Out[9]: True
(although I don't find this necessarily very intuitive, this seems more a consequence of the algorithm starting with a base "identity" value of False/True for any/all)
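
The "identity value" point can be illustrated with a plain fold, independent of numpy. This is only a sketch; `fold_any` and `fold_all` are hypothetical names, not pandas or numpy functions.

```python
from functools import reduce
import operator


def fold_any(values):
    # OR-fold that starts from False, the identity element of `or`
    return reduce(operator.or_, values, False)


def fold_all(values):
    # AND-fold that starts from True, the identity element of `and`
    return reduce(operator.and_, values, True)


print(fold_any([]))  # False: nothing changed the starting value
print(fold_all([]))  # True: nothing changed the starting value
```

With an empty input the fold never applies the operator, so the result is simply the starting value, which matches the numpy output above.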
Behaviour with skipna=False
Here comes the trickier part. Currently, with object dtype, we have some buggy behaviour (see #27709): the result depends on the order of the values and on which missing value (np.nan or None) is used.
With BooleanArray we won't have this problem (there is only a single NA + we don't need to rely on numpy's buggy object dtype behaviour). But I am not sure we should follow what is currently in the docs:
If skipna is False, then NA are treated as True, because these are not equal to zero.
This follows from numpy's behaviour with floats:
In [10]: np.array([0, np.nan]).any()
Out[10]: True
and while this might make sense in float context, I am not sure we should follow this behaviour and our docs and do:
>>> pd.Series([False, pd.NA], dtype="boolean").any(skipna=False)
True
I think this should rather give False or NA instead of True.
While for object dtype it might make sense to align the behaviour with float (as argued in #27709 (comment)), for a boolean dtype we can probably use the behaviour we defined for NA in logical operations (e.g. `False | NA = NA`, so in that case the above should give NA).
But are we OK with `any`/`all` not returning a boolean in this case? (Note: you only get this if someone explicitly sets `skipna=False`.)
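
As a rough sketch of what "use the behaviour we defined for NA in logical operations" would mean for a reduction, one can fold the pairwise operators over the values. Everything here (`NA` as `None`, `kleene_or`, `kleene_and`, `any_keep_na`, `all_keep_na`) is a hypothetical illustration, not the BooleanArray code.

```python
from functools import reduce

NA = None  # hypothetical stand-in for the missing-value marker


def kleene_or(a, b):
    # True wins over everything; otherwise any NA makes the result unknown
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return False


def kleene_and(a, b):
    # False wins over everything; otherwise any NA makes the result unknown
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return True


def any_keep_na(values):
    # skipna=False style reduction: fold the pairwise OR from its identity
    return reduce(kleene_or, values, False)


def all_keep_na(values):
    # skipna=False style reduction: fold the pairwise AND from its identity
    return reduce(kleene_and, values, True)
```

Under these assumptions `any_keep_na([False, NA])` is NA and `any_keep_na([NA, True])` is True, and the result does not depend on the order of the values, unlike the object-dtype behaviour described above.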