API: any/all in context of boolean dtype with missing values #29686

While working on the new missing-value support, and especially while implementing BooleanArray (https://github.com/pandas-dev/pandas/pull/29555/), the question came up: what should `any` and `all` do in the presence of missing values?

*edit from Tom: Here's a proposed table of behavior*

| case  | input                            | output |
| ----- | -------------------------------- | -------|
| 1.    | `all([True, NA], skipna=False)`  | NA     |
| 2.    | `all([False, NA], skipna=False)` | False  |
| 3.    | `all([NA], skipna=False)`        | NA     |
| 4.    | `all([], skipna=False)`          | True   |
| 5.    | `any([True, NA], skipna=False)`  | True   |
| 6.    | `any([False, NA], skipna=False)` | NA     |
| 7.    | `any([NA], skipna=False)`        | NA     |
| 8.    | `any([], skipna=False)`          | False  |
 
| case  | input                           | output |
| ----- | ------------------------------- | -------|
| 9.    | `all([True, NA], skipna=True)`  | True   |
| 10.   | `all([False, NA], skipna=True)` | False  |
| 11.   | `all([NA], skipna=True)`        | True   |
| 12.   | `all([], skipna=True)`          | True   |
| 13.   | `any([True, NA], skipna=True)`  | True   |
| 14.   | `any([False, NA], skipna=True)` | False  |
| 15.   | `any([NA], skipna=True)`        | False  |
| 16.   | `any([], skipna=True)`          | False  |
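
To make the two tables easier to check, here is a minimal pure-Python sketch of the proposed semantics. The `NA` sentinel and the names `kleene_all` / `kleene_any` are hypothetical stand-ins for illustration only; this is not the BooleanArray implementation.

```python
# Hypothetical stand-in for the missing-value marker; not the real pd.NA.
NA = object()


def kleene_all(values, skipna=True):
    """Logical AND over values, treating NA as 'unknown'."""
    values = [v for v in values if not (skipna and v is NA)]
    if any(v is False for v in values):
        return False  # a single known False decides the result
    if any(v is NA for v in values):
        return NA  # no False seen, but an unknown could still be False
    return True  # only True values, or nothing left at all


def kleene_any(values, skipna=True):
    """Logical OR over values, treating NA as 'unknown'."""
    values = [v for v in values if not (skipna and v is NA)]
    if any(v is True for v in values):
        return True  # a single known True decides the result
    if any(v is NA for v in values):
        return NA  # no True seen, but an unknown could still be True
    return False  # only False values, or nothing left at all
```

For example, `kleene_any([False, NA], skipna=False)` returns `NA` (case 6), while `kleene_any([False, NA])` with the default `skipna=True` returns `False` (case 14).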


Some context:

Currently, bools combined with NaNs end up as object dtype, and the behaviour of `any`/`all` on object dtype has all kinds of corner cases. @xhochy recently opened https://github.com/pandas-dev/pandas/issues/27709 for this. I am opening a new issue because I want to focus here on the behaviour for the boolean dtype; the behaviour for object dtype might still deviate.

The documentation of `any` says (https://dev.pandas.io/docs/reference/api/pandas.Series.any.html):

> Return whether any element is True, potentially over an axis.
>
> Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).
>
> ...
>
> skipna : bool, default True
>     Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

and similar for `all` (https://dev.pandas.io/docs/reference/api/pandas.Series.all.html).

### Default behaviour with `skipna=True`

In the case of some NAs and some True/False values, I think the behaviour is clear: `any`/`all` are reductions, and in pandas reductions use `skipna=True` by default.

So you get something like this:
(I am still using `np.nan` here as the missing value, since the pd.NA PR is not yet merged / combined with the BooleanArray PR, so let's focus on the return value)

```
In [2]: pd.Series([True, False, np.nan]).any() 
Out[2]: True

In [3]: pd.Series([True, False, np.nan]).all()
Out[3]: False

In [4]: pd.Series([True, True, np.nan]).all() 
Out[4]: True
```

(although when interpreting NA as "unknown", it might look a bit strange to return True in the last case, since the NA could still be either True or False)

### Behaviour for all-NA in case of `skipna=True`

This case is described in the current docs: "If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column". That is indeed consistent with first skipping all NAs and then taking any/all of an empty set.
For the empty set, we follow numpy's behaviour (False for `any`, True for `all`):

```
In [8]: np.array([], dtype=bool).any() 
Out[8]: False

In [9]: np.array([], dtype=bool).all()
Out[9]: True
```

(although I don't find this necessarily very intuitive; it seems to be more a consequence of the algorithm starting from a base "identity" value of False/True for any/all)
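
The "identity value" argument can be illustrated with a plain fold. This is only a sketch of the idea, not how numpy implements its reductions, and `fold_any` / `fold_all` are hypothetical names:

```python
from functools import reduce
import operator


def fold_any(values):
    # Start from False, the identity of OR (False | x == x).
    return reduce(operator.or_, values, False)


def fold_all(values):
    # Start from True, the identity of AND (True & x == x).
    return reduce(operator.and_, values, True)


print(fold_any([]))  # False: nothing was folded in, so the identity is returned
print(fold_all([]))  # True
```

One consequence of this choice is that splitting a list never changes the answer: `fold_all(a + b)` equals `fold_all(a) and fold_all(b)` even when `a` or `b` is empty.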

### Behaviour with `skipna=False`

Here comes the trickier part. Currently, with object dtype, we have some buggy behaviour (see https://github.com/pandas-dev/pandas/issues/27709): the result depends on the order of the values and on which missing value (np.nan or None) is used.

With BooleanArray we won't have this problem (there is only a single NA, and we don't need to rely on numpy's buggy object dtype behaviour). But I am not sure we should follow what is currently in the docs:

> If skipna is False, then NA are treated as True, because these are not equal to zero.

This follows from numpy's behaviour with floats:

```
In [10]: np.array([0, np.nan]).any()
Out[10]: True
```

and while this might make sense in a float context, I am not sure we should follow this behaviour and our docs and do:

```
>>> pd.Series([False, pd.NA], dtype="boolean").any(skipna=False)
True
```

I think this should give False or NA rather than True.
While for object dtype it might make sense to align the behaviour with float (as argued in https://github.com/pandas-dev/pandas/issues/27709#issuecomment-517703540), for a boolean dtype we can probably use the behaviour we defined for NA in logical operations (e.g. `False | NA = NA`, so in that case the example above should give NA).
But are we OK with `any`/`all` not returning a boolean in this case? (Note: you only get this if someone explicitly sets `skipna=False`.)
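
To show how the `skipna=False` results would follow from the pairwise logical operations, here is a small sketch that folds Kleene-style `or` / `and` over the values. It uses `None` as a hypothetical stand-in for NA, and all function names are made up for illustration:

```python
from functools import reduce

NA = None  # hypothetical stand-in for the missing-value marker


def kleene_or(a, b):
    if a is True or b is True:
        return True  # True | anything == True
    if a is NA or b is NA:
        return NA  # False | NA == NA
    return False


def kleene_and(a, b):
    if a is False or b is False:
        return False  # False & anything == False
    if a is NA or b is NA:
        return NA  # True & NA == NA
    return True


def any_no_skip(values):
    # Equivalent to v1 | v2 | ... with False as the starting value.
    return reduce(kleene_or, values, False)


def all_no_skip(values):
    # Equivalent to v1 & v2 & ... with True as the starting value.
    return reduce(kleene_and, values, True)
```

Under these assumptions, `any_no_skip([False, NA])` gives `NA` and `any_no_skip([True, NA])` gives `True`, so the reduction only returns a non-boolean when the missing value could actually change the answer.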




