Missing Data

This issue is dedicated to discussing the large topic of "missing" data.

First, a bit on names. I think we can reasonably choose between `NA`, `null`, or `missing` as a general name for "missing" values. We'd use that to inform decisions on method names like `DataFrame.isna()` vs. `DataFrame.isnull()` vs. ...
Pandas favors `NA`, databases might favor `null`, Julia uses `missing`. I don't have a strong opinion here.

Some topics of discussion:

1. **data types should be nullable**

I think introducing missing data should not fundamentally change the dtype of a column.
This is not the case with pandas:

```python
In [5]: df1 = pd.DataFrame({"A": ['a', 'b'], "B": [1, 2]})

In [6]: df2 = pd.DataFrame({"A": ['a', 'c'], "C": [3, 4]})

In [7]: df1.dtypes
Out[7]:
A    object
B     int64
dtype: object

In [8]: pd.merge(df1, df2, on="A", how="outer")
Out[8]:
   A    B    C
0  a  1.0  3.0
1  b  2.0  NaN
2  c  NaN  4.0

In [9]: _.dtypes
Out[9]:
A     object
B    float64
C    float64
dtype: object
```

In pandas, for int-dtype data `NaN` is used as the missing value indicator. `NaN` is a float, and so the column is cast to float64 dtype.

Ideally `Out[9]` would preserve the int dtype for `B` and `C`. At this moment, I don't have a strong opinion on whether the dtype for `B` should be a plain `int64`, or something like a `Union[int64, NA]`.
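As a rough sketch of how a column could stay integer-typed after missing values are introduced, the snippet below pairs an int64 buffer with a separate boolean mask. The `NullableInt64` class and its `take` method are hypothetical names made up for this illustration and are not an existing library API; a `-1` index is assumed to mean "no matching row", as in an outer join.

```python
from array import array
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NullableInt64:
    """Hypothetical nullable integer column: values plus a missing-value mask."""

    values: array      # typecode "q": signed 64-bit integers
    mask: List[bool]   # True where the value is missing

    def take(self, indices: List[int]) -> "NullableInt64":
        """Gather rows by position; an index of -1 introduces a missing value."""
        # Missing slots get a placeholder 0 in the buffer; the mask hides it.
        values = array("q", (0 if i < 0 else self.values[i] for i in indices))
        mask = [i < 0 or self.mask[i] for i in indices]
        return NullableInt64(values, mask)

    def to_list(self) -> List[Optional[int]]:
        return [None if m else v for v, m in zip(self.values, self.mask)]


b = NullableInt64(array("q", [1, 2]), [False, False])

# Outer-join style alignment: rows "a" and "b" exist on the left, "c" does not.
b_joined = b.take([0, 1, -1])

print(b_joined.values.typecode)  # q
print(b_joined.to_list())        # [1, 2, None]
```

The storage type stays int64 throughout; only the mask records that the third row is missing. Whether such a column should report its dtype as plain `int64` or as something like `Union[int64, NA]` is the open question above.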

2. **Semantics in arithmetic and comparison operations**

In general, missing values should propagate in arithmetic and comparison operations (using `<NA>` as a marker for a missing value).

```python
>>> df1 = DataFrame({"A": [1, None, 3]})
>>> df1 + 1
      A
0     2
1  <NA>
2     4

>>> df1 == 1
       A
0   True
1   <NA>
2  False
```

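There might be a few exceptions. For example `NA ** 0` might be 1 rather than `NA`, since the result is the same no matter what value `NA` takes on. (The same holds for `1 ** NA`, but not for `0 ** NA`, whose result depends on the exponent.)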
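To make the propagation rule concrete, here is a minimal sketch of an untyped scalar sentinel in plain Python. `NAType` and `NA` are illustrative names for this sketch only; the two power special cases (`NA ** 0` and `1 ** NA`) are assumed here because their result does not depend on the missing value.

```python
class NAType:
    """Illustrative missing-value sentinel that propagates through operators."""

    _instance = None

    def __new__(cls):
        # Singleton, so that `x is NA` is a reliable check.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "<NA>"

    def __bool__(self):
        # The truth value of a missing value is unknown.
        raise TypeError("boolean value of NA is ambiguous")

    def _propagate(self, other):
        return self

    # Arithmetic and comparisons return NA regardless of the other operand.
    __add__ = __radd__ = __sub__ = __rsub__ = __mul__ = __rmul__ = _propagate
    __eq__ = __ne__ = __lt__ = __le__ = __gt__ = __ge__ = _propagate
    __hash__ = object.__hash__

    def __pow__(self, other):
        # x ** 0 == 1 for every x, so the missing base does not matter.
        if other is not self and other == 0:
            return 1
        return self

    def __rpow__(self, other):
        # 1 ** x == 1 for every x, so the missing exponent does not matter.
        if other is not self and other == 1:
            return 1
        return self


NA = NAType()

column = [1, NA, 3]
print([x + 1 for x in column])   # [2, <NA>, 4]
print([x == 1 for x in column])  # [True, <NA>, False]
print(NA ** 0, 1 ** NA, 2 ** NA)  # 1 1 <NA>
```

Raising in `__bool__` is one possible design choice: it stops `if x == 1:` from silently treating a missing comparison result as either `True` or `False`.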

3. **Semantics in logical operations**

For boolean logical operations (and, or, xor), libraries should implement three-valued or [Kleene logic][kleene]. The pandas docs have a [table][table].
The short version is that the result should be `NA` if it depends on whether the `NA` operand is True or False. For example, `True | NA` is `True`, since it doesn't matter whether that `NA` is "really" True or False.
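The rule can be sketched in a few lines of plain Python. This is only an illustration: `None` stands in for `NA`, and the `kleene_or`, `kleene_and`, and `kleene_xor` helpers are hypothetical names, not part of any library.

```python
from typing import Optional

Tri = Optional[bool]  # None stands in for NA in this sketch


def kleene_or(a: Tri, b: Tri) -> Tri:
    # A single True decides the result, whatever the other operand is.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def kleene_and(a: Tri, b: Tri) -> Tri:
    # A single False decides the result, whatever the other operand is.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def kleene_xor(a: Tri, b: Tri) -> Tri:
    # xor always depends on both operands, so any NA makes the result NA.
    if a is None or b is None:
        return None
    return a != b


print(kleene_or(True, None))    # True
print(kleene_or(False, None))   # None
print(kleene_and(False, None))  # False
print(kleene_and(True, None))   # None
print(kleene_xor(True, None))   # None
```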

4. **The need for a scalar NA?**

Libraries might need to implement a scalar `NA` value, but I'm not sure. As a user, you would get this from indexing to get a scalar, or in an operation that produces an NA result.

```python
>>> df = pd.DataFrame({"A": [None]})
>>> df.iloc[0, 0]  # no comment on the indexing API
<NA>
```

What semantics should this scalar NA have? In particular, *should it be typed*? This is something we've struggled with in recent versions of pandas. There's a desire to preserve a property along the lines of the following:


```python
(arr1 + arr2)[0].dtype == (arr1 + arr2[0]).dtype
```

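Here the first value in the second array (`arr2[0]`) is `NA`. If you have a single `NA` without any dtype, you can't implement that property, because `arr2[0]` no longer carries the dtype information needed to decide the result dtype.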
There's a long thread on this at https://github.com/pandas-dev/pandas/issues/28095.
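To illustrate why a dtype on the scalar matters, here is a toy sketch in which indexing returns a typed scalar, so a missing element still remembers its column's dtype. The `Array` and `Scalar` classes and the `promote` table are hypothetical and exist only for this example; they are not how pandas or any other library implements this.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

# Toy promotion order for this sketch only: int64 < float64.
_RANK = {"int64": 0, "float64": 1}


def promote(a: str, b: str) -> str:
    return a if _RANK[a] >= _RANK[b] else b


@dataclass(frozen=True)
class Scalar:
    value: Optional[float]  # None marks a missing value
    dtype: str              # the scalar keeps the dtype of its source array


@dataclass
class Array:
    data: List[Optional[float]]
    dtype: str

    def __getitem__(self, i: int) -> Scalar:
        return Scalar(self.data[i], self.dtype)

    def __add__(self, other: Union["Array", Scalar]) -> "Array":
        if isinstance(other, Scalar):
            right = [other.value] * len(self.data)  # broadcast the scalar
        else:
            right = other.data
        out = [
            None if a is None or b is None else a + b
            for a, b in zip(self.data, right)
        ]
        return Array(out, promote(self.dtype, other.dtype))


arr1 = Array([1, 2], "int64")
arr2 = Array([None, 0.5], "float64")

elementwise = (arr1 + arr2)[0]  # missing scalar, dtype float64
broadcast = arr1 + arr2[0]      # all-missing array, dtype float64

print(elementwise.dtype == broadcast.dtype)  # True
```

With an untyped `NA`, `arr2[0]` would carry no dtype, and `arr1 + arr2[0]` would have to fall back on `arr1`'s `int64`, which breaks the equality.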

[kleene]: https://en.wikipedia.org/wiki/Three-valued_logic#Kleene_and_Priest_logics
[table]: https://pandas.pydata.org/pandas-docs/dev/user_guide/boolean.html#kleene-logical-operations

