Missing Data #9

Open
@TomAugspurger

This issue is dedicated to discussing the large topic of "missing" data.

First, a bit on names. I think we can reasonably choose between NA, null, or missing as a general name for "missing" values. We'd use that to inform decisions on method names like DataFrame.isna() vs. DataFrame.isnull() vs. ...
Pandas favors NA, databases might favor null, Julia uses missing. I don't have a strong opinion here.

Some topics of discussion:

  1. data types should be nullable

I think we'd like the introduction of missing data not to fundamentally change the dtype of a column.
This is not the case with pandas:

In [5]: df1 = pd.DataFrame({"A": ['a', 'b'], "B": [1, 2]})

In [6]: df2 = pd.DataFrame({"A": ['a', 'c'], "C": [3, 4]})

In [7]: df1.dtypes
Out[7]:
A    object
B     int64
dtype: object

In [8]: pd.merge(df1, df2, on="A", how="outer")
Out[8]:
   A    B    C
0  a  1.0  3.0
1  b  2.0  NaN
2  c  NaN  4.0

In [9]: _.dtypes
Out[9]:
A     object
B    float64
C    float64
dtype: object

In pandas, NaN is used as the missing value indicator for int-dtype data. NaN is a float, so the column is cast to float64 dtype.

Ideally Out[9] would preserve the int dtype for B and C. At this moment, I don't have a strong opinion on whether the dtype for B should be a plain int64, or something like a Union[int64, NA].
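For comparison, pandas also ships an opt-in nullable integer extension dtype, "Int64" (capital I), which keeps a mask alongside the values. A minimal sketch of the same merge, assuming a pandas version with nullable integer support (1.0 or later); it illustrates the dtype-preserving behaviour described above and is not a proposal for how other libraries should implement it:

```python
import pandas as pd

# Same frames as above, but B and C use the nullable "Int64" extension dtype
# instead of the default NumPy-backed int64.
df1 = pd.DataFrame({"A": ["a", "b"], "B": pd.array([1, 2], dtype="Int64")})
df2 = pd.DataFrame({"A": ["a", "c"], "C": pd.array([3, 4], dtype="Int64")})

merged = pd.merge(df1, df2, on="A", how="outer")

# The outer join introduces one missing value in each of B and C,
# yet neither column is cast to float64.
print(merged)
print(merged.dtypes)
```

Here the missing entries show up as <NA> and the integer dtype survives the merge.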

  2. Semantics in arithmetic and comparison operations

In general, missing values should propagate in arithmetic and comparison operations (using <NA> as a marker for a missing value).

>>> df1 = DataFrame({"A": [1, None, 3]})
>>> df1 + 1
      A
0     2
1  <NA>
2     4

>>> df1 == 1
       A
0   True
1   <NA>
2  False

There might be a few exceptions. For example 1 ** NA might be 1 rather than NA, since the result is 1 no matter what value NA takes on. (0 ** NA does not qualify: the result is 1 if the missing value is 0 and 0 if it is positive.)
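The propagation rule can be sketched in plain Python. Everything here is hypothetical and for illustration only: the NA sentinel, the na_binop helper, and the choice to treat 1 ** NA and NA ** 0 as the exceptions (which mirrors how pandas' pd.NA behaves, but is an assumption rather than a settled rule):

```python
import operator


class _NAType:
    """Hypothetical untyped missing-value sentinel."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def na_binop(op, left, right):
    """Apply a binary operator to two scalars, propagating NA."""
    if left is NA or right is NA:
        if op is operator.pow:
            # Exceptions: the result is 1 whatever the missing value is.
            if left is not NA and left == 1:
                return 1
            if right is not NA and right == 0:
                return 1
        return NA
    return op(left, right)


column = [1, NA, 3]
added = [na_binop(operator.add, x, 1) for x in column]
equal = [na_binop(operator.eq, x, 1) for x in column]
print(added)  # [2, <NA>, 4]
print(equal)  # [True, <NA>, False]
```

A real library would apply the same rule per element using a mask rather than a Python-level loop.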

  3. Semantics in logical operations

For boolean logical operations (and, or, xor), libraries should implement three-valued (Kleene) logic. The pandas docs have a table.
The short version is that the result should be NA if it depends on whether the NA operand is True or False. For example, True | NA is True, since it doesn't matter whether that NA is "really" True or False.
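A minimal sketch of those three-valued truth tables in plain Python, using None as a stand-in for NA. The function names are hypothetical and only meant to make the rule concrete:

```python
def kleene_or(a, b):
    """Three-valued OR: True wins regardless of a missing operand."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def kleene_and(a, b):
    """Three-valued AND: False wins regardless of a missing operand."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def kleene_xor(a, b):
    """Three-valued XOR: always depends on both operands."""
    if a is None or b is None:
        return None
    return a != b


print(kleene_or(True, None))    # True
print(kleene_or(False, None))   # None
print(kleene_and(False, None))  # False
print(kleene_and(True, None))   # None
print(kleene_xor(True, None))   # None
```

Note that xor never has a short-circuiting operand, so any NA input yields NA.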

  4. The need for a scalar NA?

Libraries might need to implement a scalar NA value, but I'm not sure. As a user, you would get this from indexing to get a scalar, or in an operation that produces an NA result.

>>> df = pd.DataFrame({"A": [None]})
>>> df.iloc[0, 0]  # no comment on the indexing API
<NA>

What semantics should this scalar NA have? In particular, should it be typed? This is something we've struggled with in recent versions of pandas. There's a desire to preserve a property along the lines of the following:

(arr1 + arr2)[0].dtype == (arr1 + arr2[0]).dtype

Here the first value in the second array is NA. If you have a single NA without any dtype, you can't implement that property.
There's a long thread on this at pandas-dev/pandas#28095.
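To make the property concrete, here is a toy sketch in which every scalar, including a missing one, carries the dtype of the column it came from. The Scalar and Column classes and the result_dtype promotion rule are all hypothetical, invented for this illustration; they are not an API of pandas or any other library:

```python
from dataclasses import dataclass


def result_dtype(a, b):
    """Toy promotion rule (assumption): float64 wins over int64."""
    return "float64" if "float64" in (a, b) else "int64"


@dataclass(frozen=True)
class Scalar:
    value: object  # None marks a missing value
    dtype: str     # the dtype travels with the scalar, even when missing


@dataclass(frozen=True)
class Column:
    values: tuple
    dtype: str

    def __getitem__(self, i):
        return Scalar(self.values[i], self.dtype)

    def __add__(self, other):
        if isinstance(other, Scalar):
            other_values = (other.value,) * len(self.values)
        else:
            other_values = other.values
        summed = tuple(
            None if x is None or y is None else x + y
            for x, y in zip(self.values, other_values)
        )
        return Column(summed, result_dtype(self.dtype, other.dtype))


arr1 = Column((1, 2), "int64")
arr2 = Column((None, 0.5), "float64")  # first value is missing

lhs = (arr1 + arr2)[0].dtype
rhs = (arr1 + arr2[0]).dtype
print(lhs, rhs)
```

Because arr2[0] remembers that it came from a float64 column, both sides agree. With a single untyped NA, arr1 + NA would have no second dtype to promote against, so the right-hand side could not be made to match in general.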