Description
This issue is dedicated to discussing the large topic of "missing" data.
First, a bit on names. I think we can reasonably choose between NA, null, or missing as a general name for "missing" values. We'd use that to inform decisions on method names like DataFrame.isna() vs. DataFrame.isnull() vs. ...
Pandas favors NA, databases might favor null, Julia uses missing. I don't have a strong opinion here.
Some topics of discussion:
- Data types should be nullable
I think we'd like the introduction of missing data not to fundamentally change the dtype of a column.
This is not the case with pandas:
In [5]: df1 = pd.DataFrame({"A": ['a', 'b'], "B": [1, 2]})
In [6]: df2 = pd.DataFrame({"A": ['a', 'c'], "C": [3, 4]})
In [7]: df1.dtypes
Out[7]:
A object
B int64
dtype: object
In [8]: pd.merge(df1, df2, on="A", how="outer")
Out[8]:
A B C
0 a 1.0 3.0
1 b 2.0 NaN
2 c NaN 4.0
In [9]: _.dtypes
Out[9]:
A object
B float64
C float64
dtype: object
In pandas, for int-dtype data NaN is used as the missing value indicator. NaN is a float, and so the column is cast to float64 dtype.
Ideally Out[9] would preserve the int dtype for B and C. At this moment, I don't have a strong opinion on whether the dtype for B should be a plain int64, or something like a Union[int64, NA].
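To make the "dtype survives missing data" idea concrete, here is a minimal sketch of one possible representation: the integer payload plus a separate validity mask. MaskedIntColumn and its take method are hypothetical names invented for this illustration, not an existing or proposed API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MaskedIntColumn:
    """Hypothetical nullable int column: int values plus a validity mask."""

    values: List[int]   # payload; the entry is arbitrary where valid is False
    valid: List[bool]   # True where a value is present
    dtype: str = "int64"

    def take(self, indices: List[int]) -> "MaskedIntColumn":
        """Gather rows by position; -1 means "no matching row" (missing)."""
        values = [self.values[i] if i >= 0 else 0 for i in indices]
        valid = [self.valid[i] if i >= 0 else False for i in indices]
        return MaskedIntColumn(values, valid, self.dtype)

    def to_list(self) -> List[Optional[int]]:
        """Materialize the column, using None for missing slots."""
        return [v if ok else None for v, ok in zip(self.values, self.valid)]


# Column B of df1, then the outer join on A = ['a', 'b', 'c'],
# where row 'c' has no match in df1.
b = MaskedIntColumn([1, 2], [True, True])
b_joined = b.take([0, 1, -1])
print(b_joined.dtype, b_joined.to_list())
```

The payload stays integer throughout; only the mask records that the third row is missing, so no cast to float is needed.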
- Semantics in arithmetic and comparison operations
In general, missing values should propagate in arithmetic and comparison operations (using <NA> as a marker for a missing value).
>>> df1 = DataFrame({"A": [1, None, 3]})
>>> df1 + 1
A
0 2
1 <NA>
2 4
>>> df1 == 1
A
0 True
1 <NA>
2 False
There might be a few exceptions. For example, 1 ** NA might be 1 rather than NA, since the result doesn't depend on what value NA takes on.
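As a rough sketch of these propagation rules, the following defines a toy untyped NA scalar in plain Python. NAType is a hypothetical class written for this illustration; it only mirrors the behaviour described above, including the two power special cases that pandas documents for pd.NA (NA ** 0 and 1 ** NA).

```python
class NAType:
    """Hypothetical untyped missing-value scalar for illustration."""

    def __repr__(self):
        return "<NA>"

    def _propagate(self, other):
        # Any arithmetic or comparison involving NA yields NA.
        return self

    __add__ = __radd__ = __sub__ = __rsub__ = _propagate
    __mul__ = __rmul__ = _propagate
    __eq__ = __ne__ = __lt__ = __le__ = __gt__ = __ge__ = _propagate
    __hash__ = object.__hash__  # defining __eq__ would otherwise disable hashing

    def __pow__(self, other):
        # NA ** 0 is 1 whatever value NA stands for.
        if isinstance(other, NAType):
            return self
        return 1 if other == 0 else self

    def __rpow__(self, other):
        # 1 ** NA is 1 whatever value NA stands for.
        return 1 if other == 1 else self


NA = NAType()

column = [1, NA, 3]
added = [x + 1 for x in column]      # NA stays NA
compared = [x == 1 for x in column]  # comparison with NA is NA, not False
print(added, compared)
```

Note that in this sketch 0 ** NA and 2 ** NA still return NA, because their results do depend on the missing value.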
- Semantics in logical operations
For boolean logical operations (and, or, xor), libraries should implement three-valued (Kleene) logic. The pandas docs have a table.
The short version is that the result should be NA if it depends on whether the NA operand is True or False. For example, True | NA is True, since it doesn't matter whether that NA is "really" True or False.
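The rule above can be sketched as three small functions. This is only an illustration of Kleene logic on scalars: the function names are made up for this example, and None stands in for NA.

```python
NA = None  # stand-in for a missing boolean in this sketch


def kleene_or(a, b):
    # True wins regardless of the other operand.
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return False


def kleene_and(a, b):
    # False wins regardless of the other operand.
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return True


def kleene_xor(a, b):
    # xor always depends on both operands, so any NA gives NA.
    if a is NA or b is NA:
        return NA
    return a != b


print(kleene_or(True, NA), kleene_or(False, NA))
print(kleene_and(False, NA), kleene_and(True, NA))
```

True | NA and False & NA are decided by the known operand alone, while False | NA and True & NA stay NA because the answer would change with the missing value.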
- The need for a scalar NA?
Libraries might need to implement a scalar NA value, but I'm not sure. As a user, you would get this from indexing to get a scalar, or in an operation that produces an NA result.
>>> df = pd.DataFrame({"A": [None]})
>>> df.iloc[0, 0] # no comment on the indexing API
<NA>
What semantics should this scalar NA have? In particular, should it be typed? This is something we've struggled with in recent versions of pandas. There's a desire to preserve a property along the lines of the following:
(arr1 + arr2)[0].dtype == (arr1 + arr2[0]).dtype
Where the first value in the second array is NA. If you have a single NA without any dtype, you can't implement that property.
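To illustrate why the dtype matters, here is a toy sketch in which the NA scalar carries a dtype. Array, Scalar, TypedNA, and the two-dtype promotion rule in result_dtype are all hypothetical and exist only for this example; they are not pandas behaviour.

```python
from dataclasses import dataclass
from typing import List, Optional


def result_dtype(a: str, b: str) -> str:
    """Assumed promotion rule for this sketch: float64 wins over int64."""
    return "float64" if "float64" in (a, b) else "int64"


@dataclass(frozen=True)
class TypedNA:
    dtype: str  # the NA scalar remembers which dtype it came from


@dataclass(frozen=True)
class Scalar:
    value: float
    dtype: str


@dataclass
class Array:
    values: List[Optional[float]]  # None marks a missing slot
    dtype: str

    def __getitem__(self, i):
        v = self.values[i]
        return TypedNA(self.dtype) if v is None else Scalar(v, self.dtype)

    def __add__(self, other):
        if isinstance(other, Array):
            vals = [None if a is None or b is None else a + b
                    for a, b in zip(self.values, other.values)]
        elif isinstance(other, TypedNA):
            vals = [None] * len(self.values)
        elif isinstance(other, Scalar):
            vals = [None if a is None else a + other.value for a in self.values]
        else:
            return NotImplemented
        return Array(vals, result_dtype(self.dtype, other.dtype))


arr1 = Array([1, 2], "int64")
arr2 = Array([None, 0.5], "float64")  # first value is NA

lhs = (arr1 + arr2)[0].dtype   # dtype of the NA scalar taken from the sum
rhs = (arr1 + arr2[0]).dtype   # dtype of the array after adding the NA scalar
print(lhs, rhs)
```

With an untyped NA, arr2[0] would carry no information about float64, so arr1 + arr2[0] would have no reason to be anything other than int64 and the two sides would disagree.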
There's a long thread on this at pandas-dev/pandas#28095.