Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
This is an issue I came across while working on #40127.
Most Index objects can be constructed with None values:
pd.Index(['a', 'b', None], dtype='category')
pd.Index([1.0, 2.0, None], dtype='float64')
pd.Index(['2000-01', '2000-02', None], dtype='datetime64[ns]')
But (U)Int64Index cannot accept None values:
>>> pd.Index([1, 2, None], dtype='int64')
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
This seems pretty reasonable. However, inside of a MultiIndex, None values and the int dtype can actually coexist:
>>> pd.MultiIndex.from_arrays([[1, 2, None]]).levels[0]
Int64Index([1, 2], dtype='int64')
So we can construct a DataFrame like so:
>>> df = pd.DataFrame([10, 20, 30], index=pd.MultiIndex.from_arrays([[1, 2, None]]))
>>> df
      0
1    10
2    20
NaN  30
Indeed, the dtype of the first and only level is correct:
>>> df.index.levels[0]
Int64Index([1, 2], dtype='int64')
This works because of codes, which can encode None values as -1:
>>> df.index.codes
FrozenList([[0, 1, -1]])
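To illustrate why the level can stay int64 while the index still holds a missing entry, here is a minimal pure-Python sketch of the levels/codes idea. It is not pandas' actual implementation; the helper names `encode` and `decode` are hypothetical and exist only for this illustration.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode(values):
    """Split values into (levels, codes); missing values get the code -1."""
    # Levels hold only the distinct non-missing values, so they can keep
    # a pure integer type even when the input contains None.
    levels = sorted({v for v in values if not is_missing(v)})
    position = {v: i for i, v in enumerate(levels)}
    codes = [-1 if is_missing(v) else position[v] for v in values]
    return levels, codes


def decode(levels, codes):
    """Rebuild the original values; -1 maps back to None."""
    return [None if c == -1 else levels[c] for c in codes]


levels, codes = encode([1, 2, None])
print(levels)                 # [1, 2]
print(codes)                  # [0, 1, -1]
print(decode(levels, codes))  # [1, 2, None]
```

The missing value never appears in `levels`, only as the sentinel in `codes`, which mirrors what `df.index.levels[0]` and `df.index.codes` show above.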
More examples of None behavior in various scenarios:
>>> df = pd.DataFrame(np.zeros((3, 3)), columns=pd.MultiIndex.from_arrays([['a', 'b', 'c'], [1, 2, None]]))
     a    b    c
     1    2  NaN
0  0.0  0.0  0.0
1  0.0  0.0  0.0
2  0.0  0.0  0.0
>>> df.stack(0).columns
Int64Index([1, 2], dtype='int64')
>>> df.columns.droplevel(0)
Float64Index([1.0, 2.0, nan], dtype='float64')
>>> pd.Index([1, 2, None])
Index([1, 2, None], dtype='object')
I'm not sure what the best solution is to normalize the treatment of Nones inside indices. So I thought I would raise this issue and see what others think about this.