API/DOC: Deprecate and Advise against having np.nan in Categoricals #10748

Closed
@TomAugspurger

This came out of work on #10729

In the documentation, we mention that

There are two ways a np.nan can be represented in categorical data: either the value is not available (“missing value”) or np.nan is a valid category.

In the first case, NaN is not in .categories, and in the second case it is. I think we should only recommend the first.

The option of NaNs in the categories makes the code in #10729 less pleasant than it would be otherwise. I don't think we should error if NaNs are included, just advise against it in the docs. Perhaps a deprecation, but I worry that I'm missing some obvious reason why NaNs were allowed in .categories.

@JanSchulz do you remember the initial reason for allowing either representation?

Some bad things that come out of NaN in .categories:

  • Can't rely on a value of nan mapping to a code of -1:

    In [2]: s = pd.Categorical(['a', 'b', 'a', np.nan], categories=['a', 'b', np.nan])

    In [3]: s
    Out[3]:
    [a, b, a, NaN]
    Categories (3, object): [a, b, NaN]

    In [4]: s.categories
    Out[4]: Index(['a', 'b', nan], dtype='object')

    In [5]: s.codes
    Out[5]: array([0, 1, 0, 2], dtype=int8)

  • potentially have to upcast the index type or mix strings and floats (nan) in the .categories Index.
  • extra code if you want to generically handle Categoricals that may or may not have NaN in categories.
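
For contrast, here is a minimal sketch of the first representation, the one this issue proposes recommending: NaN appears among the values but is left out of .categories. It assumes a pandas version in which a value outside the categories is stored with the sentinel code -1, which is the mapping the first bullet refers to. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# NaN is passed as a value but deliberately left out of `categories`.
cat = pd.Categorical(["a", "b", "a", np.nan], categories=["a", "b"])

codes = cat.codes            # missing entries receive the sentinel code -1
missing_mask = codes == -1   # marks the same positions that pd.isna(cat) reports

print(list(cat.categories))
print(codes.tolist())
print(missing_mask.tolist())
```

With this representation the .categories Index holds only strings, and code that handles missing values can test for -1 without first checking whether NaN is itself a category.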
