ENH: add mask-aware implementation of factorize algos #30037

Closed
@jorisvandenbossche

Description

Now that we are starting to have mask-based dtypes/arrays (integer, boolean), we should also look into making our algos work with such masked arrays. One example where we could explore this is factorize / unique.

Currently, BooleanArray and IntegerArray need to convert their masked array into a single numpy array in which missing values are replaced by a specified "NA sentinel" that the algo can recognize. This happens through ExtensionArray._values_for_factorize, which returns a (numpy array, NA sentinel) tuple.
In practice this means that BooleanArray is converted to an integer array (with NA as -1), and IntegerArray is converted to a float array (with NA as NaN), so that the algos can handle them.
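To make the current sentinel approach concrete, here is a minimal sketch of the two conversions described above. It does not reproduce the actual pandas internals: the `data` / `mask` arrays and both helper functions are hypothetical stand-ins, and the `int8` dtype chosen for the boolean case is an assumption.

```python
import numpy as np

# Hypothetical stand-ins for the internals of a masked array:
# `data` holds the raw values, `mask` is True where the value is missing.
bool_data = np.array([True, False, True, False])
bool_mask = np.array([False, False, True, False])

int_data = np.array([1, 2, 1, 3], dtype="int64")
int_mask = np.array([False, True, False, False])


def boolean_values_for_factorize(data, mask):
    """Boolean case: cast to integer and mark missing slots with -1."""
    values = data.astype("int8")  # astype returns a copy, so `data` is untouched
    values[mask] = -1
    return values, -1


def integer_values_for_factorize(data, mask):
    """Integer case: cast to float and mark missing slots with NaN."""
    values = data.astype("float64")
    values[mask] = np.nan
    return values, np.nan


bool_values, bool_na = boolean_values_for_factorize(bool_data, bool_mask)
int_values, int_na = integer_values_for_factorize(int_data, int_mask)
```

Both helpers copy and cast the data before the algo ever sees it. The float conversion also cannot represent every int64 value exactly (for example, `float(2**53 + 1) == float(2**53)`), which is one reason a mask-aware path could be attractive.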

We should look into:

  • Can we adapt, or make a specific version of, the unique/factorize hashtable class so that it takes a mask instead of an NA sentinel?
  • We could then have a variant of ExtensionArray._values_for_factorize that returns (array, mask) instead of (array, NA sentinel).
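The idea in the two points above could be sketched roughly as follows. This is only an illustration in pure Python/numpy, not the pandas hashtable code: `factorize_with_mask` is a hypothetical name, a plain `dict` stands in for the hashtable class, and using -1 as the code for masked positions is an assumption that mirrors the usual factorize convention.

```python
import numpy as np


def factorize_with_mask(values, mask):
    """Hypothetical mask-aware factorize (a sketch, not the pandas implementation).

    Returns (codes, uniques). Masked positions get code -1 and never
    enter `uniques`, so no NA sentinel is needed in `values`.
    """
    table = {}  # value -> code; stand-in for the real hashtable
    codes = np.empty(len(values), dtype=np.intp)
    uniques = []
    for i, (val, is_na) in enumerate(zip(values.tolist(), mask.tolist())):
        if is_na:
            # Consult the mask instead of comparing against a sentinel.
            codes[i] = -1
            continue
        code = table.get(val)
        if code is None:
            # First time this value is seen: assign the next code.
            code = len(uniques)
            table[val] = code
            uniques.append(val)
        codes[i] = code
    # Uniques keep the original dtype; no cast to float is required.
    return codes, np.array(uniques, dtype=values.dtype)


values = np.array([3, 1, 3, 0, 1], dtype="int64")
mask = np.array([False, False, False, True, False])
codes, uniques = factorize_with_mask(values, mask)
```

Here the value stored under a masked position (the `0` at index 3) is ignored entirely, and the integer data stays in its original dtype throughout.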

Labels: Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Enhancement, ExtensionArray (extending pandas with custom dtypes or arrays), NA - MaskedArrays (related to pd.NA and nullable extension arrays)
