Skip to content

Address polymorphic return value for unique? #212

Closed
@rgommers

Description

@rgommers

unique was added very early on in gh-25. The review discussions have focused on

There is another issue that was not discussed: unique is the only function whose return type is polymorphic. It can return an array, or a length-2, 3 or 4 tuple of arrays:

unique(x, /, *, return_counts=False, return_index=False, return_inverse=False)

API doc: https://data-apis.org/array-api/latest/API_specification/set_functions.html#unique-x-return-counts-false-return-index-false-return-inverse-false

This issue is painful for, e.g., libraries that have to support it in their JIT compiler. For svd, which previously was polymorphic too, we decided that it was better to introduce svdvals to get rid of the polymorphism. There's even a design principle that says as much in https://data-apis.org/array-api/latest/extensions/linear_algebra_functions.html: "In general, interfaces should avoid polymorphic return values (e.g., returning an array or a namedtuple, dependent on, e.g., an optional keyword argument)."

Given that there are 3 boolean keywords, it's not feasible to split unique into separate functions (there'd be 8 of them). An alternative may be:

  • return a custom object with four named fields: (unique, indices, inverse, counts).
  • no tuple unpacking
  • if data is returned, its type is array
  • for fields where the boolean keyword was False, accessing the field is undefined behavior

Another question that may be relevant is: how often are the boolean keywords each used? The data at https://raw.githubusercontent.com/data-apis/python-record-api/master/data/typing/numpy.py may help there:

@overload
def unique(ar: numpy.ndarray, return_inverse: bool):
    """
    usage.scipy: 14
    usage.skimage: 5
    usage.sklearn: 125
    usage.statsmodels: 18
    usage.xarray: 3
    """
    ...


@overload
def unique(ar: numpy.ndarray):
    """
    usage.dask: 7
    usage.matplotlib: 6
    usage.orange3: 16
    usage.scipy: 39
    usage.seaborn: 5
    usage.skimage: 48
    usage.sklearn: 385
    usage.statsmodels: 45
    usage.xarray: 10
    """
    ...


@overload
def unique(ar: numpy.ndarray, return_inverse: bool, return_counts: bool):
    """
    usage.skimage: 3
    usage.sklearn: 2
    usage.statsmodels: 1
    """
    ...


@overload
def unique(ar: numpy.ndarray, return_counts: bool):
    """
    usage.orange3: 3
    usage.scipy: 5
    usage.skimage: 8
    usage.sklearn: 1
    usage.statsmodels: 1
    """
    ...


@overload
def unique(ar: numpy.ndarray, return_index: bool):
    """
    usage.scipy: 5
    usage.skimage: 4
    usage.sklearn: 1
    usage.statsmodels: 1
    """
    ...


@overload
def unique(ar: List[int], return_counts: bool):
    """
    usage.orange3: 1
    """
    ...


@overload
def unique(ar: xarray.core.dataarray.DataArray):
    """
    usage.xarray: 2
    """
    ...


@overload
def unique(ar: List[numpy.int32]):
    """
    usage.statsmodels: 1
    """
    ...


@overload
def unique(ar: List[Literal["b", "a"]]):
    """
    usage.sklearn: 2
    usage.statsmodels: 1
    """
    ...


@overload
def unique(ar: numpy.ndarray, return_index: bool, return_inverse: bool):
    """
    usage.matplotlib: 1
    usage.sklearn: 6
    usage.statsmodels: 3
    """
    ...


@overload
def unique(ar: List[Literal["4"]], return_inverse: bool):
    """
    usage.statsmodels: 1
    """
    ...


@overload
def unique(
    ar: Union[
        pandas.core.series.Series,
        numpy.ndarray,
        pandas.core.arrays.categorical.Categorical,
        List[str],
    ],
    return_inverse: bool = ...,
    return_index: bool = ...,
):
    """
    usage.pandas: 22
    """
    ...


@overload
def unique(ar: List[float]):
    """
    usage.scipy: 2
    usage.sklearn: 3
    """
    ...


@overload
def unique(ar: numpy.ndarray, axis: int):
    """
    usage.scipy: 1
    """
    ...


@overload
def unique(ar: pandas.core.series.Series):
    """
    usage.dask: 3
    usage.geopandas: 1
    usage.prophet: 4
    usage.seaborn: 4
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: numpy.ma.core.MaskedArray):
    """
    usage.matplotlib: 1
    """
    ...


@overload
def unique(ar: List[int]):
    """
    usage.seaborn: 2
    usage.sklearn: 26
    """
    ...


@overload
def unique(ar: List[numpy.int64]):
    """
    usage.dask: 4
    """
    ...


@overload
def unique(ar: List[numpy.float64]):
    """
    usage.dask: 5
    usage.sklearn: 2
    """
    ...


@overload
def unique(ar: List[numpy.complex128]):
    """
    usage.dask: 3
    """
    ...


@overload
def unique(ar: List[numpy.bool_]):
    """
    usage.dask: 3
    """
    ...


@overload
def unique(ar: List[numpy.float32]):
    """
    usage.dask: 2
    """
    ...


@overload
def unique(
    ar: numpy.ndarray, return_index: bool, return_inverse: bool, return_counts: bool
):
    """
    usage.dask: 2
    """
    ...


@overload
def unique(ar: numpy.memmap):
    """
    usage.sklearn: 2
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "beer", "pizza", "the"]]):
    """
    usage.sklearn: 2
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "beer", "burger", "pizza", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "beer", "burger", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "coke", "burger", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["burger", "coke", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "celeri", "salad", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "water", "sparkling", "salad", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "celeri", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["water", "salad", "tomato", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "water", "salad", "tomato", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["three", "two", "one"]]):
    """
    usage.sklearn: 3
    """
    ...


@overload
def unique(ar: List[Union[float, int]], return_inverse: bool):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["b", "c", "a"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["3", "2", "1"]]):
    """
    usage.sklearn: 2
    """
    ...


def unique(
    ar: object,
    return_index: bool = ...,
    return_inverse: bool = ...,
    return_counts: bool = ...,
):
    """
    usage.dask: 29
    usage.geopandas: 1
    usage.matplotlib: 8
    usage.orange3: 20
    usage.pandas: 22
    usage.prophet: 4
    usage.scipy: 66
    usage.seaborn: 11
    usage.skimage: 68
    usage.sklearn: 574
    usage.statsmodels: 72
    usage.xarray: 15

Conclusion: a large majority of usage is unique(x), without any keywords. return_inverse is fairly often used, return_index almost never (checked by searching Pandas, sklearn et al.).

So another alternative count be to include two separate functions:

  1. unique(x, /) - returns array of unique values
  2. unique_all(x, /) - returns tuple of 4 arrays (may need a better name)

Metadata

Metadata

Assignees

No one assigned

    Labels

    API changeChanges to existing functions or objects in the API.

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions