Description
unique was added very early on in gh-25. The review discussions have focused on
- (a) whether return values should be sorted or not (see Specifying a sort order when returning unique values #40), and
- (b) the issue with value-dependent output shape (see Boolean array indexing is not compatible with static memory allocation #84).
There is another issue that was not discussed: unique is the only function whose return type is polymorphic. It can return an array, or a length-2, 3 or 4 tuple of arrays:
unique(x, /, *, return_counts=False, return_index=False, return_inverse=False)
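To make the polymorphism concrete, here is a minimal pure-Python sketch. It is a toy stand-in, not any library's implementation: plain lists play the role of arrays, and the output order (values, index, inverse, counts) is an assumption borrowed from NumPy's convention.

```python
def unique(x, /, *, return_counts=False, return_index=False, return_inverse=False):
    """Toy stand-in for the signature above; lists play the role of arrays."""
    values = sorted(set(x))
    out = [values]
    if return_index:
        # index of the first occurrence of each unique value
        out.append([x.index(v) for v in values])
    if return_inverse:
        # positions such that values[inverse[i]] == x[i]
        position = {v: i for i, v in enumerate(values)}
        out.append([position[v] for v in x])
    if return_counts:
        out.append([x.count(v) for v in values])
    # This line is the polymorphism: a bare list or a tuple of 2, 3 or 4 lists.
    return out[0] if len(out) == 1 else tuple(out)


data = [3, 1, 3, 2, 1]
plain = unique(data)                            # a list
with_counts = unique(data, return_counts=True)  # a 2-tuple of lists
everything = unique(
    data, return_counts=True, return_index=True, return_inverse=True
)                                               # a 4-tuple of lists
```

The return type depends on the values of the keyword arguments, not on their types, which is why a compiler or type checker cannot know it without inspecting the call.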
This issue is painful for, e.g., libraries that have to support it in their JIT compiler. For svd, which previously was polymorphic too, we decided that it was better to introduce svdvals to get rid of the polymorphism. There's even a design principle that says as much in https://data-apis.org/array-api/latest/extensions/linear_algebra_functions.html: "In general, interfaces should avoid polymorphic return values (e.g., returning an array or a namedtuple, dependent on, e.g., an optional keyword argument)."
Given that there are 3 boolean keywords, it's not feasible to split unique into separate functions (there'd be 8 of them). An alternative may be:
- return a custom object with four named fields: (unique, indices, inverse, counts).
  - no tuple unpacking
  - if data is returned, its type is array
  - for fields where the boolean keyword was False, accessing the field is undefined behavior
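The custom-object alternative could look roughly like the sketch below. The class name UniqueResult is hypothetical, lists stand in for arrays, and raising AttributeError for a field that was not requested is just one possible choice, since the proposal leaves that case undefined.

```python
class UniqueResult:
    """Hypothetical result object: four named fields, no tuple unpacking."""

    _FIELDS = ("unique", "indices", "inverse", "counts")

    def __init__(self, **fields):
        self._data = fields

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        data = self.__dict__.get("_data", {})
        if name in data:
            return data[name]
        if name in self._FIELDS:
            # The proposal says "undefined behavior"; raising is an assumption.
            raise AttributeError(f"field {name!r} was not requested")
        raise AttributeError(name)


def unique(x, /, *, return_counts=False, return_index=False, return_inverse=False):
    """Toy version that always returns the same type (lists instead of arrays)."""
    values = sorted(set(x))
    fields = {"unique": values}
    if return_index:
        fields["indices"] = [x.index(v) for v in values]
    if return_inverse:
        position = {v: i for i, v in enumerate(values)}
        fields["inverse"] = [position[v] for v in x]
    if return_counts:
        fields["counts"] = [x.count(v) for v in values]
    return UniqueResult(**fields)


res = unique([3, 1, 3, 2, 1], return_counts=True)
print(res.unique, res.counts)  # [1, 2, 3] [2, 1, 2]
```

Because the class defines neither `__iter__` nor `__getitem__`, an attempt such as `values, counts = res` fails with a TypeError, which matches the "no tuple unpacking" point.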
Another question that may be relevant is: how often are the boolean keywords each used? The data at https://raw.githubusercontent.com/data-apis/python-record-api/master/data/typing/numpy.py may help there:
@overload
def unique(ar: numpy.ndarray, return_inverse: bool):
    """
    usage.scipy: 14
    usage.skimage: 5
    usage.sklearn: 125
    usage.statsmodels: 18
    usage.xarray: 3
    """
    ...

@overload
def unique(ar: numpy.ndarray):
    """
    usage.dask: 7
    usage.matplotlib: 6
    usage.orange3: 16
    usage.scipy: 39
    usage.seaborn: 5
    usage.skimage: 48
    usage.sklearn: 385
    usage.statsmodels: 45
    usage.xarray: 10
    """
    ...

@overload
def unique(ar: numpy.ndarray, return_inverse: bool, return_counts: bool):
    """
    usage.skimage: 3
    usage.sklearn: 2
    usage.statsmodels: 1
    """
    ...

@overload
def unique(ar: numpy.ndarray, return_counts: bool):
    """
    usage.orange3: 3
    usage.scipy: 5
    usage.skimage: 8
    usage.sklearn: 1
    usage.statsmodels: 1
    """
    ...

@overload
def unique(ar: numpy.ndarray, return_index: bool):
    """
    usage.scipy: 5
    usage.skimage: 4
    usage.sklearn: 1
    usage.statsmodels: 1
    """
    ...

@overload
def unique(ar: List[int], return_counts: bool):
    """
    usage.orange3: 1
    """
    ...

@overload
def unique(ar: xarray.core.dataarray.DataArray):
    """
    usage.xarray: 2
    """
    ...

@overload
def unique(ar: List[numpy.int32]):
    """
    usage.statsmodels: 1
    """
    ...

@overload
def unique(ar: List[Literal["b", "a"]]):
    """
    usage.sklearn: 2
    usage.statsmodels: 1
    """
    ...

@overload
def unique(ar: numpy.ndarray, return_index: bool, return_inverse: bool):
    """
    usage.matplotlib: 1
    usage.sklearn: 6
    usage.statsmodels: 3
    """
    ...

@overload
def unique(ar: List[Literal["4"]], return_inverse: bool):
    """
    usage.statsmodels: 1
    """
    ...

@overload
def unique(
    ar: Union[
        pandas.core.series.Series,
        numpy.ndarray,
        pandas.core.arrays.categorical.Categorical,
        List[str],
    ],
    return_inverse: bool = ...,
    return_index: bool = ...,
):
    """
    usage.pandas: 22
    """
    ...

@overload
def unique(ar: List[float]):
    """
    usage.scipy: 2
    usage.sklearn: 3
    """
    ...

@overload
def unique(ar: numpy.ndarray, axis: int):
    """
    usage.scipy: 1
    """
    ...

@overload
def unique(ar: pandas.core.series.Series):
    """
    usage.dask: 3
    usage.geopandas: 1
    usage.prophet: 4
    usage.seaborn: 4
    usage.sklearn: 1
    """
    ...

@overload
def unique(ar: numpy.ma.core.MaskedArray):
    """
    usage.matplotlib: 1
    """
    ...

@overload
def unique(ar: List[int]):
    """
    usage.seaborn: 2
    usage.sklearn: 26
    """
    ...

@overload
def unique(ar: List[numpy.int64]):
    """
    usage.dask: 4
    """
    ...

@overload
def unique(ar: List[numpy.float64]):
    """
    usage.dask: 5
    usage.sklearn: 2
    """
    ...

@overload
def unique(ar: List[numpy.complex128]):
    """
    usage.dask: 3
    """
    ...

@overload
def unique(ar: List[numpy.bool_]):
    """
    usage.dask: 3
    """
    ...

@overload
def unique(ar: List[numpy.float32]):
    """
    usage.dask: 2
    """
    ...

@overload
def unique(
    ar: numpy.ndarray, return_index: bool, return_inverse: bool, return_counts: bool
):
    """
    usage.dask: 2
    """
    ...

@overload
def unique(ar: numpy.memmap):
    """
    usage.sklearn: 2
    """
    ...

@overload
def unique(ar: List[Literal["copyright", "beer", "pizza", "the"]]):
    """
    usage.sklearn: 2
    """
    ...

@overload
def unique(ar: List[Literal["copyright", "beer", "burger", "pizza", "the"]]):
    """
    usage.sklearn: 1
    """
    ...

@overload
def unique(ar: List[Literal["copyright", "beer", "burger", "the"]]):
    """
    usage.sklearn: 1
    """
    ...

@overload
def unique(ar: List[Literal["copyright", "coke", "burger", "the"]]):
    """
    usage.sklearn: 1
    """
    ...

@overload
def unique(ar: List[Literal["burger", "coke", "the"]]):
    """
    usage.sklearn: 1
    """
    ...

@overload
def unique(ar: List[Literal["copyright", "celeri", "salad", "the"]]):
    """
    usage.sklearn: 1
    """
    ...

@overload
def unique(ar: List[Literal["copyright", "water", "sparkling", "salad", "the"]]):
    """
    usage.sklearn: 1
    """
    ...

@overload
def unique(ar: List[Literal["copyright", "celeri", "the"]]):
    """
    usage.sklearn: 1
    """
    ...

@overload
def unique(ar: List[Literal["water", "salad", "tomato", "the"]]):
    """
    usage.sklearn: 1
    """
    ...

@overload
def unique(ar: List[Literal["copyright", "water", "salad", "tomato", "the"]]):
    """
    usage.sklearn: 1
    """
    ...

@overload
def unique(ar: List[Literal["three", "two", "one"]]):
    """
    usage.sklearn: 3
    """
    ...

@overload
def unique(ar: List[Union[float, int]], return_inverse: bool):
    """
    usage.sklearn: 1
    """
    ...

@overload
def unique(ar: List[Literal["b", "c", "a"]]):
    """
    usage.sklearn: 1
    """
    ...

@overload
def unique(ar: List[Literal["3", "2", "1"]]):
    """
    usage.sklearn: 2
    """
    ...

def unique(
    ar: object,
    return_index: bool = ...,
    return_inverse: bool = ...,
    return_counts: bool = ...,
):
    """
    usage.dask: 29
    usage.geopandas: 1
    usage.matplotlib: 8
    usage.orange3: 20
    usage.pandas: 22
    usage.prophet: 4
    usage.scipy: 66
    usage.seaborn: 11
    usage.skimage: 68
    usage.sklearn: 574
    usage.statsmodels: 72
    usage.xarray: 15
    """

Conclusion: a large majority of usage is unique(x), without any keywords. return_inverse is fairly often used, return_index almost never (checked by searching Pandas, sklearn et al.).
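As a rough check of that conclusion, the per-library counts can be tallied. This sketch copies the numbers from only the four single-purpose numpy.ndarray overloads listed above (no keywords, return_inverse, return_counts, return_index); the list-typed, pandas and multi-keyword overloads are left out, so the totals are indicative only.

```python
# Counts copied from the numpy.ndarray overloads in the listing above.
usage = {
    "no keywords": {
        "dask": 7, "matplotlib": 6, "orange3": 16, "scipy": 39, "seaborn": 5,
        "skimage": 48, "sklearn": 385, "statsmodels": 45, "xarray": 10,
    },
    "return_inverse": {
        "scipy": 14, "skimage": 5, "sklearn": 125, "statsmodels": 18, "xarray": 3,
    },
    "return_counts": {
        "orange3": 3, "scipy": 5, "skimage": 8, "sklearn": 1, "statsmodels": 1,
    },
    "return_index": {
        "scipy": 5, "skimage": 4, "sklearn": 1, "statsmodels": 1,
    },
}

# Sum the per-library counts for each call pattern, largest first.
totals = {pattern: sum(counts.values()) for pattern, counts in usage.items()}
for pattern, total in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{pattern:>15}: {total}")
```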
So another alternative could be to include two separate functions:
- unique(x, /) - returns array of unique values
- unique_all(x, /) - returns tuple of 4 arrays (may need a better name)
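A minimal sketch of the two-function split, again with lists standing in for arrays. The NamedTuple and its field names (values, indices, inverse, counts) are assumptions made for illustration; the proposal only says "tuple of 4 arrays".

```python
from typing import List, NamedTuple


class UniqueAll(NamedTuple):
    """Hypothetical fixed-shape result: always exactly four fields."""
    values: List
    indices: List
    inverse: List
    counts: List


def unique(x, /):
    """Always returns a single list of unique values."""
    return sorted(set(x))


def unique_all(x, /):
    """Always returns a 4-tuple, so the return type never varies."""
    values = unique(x)
    position = {v: i for i, v in enumerate(values)}
    return UniqueAll(
        values=values,
        indices=[x.index(v) for v in values],  # first occurrence of each value
        inverse=[position[v] for v in x],      # values[inverse[i]] == x[i]
        counts=[x.count(v) for v in values],
    )


data = [3, 1, 3, 2, 1]
only_values = unique(data)
full = unique_all(data)
```

Each function has exactly one return type, so callers that only need the values pay no cost, and callers that need more select fields from a fixed-length tuple.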