Raise ValueError When Attempting to Rank Object Dtypes #19560

Closed
@WillAyd

Description

Referencing the comments in #19481, rank operations performed against object dtypes currently have a few issues, namely that they:

  • Are inherently ambiguous, relying on lexical encoding AND
  • Are not consistent across Series, DataFrame and GroupBy objects with various arguments

To illustrate the latter:

In [1]: vals = ['apple', 'orange', 'banana']
In [2]: pd.Series(vals).rank()  # this will "work"
Out[2]: 
0    1.0
1    3.0
2    2.0
dtype: float64

In [3]: pd.Series(vals).rank(method='first')  # raises
ValueError: first not supported for non-numeric data

In [4]: pd.DataFrame({'key': ['foo'] * 3, 'vals': vals}).groupby('key').rank(method='first')  # should raise?
Out[4]: 
Empty DataFrame
Columns: []
Index: []

(see also #19482)

With this change I'd propose that we simply raise ValueError consistently for rank against object dtypes regardless of which type of object performs the transformation and regardless of arguments.
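To make the proposal concrete, here is a rough sketch of the behavior from the user's side. `rank_strict` is a hypothetical wrapper written only for illustration; it is not an existing pandas function, and the error message is an assumption. The real change would live inside the rank implementations themselves and would also need to cover GroupBy.

```python
import pandas as pd
from pandas.api.types import is_object_dtype


def rank_strict(obj, **kwargs):
    """Hypothetical helper: rank a Series or DataFrame, but refuse object dtypes.

    Raises ValueError for object dtypes regardless of the arguments
    passed through to ``rank``.
    """
    # A Series has a single dtype; a DataFrame has one per column.
    dtypes = [obj.dtype] if isinstance(obj, pd.Series) else list(obj.dtypes)
    if any(is_object_dtype(dt) for dt in dtypes):
        # Assumed wording; the actual message would be decided in the PR.
        raise ValueError("rank is not supported for object dtypes")
    return obj.rank(**kwargs)


# Numeric data ranks as usual.
numeric_ranks = rank_strict(pd.Series([3, 1, 2]))

# Object data raises, with or without method='first'.
vals = pd.Series(["apple", "orange", "banana"], dtype=object)
for extra in ({}, {"method": "first"}):
    try:
        rank_strict(vals, **extra)
    except ValueError as err:
        print(f"rank({extra}) raised: {err}")
```

The point of the sketch is only that the check depends on the dtype and nothing else, so `method='first'` and the default method would behave the same way.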

One known caveat is that Categorical types currently use the rank_object methods in algos. My assumption is that we would want to continue supporting ranking for ordered Categoricals but raise for unordered Categoricals.
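A possible shape for the Categorical handling, again only as a sketch: `rank_categorical` is a hypothetical name, and ranking by the integer category codes is one option I am assuming here, not what `rank_object` does today. An ordered Categorical has an unambiguous order defined by its categories, so the codes can stand in for the values.

```python
import pandas as pd


def rank_categorical(cat, **kwargs):
    """Hypothetical helper: rank an ordered Categorical via its integer codes."""
    if not cat.ordered:
        # Unordered categories have no meaningful rank.
        raise ValueError("cannot rank an unordered Categorical")
    codes = pd.Series(cat.codes).astype("float64")
    # Missing values are encoded as -1; turn them into NaN so that
    # rank() leaves them as NaN (its default na_option is 'keep').
    codes = codes.where(codes >= 0)
    return codes.rank(**kwargs)


sizes = pd.Categorical(
    ["small", "large", "medium"],
    categories=["small", "medium", "large"],
    ordered=True,
)
size_ranks = rank_categorical(sizes)  # follows category order, not lexical order
```

Note that the result follows the declared category order, which is the distinction from the lexical ranking that plain object dtypes fall back on.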
