Closed
Description
Referencing the comments in #19481, rank operations performed against object dtypes currently have a few issues, namely that they:
- Are inherently ambiguous, relying on lexical encoding AND
- Are not consistent across Series, DataFrame and GroupBy objects with various arguments
To illustrate the latter:
In [1]: vals = ['apple', 'orange', 'banana']
In [2]: pd.Series(vals).rank() # this will "work"
Out[2]:
0 1.0
1 3.0
2 2.0
dtype: float64
In [3]: pd.Series(vals).rank(method='first') # raises
ValueError: first not supported for non-numeric data
In [4]: pd.DataFrame({'key': ['foo'] * 3, 'vals': vals}).groupby('key').rank(method='first') # should raise?
Out[4]:
Empty DataFrame
Columns: []
Index: []
(see also #19482)
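On the first point, a small sketch of why a lexical ordering is ambiguous. It uses plain Python string comparison rather than pandas internals, and the mixed-case sample values are made up for illustration; whether a user expects code-point order or case-insensitive order is exactly what cannot be inferred from the data.

```python
# Hypothetical sample values, chosen only to show the ambiguity.
vals = ["apple", "Banana", "cherry"]

# Default string comparison orders by Unicode code point,
# so every uppercase ASCII letter sorts before any lowercase one.
by_code_point = sorted(vals)

# A case-insensitive ordering is an equally plausible "natural" order.
by_casefold = sorted(vals, key=str.casefold)

print(by_code_point)  # ['Banana', 'apple', 'cherry']
print(by_casefold)    # ['apple', 'Banana', 'cherry']
```

The two orderings disagree on which value comes first, so any rank derived from them would disagree as well.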
With this change I'd propose that we simply raise ValueError consistently for rank against object
dtypes regardless of which type of object performs the transformation and regardless of arguments.
One known caveat is that Categorical types currently use the rank_object methods in algos. My assumption is that we would want to continue supporting ranking for ordered Categoricals but raise for unordered Categoricals.
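A rough sketch of what the proposed check could look like from the user's side. `check_rankable` and `strict_rank` are hypothetical names used only for illustration; they are not part of pandas, and the real change would presumably live inside the rank implementations rather than in a wrapper. The error messages are likewise placeholders.

```python
import pandas as pd


def check_rankable(values: pd.Series) -> None:
    """Hypothetical helper: raise ValueError for dtypes that should not be ranked."""
    dtype = values.dtype
    if isinstance(dtype, pd.CategoricalDtype):
        # Ordered Categoricals carry an explicit ordering, so ranking is well defined.
        if not dtype.ordered:
            raise ValueError("rank is not supported for unordered Categorical data")
        return
    if pd.api.types.is_object_dtype(dtype):
        # Raise regardless of the method or other arguments passed to rank.
        raise ValueError("rank is not supported for object dtype")


def strict_rank(values: pd.Series, **kwargs) -> pd.Series:
    """Hypothetical wrapper: validate the dtype, then defer to Series.rank."""
    check_rankable(values)
    return values.rank(**kwargs)
```

Under this sketch, `strict_rank(pd.Series(vals, dtype=object))` and `strict_rank(pd.Series(vals, dtype=object), method='first')` would both raise, while numeric data and ordered Categoricals would pass through to `rank` unchanged.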