API: unified sorting

originally #5190
xref #9816
xref #3942 

This issue is for creating a unified API to Series & DataFrame sorting methods. Panels are not addressed (yet) but a unified API should be easy to extend to them. Related are #2094, #5190, #6847, #7121, #2615.  As discussion proceeds, this post will be edited. 

For reference, the 0.14.1 signatures are:

``` python
Series.sort(axis=0, ascending=True, kind='quicksort', na_position='last', inplace=True)
Series.sort_index(ascending=True)
Series.sortlevel(level=0, ascending=True, sort_remaining=True)

DataFrame.sort(columns=None, axis=0, ascending=True, inplace=False, kind='quicksort', 
               na_position='last')
DataFrame.sort_index(axis=0, by=None, ascending=True, inplace=False, kind='quicksort', 
                     na_position='last')
DataFrame.sortlevel(level=0, axis=0, ascending=True, inplace=False, sort_remaining=True)
```

Proposed unified signature for `Series.sort` and `DataFrame.sort` (except Series version retains current inplace=True):

``` python
def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False, 
         kind='quicksort', na_position='last', sort_remaining=True):
         """Sort by labels (along either axis), by the values in column(s) or both.

         If both, labels take precedence over columns. If neither is specified,
         behavior is object-dependent: series = on values, dataframe = on index.

         Parameters
         ----------
         by : column name or list of column names
             if not None, sort on values in specified column name; perform nested
             sort if list of column names specified. this argument ignored by series
         axis : {0, 1}
             sort index/rows (0) or columns (1); for Series, only 0 can be specified
         level : int or level name or list of ints or list of column names
             if not None, sort on values in specified index level(s)
         ascending : bool or list of bool
             Sort ascending vs. descending. Specify list for multiple sort orders.
         inplace : bool
             if True, perform operation in-place (without creating new instance)
         kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}
             Choice of sorting algorithm. See np.sort for more information. 
             ‘mergesort’ is the only stable algorithm. For data frames, this option is 
             only applied when sorting on a single column or label.
         na_position : {'first', 'last'}
             ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end
         sort_remaining : bool
             if true and sorting by level and index is multilevel, sort by other levels
             too (in order) after sorting by specified level
         """
```

The `sort_index` signatures change too and `sort_columns` is created:

``` python
Series.sort_index(level=0, ascending=True, inplace=False, kind='quicksort', 
                  na_position='last', sort_remaining=True)
DataFrame.sort_index(level=0, axis=0, by=None, ascending=True, inplace=False, 
                     kind='quicksort', na_position='last', sort_remaining=True) 
                     # by is DEPRECATED, see change 7

DataFrame.sort_columns(by=None, level=0, ascending=True, inplace=False, 
                       kind='quicksort', na_position='last', sort_remaining=True)
                       # or maybe level=None
```

Proposed changes:
1. ~~make `inplace=False` default (changes `Series.sort`)~~ _maybe, possibly in 1.0_
2. new `by` argument to accept column-name/list-of-column-names in first position 
   - deprecate `columns` keyword of `DataFrame.sort`, replaced with `by` (df.sort signature would need to retain columns keyword until finally removed but it's not shown in proposal)
   - don't allow tuples to access levels of multi-index (`columns` arg of `DataFrame.sort` allows tuples); use new `level` argument instead
   - don't swap order of `by`/`axis` in `DataFrame.sort_index` (see change 7)
   - this argument is ignored by series but `axis` is too so for the sake of working with dataframes, it gets first position
3. new `level` argument to accept integer/level-name/list-of-ints/list-of-level-names for sorting (multi)index by particular level(s)
   - replaces tuple behavior of `columns` arg of `DataFrame.sort`
   - add `level` argument to `sort_index` in first position so level(s) of multilevel index can be specified; this makes `sort_index`==`sortlevel` (see change 8)
   - also adds `sort_remaining` arg to handle multi-level indexes
4. new method `DataFrame.sort_columns`==`sort(axis=1)` (see syntax below)
5. deprecate `Series.order` since change 1 makes `Series.sort` equivalent (?)
6. add `inplace`, `kind`, and `na_position` arguments to `Series.sort_index` (to match `DataFrame.sort_index`); `by` and `axis` args are not added since they don't make sense for series
7. deprecate and eventually remove `by` argument from `DataFrame.sort_index` since it makes `sort_index` equivalent to `sort`
8. deprecate `sortlevel` since change 3b makes `sort_index` equivalent

Notes:
- default behavior of `sort` is still object-dependent: for series, sorts by values and for data frames, sorts by index
- new `level` arg makes `sort_index` and `sortlevel` equivalent. if sortlevel is retained:
  - should rename `sortlevel` to `sort_level` for naming conventions
  - `Series.sortlevel` should have `inplace` argument added
  - maybe don't add `level` and `sort_remaining` args to `sort_index` so it's not equivalent to `sort_level` (intentionally limiting sort_index seems like a bad idea though)
- it's unclear if default should be `level=None` for `sort_columns`. probably not since level=None falls back to level=0 anyway 
- both `by` and `axis` arguments should be ignored by `Series.sort`

Syntax: 
- dataframes
  - `sort()` == `sort(level=0)` == `sort_index()` == `sortlevel()`
    - without columns or level specified, defaults to current behavior of sort on index
  - `sort(['A','B'])`
    - since columns are specified, default index sort should not occur; sorting only happens using columns 'A' and 'B'
  - `sort(level='spam')` == `sort_index('spam')` == `sortlevel('spam')`
    - sort occurs on row index named 'spam' or level of multi-index named 'spam'
  - `sort(['A','B'], level='spam')`
    - `level` controls here even though columns are specified so sort happens along row index named 'spam' first, then nested sort occurs using columns 'A' and 'B'
  - `sort(axis=1)` == `sort(axis=1, level=0)` == `sort_columns()` 
    - since data frames default to sort on index, leaving level=None is the same as level=0
  - `sort(['A','B'], axis=1)` == `sort_columns(['A','B'])`
    - as with preceding example, level=None becomes level=0 in sort_columns
  - `sort(['A','B'], axis=1, level='spam')` == `sort_columns(['A','B'], level='spam')`
    - `axis` controls `level` so sort will be on columns named 'A' and 'B' in column index named 'spam'
- series:
  - `sort()` == `order()` -- sorts on values
  - with `level` specified, sorts on index/named index/level of multi-index:
    - `sort(level=0)` == `sort_index()` == `sortlevel()`
    - `sort(level='spam')` == `sort_index('spam')` == `sortlevel('spam')`

Comments welcome.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: unified sorting #8239

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

API: unified sorting #8239

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions