DataFrame groupby.first() is much slower than groupby.nth(0) on categorical dtypes

`groupby.first()` is much slower than `groupby.nth(0)` for categorical columns in a very specific way.

Consider carefully the example below, where a dataframe has two columns `c1` and `c2`. The number of unique values in the `c1` column (regardless of its datatype) increases the runtime of `first` _when and only when_ `c2` is a categorical column.

```python
import pandas as pd, numpy as np, timeit

def test(N_CATEGORIES, cat_cols = ['c2']):
    # creates dataframe df with categorical column c1, optional categorical column c2.
    # Times how long grouping by c1 and calling nth(0) and first takes
    global df 
    print(N_CATEGORIES)
    df = pd.DataFrame({
        'c1':np.arange(0,10000)%N_CATEGORIES,
        'c2':np.arange(0,10000)
        })
    for col in cat_cols:
        df[col] = df[col].astype('category')
    t_nth = timeit.timeit("x1 =df.groupby(['c1']).nth(0, dropna='all')", setup="from __main__ import df", number=1)
    t_first = timeit.timeit("x2 = df.groupby(['c1']).first()", setup="from __main__ import df", number=1)
    return t_nth, t_first

test_N_categories = [1,10,100,1000,10000]

# Test when column c2 is categorical
results_c2_cat = pd.DataFrame([test(N, cat_cols=['c2']) for N in test_N_categories], 
                       index=test_N_categories, 
                       columns=['nth','first'])

# Test when column c2 is not categorical
results_c2_not_cat = pd.DataFrame([test(N, cat_cols=[]) for N in test_N_categories], 
                       index=test_N_categories, 
                       columns=['nth','first'])

print(results_c2_cat)
print(results_c2_not_cat)
```

The results are
```python
>>> print(results_c2_cat)
            nth     first
1      0.004204  0.010677
10     0.005890  0.015701
100    0.005305  0.052130
1000   0.004365  0.581036
10000  0.004358  2.847709
>>> print(results_c2_not_cat)
            nth     first
1      0.003479  0.001110
10     0.003027  0.000993
100    0.003297  0.001089
1000   0.003297  0.001057
10000  0.003952  0.001382
```

As shown, when the column `c2` is categorical, the runtime of `first` grows very rapidly as a function of the number of unique values in the column `c1` (and thus the "width" of the groupby).

`first` and `last` both suffer from this problem.

(see also https://github.com/pandas-dev/pandas/issues/19283 and https://github.com/pandas-dev/pandas/issues/19598)

#### Output of ``pd.show_versions()``

<details>

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.12.0
xarray: 0.11.0
IPython: None
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.6
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
</details>


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataFrame groupby.first() is much slower than groupby.nth(0) on categorical dtypes #25397

Output of `pd.show_versions()`

INSTALLED VERSIONS

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

DataFrame groupby.first() is much slower than groupby.nth(0) on categorical dtypes #25397

Description

Output of pd.show_versions()

INSTALLED VERSIONS

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Output of `pd.show_versions()`