Commit 6b22d12

Add section on handling categorical values
Author: Patrick Park
1 parent 7ed1f53 commit 6b22d12

1 file changed

+54
-0
lines changed

doc/source/groupby.rst

@@ -989,6 +989,60 @@ Note that ``df.groupby('A').colname.std().`` is more efficient than
is only interesting over one column (here ``colname``), it may be filtered
*before* applying the aggregation function.

.. _groupby.observed:

Handling of (un)observed Categorical values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using a ``Categorical`` grouper (as a single grouper, or as part of multiple groupers), the ``observed`` keyword
controls whether to return a cartesian product of all possible grouper values (``observed=False``) or only those
values that are actually observed (``observed=True``).

Show all values:

.. ipython:: python

   pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()

Show only the observed values:

.. ipython:: python

   pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count()

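The "cartesian product" wording is easiest to see with more than one ``Categorical`` grouper. The following is a minimal sketch, not part of the original commit; the frame and its column names (``kind``, ``color``, ``n``) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: two categorical groupers, each with two categories,
# but only two of the four possible (kind, color) pairs occur.
df = pd.DataFrame({
    "kind": pd.Categorical(["s", "s", "l"], categories=["s", "l"]),
    "color": pd.Categorical(["red", "red", "blue"], categories=["red", "blue"]),
    "n": [1, 1, 1],
})

# observed=False: one row per combination of categories, observed or not.
all_combos = df.groupby(["kind", "color"], observed=False)["n"].count()

# observed=True: only the combinations that actually appear in the data.
seen_only = df.groupby(["kind", "color"], observed=True)["n"].count()

print(all_combos)
print(seen_only)
```

With ``observed=False`` the result has one entry per pair of categories, and the unobserved pairs get a count of zero; with ``observed=True`` only the pairs present in the data are returned.
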
The dtype of the returned index will *always* include *all* of the categories that were grouped.

.. ipython:: python

   s = pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
   s.index.dtype

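The same should hold when only observed values are shown. As a small sketch (the variable names here are illustrative, not from the original text), the result index below contains a single entry while its dtype still carries both categories:

```python
import pandas as pd

cat = pd.Categorical(["a", "a", "a"], categories=["a", "b"])

# Only the observed category 'a' appears as a row in the result...
observed_only = pd.Series([1, 1, 1]).groupby(cat, observed=True).count()
index_values = list(observed_only.index)

# ...but the index dtype still lists every category of the grouper.
index_categories = list(observed_only.index.dtype.categories)

print(index_values)
print(index_categories)
```
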
.. note::
   Decimal and object columns are also "nuisance" columns. They are excluded from aggregate functions automatically in groupby.

   If you do wish to include decimal or object columns in an aggregation with other non-nuisance data types, you must do so explicitly.

.. ipython:: python

   from decimal import Decimal
   dec = pd.DataFrame(
       {'id': [123, 456, 123, 456],
        'int_column': [1, 2, 3, 4],
        'dec_column': [Decimal('0.50'), Decimal('0.15'), Decimal('0.25'), Decimal('0.40')]
       },
       columns=['id', 'int_column', 'dec_column']
   )

   # Decimal columns can be summed explicitly by themselves...
   dec.groupby(['id'], as_index=False)['dec_column'].sum()

   # ...but cannot be combined with standard data types or they will be excluded
   dec.groupby(['id'], as_index=False)[['int_column', 'dec_column']].sum()

   # Use .agg function to aggregate over standard and "nuisance" data types at the same time
   dec.groupby(['id'], as_index=False).agg({'int_column': 'sum', 'dec_column': 'sum'})

.. _groupby.missing:

NA and NaT group handling
