@@ -989,6 +989,60 @@ Note that ``df.groupby('A').colname.std().`` is more efficient than
is only interesting over one column (here ``colname``), it may be filtered
*before* applying the aggregation function.

+ .. _groupby.observed:
+
+ Handling of (un)observed Categorical values
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ When using a ``Categorical`` grouper (as a single grouper, or as part of multiple groupers), the ``observed`` keyword
+ controls whether to return a cartesian product of all possible grouper values (``observed=False``) or only those
+ combinations that are actually observed in the data (``observed=True``).
+
+ Show all values:
+
+ .. ipython:: python
+
+    pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
+
+ Show only the observed values:
+
+ .. ipython:: python
+
+    pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count()
+
+ The index of the grouped result will *always* have a dtype that includes *all* of the categories that were grouped.
+
+ .. ipython:: python
+
+    s = pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
+    s.index.dtype
+
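The cartesian-product behaviour described above is easiest to see with more than one categorical grouper. The following is a minimal sketch, not part of the original change: the frame ``df`` and the column names ``k1``, ``k2`` and ``value`` are made up for illustration. Each grouper declares one category that never occurs in the data.

```python
import pandas as pd

# Hypothetical data: 'c' and 'z' are declared categories that never appear.
cat1 = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c'])
cat2 = pd.Categorical(['x', 'y', 'x'], categories=['x', 'y', 'z'])
df = pd.DataFrame({'k1': cat1, 'k2': cat2, 'value': [1, 2, 3]})

# observed=False: one row per combination of declared categories (3 x 3).
all_combos = df.groupby(['k1', 'k2'], observed=False)['value'].count()

# observed=True: only the combinations that occur in the data.
seen_combos = df.groupby(['k1', 'k2'], observed=True)['value'].count()

print(len(all_combos), len(seen_combos))
```

With ``observed=False`` the unobserved combinations are still present in the result and simply carry a count of zero, so the total count is the same in both cases.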
+ .. note::
+    Decimal and object columns are also "nuisance" columns. They are excluded from aggregate functions automatically in groupby.
+
+    If you do wish to include decimal or object columns in an aggregation with other non-nuisance data types, you must do so explicitly.
+
+ .. ipython:: python
+
+    from decimal import Decimal
+    dec = pd.DataFrame(
+        {'id': [123, 456, 123, 456],
+         'int_column': [1, 2, 3, 4],
+         'dec_column': [Decimal('0.50'), Decimal('0.15'), Decimal('0.25'), Decimal('0.40')]
+        },
+        columns=['id', 'int_column', 'dec_column']
+    )
+
+    # Decimal columns can be sum'd explicitly by themselves...
+    dec.groupby(['id'], as_index=False)['dec_column'].sum()
+
+    # ...but cannot be combined with standard data types or they will be excluded
+    dec.groupby(['id'], as_index=False)[['int_column', 'dec_column']].sum()
+
+    # Use .agg function to aggregate over standard and "nuisance" data types at the same time
+    dec.groupby(['id'], as_index=False).agg({'int_column': 'sum', 'dec_column': 'sum'})
+
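One way to make the "explicit" aggregation in the note independent of how a given pandas version treats object columns is to pass a callable for the Decimal column. This is only a sketch under that assumption: ``decimal_sum`` is a hypothetical helper introduced here, not something defined by pandas or by this change.

```python
from decimal import Decimal

import pandas as pd

dec = pd.DataFrame({
    'id': [123, 456, 123, 456],
    'int_column': [1, 2, 3, 4],
    'dec_column': [Decimal('0.50'), Decimal('0.15'),
                   Decimal('0.25'), Decimal('0.40')],
})


def decimal_sum(values):
    # Hypothetical helper: start from Decimal(0) so the total stays a Decimal.
    return sum(values, Decimal(0))


# Built-in 'sum' for the integer column, explicit callable for the Decimal one.
result = dec.groupby('id').agg({'int_column': 'sum', 'dec_column': decimal_sum})
print(result)
```

Starting the sum from ``Decimal(0)`` keeps the arithmetic exact, which is usually the reason for storing ``Decimal`` values in the first place.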
+
.. _groupby.missing:

NA and NaT group handling