Description
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"A": [1, 2, 3, 4, 5],
"B": [10, 20, 30, 40, 50],
"groupby1": ["diamond", "diamond", "spade", "spade", "spade"]
}
)
def user_defined_aggfunc(df, addition=0):
return np.mean(df) + addition
out_df = df.groupby(by="groupby1").agg(func=user_defined_aggfunc, addition=5)
out_df
A B
groupby1
diamond 13.25 13.25
spade 27.00 27.00
Issue Description
I am currently working on a PR to add numeric_only
support to DataFrameGroupBy.agg()
as it relates to some work on Geopandas I am currently trying to resolve. #56946, #58132
I noted there is an edge case that is currently not working as intended. This edge case occurs under the following conditions:
- A single callable function is submitted as
func
- A
kwarg
is included - There is more than one column in the dataframe.
The first 2 conditions is what would lead it to use: DataFrameGroupBy._aggregate_frame
The third condition causes the bug in DataFrameGroupBy._aggregate_frame
because the aggfunc would be applied to the entire dataframe, as opposed to column by column.
This has actually been touched upon in an existing test:
def test_pass_args_kwargs_duplicate_columns(tsframe, as_index):
# go through _aggregate_frame with self.axis == 0 and duplicate columns
tsframe.columns = ["A", "B", "A", "C"]
gb = tsframe.groupby(lambda x: x.month, as_index=as_index)
res = gb.agg(np.percentile, 80, axis=0)
ex_data = {
1: tsframe[tsframe.index.month == 1].quantile(0.8),
2: tsframe[tsframe.index.month == 2].quantile(0.8),
}
expected = DataFrame(ex_data).T
if not as_index:
expected.insert(0, "index", [1, 2])
expected.index = Index(range(2))
tm.assert_frame_equal(res, expected)
Where the kwarg axis=0
was fed into agg()
. Three comments I have on this is that:
- it isn't necessarily obvious that this argument is required based on the documentation.
- not putting in that argument doesn't raise an error nor warning, but takes the 80th percentile in that group for all columns
- if using a UDF, such an argument would not be practical to be required in all cases
The root issue is how the aggfunc is applied in GroupByDataFrame._aggregate_frame()
when there is more than one column. Without the axis=0
kwarg, it applies the function on all columns to get a single value.
Below is a very braindead example of something that might work for _aggregate_frame
:
def _aggregate_frame(self, func, *args, **kwargs) -> DataFrame:
if self._grouper.nkeys != 1:
raise AssertionError("Number of keys must be 1")
obj = self._obj_with_exclusions
result_index = self._grouper.result_index
out = self.obj._constructor([], index=obj.columns, columns=result_index)
for name, grp_df in self._grouper.get_iterator(obj):
aggregated_row_data = list()
for counter, _ in enumerate(grp_df.columns):
fres = func(grp_df.iloc[:,counter], *args, **kwargs)
aggregated_row_data.append(fres)
out[name] = aggregated_row_data
out = out.T
return out
There are a number of things that need to be accounted for:
- Need to consider that we can have duplicate column names, therefore we can not right out to the
result
dict()
as dictionaries don't allow duplicates. Hence why the object is created to at the start and written to in the for loop - If calling a dataframe[column_name], where there are duplicates names, all the data in the columns will show up and the function will be applied to all of them. Hence why iloc is used.
- To address wanting the function to be operated on only in the column-by-column, it's going through a for-loop.
This was just a naive approach to the issue but would like to confirm that this would be considered an issue before giving serious thought about a more appropriate fix.
Expected Behavior
A B
groupby1
diamond 6.5 20.0
spade 9.0 45.0
Installed Versions
INSTALLED VERSIONS
commit : 6f39c4f
python : 3.10.14.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-26-generic
Version : #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8
pandas : 3.0.0.dev0+666.g6f39c4fa1f.dirty
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 69.2.0
pip : 24.0
Cython : 3.0.10
pytest : 8.1.1
hypothesis : 6.100.0
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.8
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : 2024.3.1
matplotlib : 3.8.3
numba : 0.59.1
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pyarrow : 15.0.2
pyreadstat : 1.2.7
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.3.1
scipy : 1.12.0
sqlalchemy : 2.0.29
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None