Commit 5b0767a

DOC: User Guide Page on user-defined functions (#61195)
1 parent 5aa78c0 commit 5b0767a

2 files changed

+306
-0
lines changed

doc/source/user_guide/index.rst

Lines changed: 1 addition & 0 deletions
@@ -78,6 +78,7 @@ Guides
 boolean
 visualization
 style
+user_defined_functions
 groupby
 window
 timeseries
Lines changed: 305 additions & 0 deletions
@@ -0,0 +1,305 @@
.. _user_defined_functions:

{{ header }}

*****************************
User-Defined Functions (UDFs)
*****************************

In pandas, User-Defined Functions (UDFs) provide a way to extend the library’s
functionality by allowing users to apply custom computations to their data. While
pandas comes with a set of built-in functions for data manipulation, UDFs offer
flexibility when built-in methods are not sufficient. These functions can be
applied at different levels (element-wise, row-wise, column-wise, or group-wise)
and behave differently depending on the method used.

Here’s a simple example to illustrate a UDF applied to a Series:

.. ipython:: python

    s = pd.Series([1, 2, 3])

    # Simple UDF that adds 1 to a value
    def add_one(x):
        return x + 1

    # Apply the function element-wise using .map
    s.map(add_one)

You can also apply UDFs to an entire DataFrame. For example:

.. ipython:: python

    df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

    # UDF that takes a row and returns the sum of columns A and B
    def sum_row(row):
        return row["A"] + row["B"]

    # Apply the function row-wise (axis=1 means apply across columns per row)
    df.apply(sum_row, axis=1)


Why Not To Use User-Defined Functions
-------------------------------------

While UDFs provide flexibility, they come with significant drawbacks, primarily
related to performance and behavior. When a UDF is used, pandas must infer
properties of the result, such as its data type, from the values the function
returns, and that inference can be incorrect. Furthermore, unlike vectorized
operations, UDFs are slower because pandas cannot optimize their computations
and has to call the Python function repeatedly.

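The inference issue can be illustrated with a minimal sketch. The function name
``keep_large`` and the sample values are hypothetical and chosen only for this
illustration; the exact resulting dtype may vary between pandas versions.

.. code-block:: python

    import pandas as pd

    s = pd.Series([1, 2, 3])

    # Hypothetical UDF: returns None for small values, the value itself otherwise
    def keep_large(x):
        return x if x > 1 else None

    result = s.map(keep_large)

    # The input is integer-typed, but pandas has to guess the output type from
    # the returned objects, so the result no longer has the input's integer dtype.
    print(s.dtype, "->", result.dtype)

If a specific dtype is required, it is usually safer to convert the result
explicitly than to rely on what pandas infers.
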
.. note::
    In general, most tasks can and should be accomplished using pandas’ built-in methods or vectorized operations.

Despite their drawbacks, UDFs can be helpful when:

* **Custom Computations Are Needed**: Implementing complex logic or domain-specific calculations that pandas'
  built-in methods cannot handle.
* **Extending pandas' Functionality**: Applying external libraries or specialized algorithms unavailable in pandas.
* **Handling Complex Grouped Operations**: Performing operations on grouped data that standard methods do not support.

For example:

.. code-block:: python

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Sample data
    df = pd.DataFrame({
        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'x': [1, 2, 3, 1, 2, 3],
        'y': [2, 4, 6, 1, 2, 1.5]
    })

    # Function to fit a model to each group
    def fit_model(group):
        model = LinearRegression()
        model.fit(group[['x']], group['y'])
        # Return a new object instead of mutating the group in place
        return group.assign(y_pred=model.predict(group[['x']]))

    result = df.groupby('group').apply(fit_model)


Methods that support User-Defined Functions
-------------------------------------------

User-Defined Functions can be applied across various pandas methods:

+----------------------------+---------------------+------------------+--------------------------------------------------------------+
| Method                     | Function Input      | Function Output  | Description                                                  |
+============================+=====================+==================+==============================================================+
| :meth:`map`                | Scalar              | Scalar           | Apply a function to each element                             |
+----------------------------+---------------------+------------------+--------------------------------------------------------------+
| :meth:`apply` (axis=0)     | Column (Series)     | Column (Series)  | Apply a function to each column                              |
+----------------------------+---------------------+------------------+--------------------------------------------------------------+
| :meth:`apply` (axis=1)     | Row (Series)        | Row (Series)     | Apply a function to each row                                 |
+----------------------------+---------------------+------------------+--------------------------------------------------------------+
| :meth:`agg`                | Series/DataFrame    | Scalar or Series | Aggregate and summarize values, e.g., sum or custom reducer  |
+----------------------------+---------------------+------------------+--------------------------------------------------------------+
| :meth:`transform` (axis=0) | Column (Series)     | Column (Series)  | Same as :meth:`apply` with ``axis=0``, but it raises an      |
|                            |                     |                  | exception if the function changes the shape of the data      |
+----------------------------+---------------------+------------------+--------------------------------------------------------------+
| :meth:`transform` (axis=1) | Row (Series)        | Row (Series)     | Same as :meth:`apply` with ``axis=1``, but it raises an      |
|                            |                     |                  | exception if the function changes the shape of the data      |
+----------------------------+---------------------+------------------+--------------------------------------------------------------+
| :meth:`filter`             | Series or DataFrame | Boolean          | Accepts UDFs only on groupby objects. The function is called |
|                            |                     |                  | for each group, and the group is removed from the result if  |
|                            |                     |                  | the function returns ``False``                               |
+----------------------------+---------------------+------------------+--------------------------------------------------------------+
| :meth:`pipe`               | Series/DataFrame    | Series/DataFrame | Chain functions together to apply to Series or DataFrame     |
+----------------------------+---------------------+------------------+--------------------------------------------------------------+

When applying UDFs in pandas, it is essential to select the appropriate method based
on your specific task. Each method has its strengths and is designed for different use
cases. Understanding the purpose and behavior of each method will help you make informed
decisions, ensuring more efficient and maintainable code.

.. note::
    Some of these methods can also be applied to groupby, resample, and various window objects.
    See :ref:`groupby`, :ref:`resample()<timeseries>`, :ref:`rolling()<window>`, :ref:`expanding()<window>`,
    and :ref:`ewm()<window>` for details.


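The difference between :meth:`apply` and :meth:`transform` described in the table
can be sketched as follows. The function names ``column_total`` and ``double``
and the sample data are hypothetical and exist only for this illustration.

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

    # A reducer collapses each column to a single value
    def column_total(col):
        return col.sum()

    # apply accepts the reducer and returns one value per column
    totals = df.apply(column_total)

    # transform rejects it because the output no longer matches the input's shape
    try:
        df.transform(column_total)
        transform_failed = False
    except ValueError:
        transform_failed = True

    # A shape-preserving function works with both methods
    def double(col):
        return col * 2

    doubled = df.transform(double)

Choosing :meth:`transform` over :meth:`apply` therefore acts as a guard: a
function that accidentally reduces the data fails loudly instead of silently
returning a differently shaped result.
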
:meth:`DataFrame.apply`
~~~~~~~~~~~~~~~~~~~~~~~

The :meth:`apply` method allows you to apply UDFs along either rows or columns. While flexible,
it is slower than vectorized operations and should be used only when you need operations
that cannot be achieved with built-in pandas functions.

When to use: :meth:`apply` is suitable when neither a vectorized operation nor a more specific
UDF method (such as :meth:`agg` or :meth:`transform`) fits the task. Prefer vectorized operations wherever possible.

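As a minimal sketch of both directions, the example below applies one function
per row and one per column. The names ``label_row``, ``value_range`` and
``threshold`` are hypothetical; extra keyword arguments given to
:meth:`DataFrame.apply` are forwarded to the function.

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

    # Row-wise UDF with an extra parameter; each call receives one row as a Series
    def label_row(row, threshold):
        return "high" if row["A"] + row["B"] > threshold else "low"

    labels = df.apply(label_row, axis=1, threshold=15)

    # Column-wise UDF; each call receives one column as a Series
    def value_range(col):
        return col.max() - col.min()

    ranges = df.apply(value_range)  # axis=0 is the default
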
:meth:`DataFrame.agg`
~~~~~~~~~~~~~~~~~~~~~

If you need to aggregate data, :meth:`agg` is a better choice than :meth:`apply` because it is
specifically designed for aggregation operations.

When to use: Use :meth:`agg` for performing custom aggregations, where the function reduces
each input (for example, a column or a group) to a single scalar value.

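A custom reducer might look like the following sketch. The function name
``value_range`` and the sample data are hypothetical and used only for
illustration.

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({
        "group": ["a", "a", "b", "b"],
        "value": [1, 4, 10, 30],
    })

    # Custom reducer: collapses a Series to a single scalar
    def value_range(values):
        return values.max() - values.min()

    # One scalar per column
    overall = df[["value"]].agg(value_range)

    # One scalar per group
    per_group = df.groupby("group")["value"].agg(value_range)
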
:meth:`DataFrame.transform`
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :meth:`transform` method is ideal for performing element-wise transformations while preserving the shape of the original DataFrame.
It can be faster than :meth:`apply` when pandas is able to use an internal optimization, although this is not guaranteed for arbitrary UDFs.

When to use: When you need to perform element-wise transformations that retain the original structure of the DataFrame.

.. code-block:: python

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({
        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'x': [1, 2, 3, 1, 2, 3],
        'y': [2, 4, 6, 1, 2, 1.5]
    }).set_index("x")

    # Function to fit a model to each group.
    # ``group`` is the ``y`` column of one group, indexed by ``x``.
    def fit_model(group):
        x = group.index.to_frame()
        y = group
        model = LinearRegression()
        model.fit(x, y)
        pred = model.predict(x)
        return pred

    result = df.groupby('group').transform(fit_model)

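A pandas-only variant of the same idea is sketched below. It reuses the
``group`` and ``y`` values from the example above; the function name ``demean``
and the new column name ``y_centered`` are hypothetical.

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({
        "group": ["A", "A", "A", "B", "B", "B"],
        "y": [2.0, 4.0, 6.0, 1.0, 2.0, 1.5],
    })

    # Shape-preserving UDF: subtracts the group mean from every value
    def demean(values):
        return values - values.mean()

    # The result has one value per input row, aligned with the original index
    df["y_centered"] = df.groupby("group")["y"].transform(demean)

Because the output keeps the original index, it can be assigned straight back
to the DataFrame as a new column.
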
:meth:`DataFrame.filter`
~~~~~~~~~~~~~~~~~~~~~~~~

The :meth:`filter` method is used to select a subset of the DataFrame’s
columns or rows based on their labels. It is useful when you want to extract
specific columns or rows whose names match particular conditions.

When to use: Use :meth:`filter` when you want to use a UDF to create a subset of a DataFrame or Series.

.. note::
    :meth:`DataFrame.filter` does not accept UDFs, but can accept
    list comprehensions that have UDFs applied to them.

.. ipython:: python

    # Sample DataFrame
    df = pd.DataFrame({
        'AA': [1, 2, 3],
        'BB': [4, 5, 6],
        'C': [7, 8, 9],
        'D': [10, 11, 12]
    })

    # Function that keeps columns whose name is longer than 1 character
    def is_long_name(column_name):
        return len(column_name) > 1

    df_filtered = df.filter(items=[col for col in df.columns if is_long_name(col)])
    print(df_filtered)

Since :meth:`filter` does not directly accept a UDF, you have to apply the UDF indirectly,
for example, by using list comprehensions.

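On a groupby object, by contrast, ``filter`` does take a UDF, as noted in the
table of methods. A minimal sketch, assuming hypothetical sample data and a
hypothetical predicate ``has_large_mean``:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({
        "group": ["A", "A", "A", "B", "B", "B"],
        "y": [2.0, 4.0, 6.0, 1.0, 2.0, 1.5],
    })

    # Called once per group; the group is dropped when this returns False
    def has_large_mean(group):
        return group["y"].mean() > 2

    kept = df.groupby("group").filter(has_large_mean)

Here group ``A`` has a mean of 4 and is kept, while group ``B`` has a mean of
1.5 and is removed.
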
:meth:`DataFrame.map`
~~~~~~~~~~~~~~~~~~~~~

The :meth:`map` method is used specifically to apply element-wise UDFs.

When to use: Use :meth:`map` for applying element-wise UDFs to DataFrames or Series.

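Element-wise UDFs often need to cope with missing values. The sketch below uses
:meth:`Series.map` with ``na_action="ignore"`` so that the function is not
called for missing entries; the function name ``shout`` and the sample data are
hypothetical.

.. code-block:: python

    import pandas as pd

    s = pd.Series(["apple", None, "cherry"])

    # Element-wise UDF that would fail if it received a missing value
    def shout(text):
        return text.upper() + "!"

    # na_action="ignore" propagates missing values without calling the UDF
    result = s.map(shout, na_action="ignore")
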
:meth:`DataFrame.pipe`
~~~~~~~~~~~~~~~~~~~~~~

The :meth:`pipe` method is useful for chaining operations together into a clean and readable pipeline.
It is a helpful tool for organizing complex data processing workflows.

When to use: Use :meth:`pipe` when you need to create a pipeline of operations and want to keep the code readable and maintainable.


Performance
-----------

While UDFs provide flexibility, their use is generally discouraged as they can introduce
performance issues, especially when written in pure Python. To improve efficiency,
consider using built-in ``NumPy`` or ``pandas`` functions instead of UDFs
for common operations.

.. note::
    If performance is critical, explore **vectorized operations** before resorting
    to UDFs.

Vectorized Operations
~~~~~~~~~~~~~~~~~~~~~

Below is a comparison of using UDFs versus using Vectorized Operations:

.. code-block:: python

    # Assumes ``df`` is a DataFrame with numeric columns "one" and "two"

    # User-defined function
    def calc_ratio(row):
        return 100 * (row["one"] / row["two"])

    df["new_col"] = df.apply(calc_ratio, axis=1)

    # Vectorized Operation
    df["new_col2"] = 100 * (df["one"] / df["two"])

Measuring how long each operation takes:

.. code-block:: text

    User-defined function:  5.6435 secs
    Vectorized:             0.0043 secs

Vectorized operations in pandas are significantly faster than using :meth:`DataFrame.apply`
with UDFs because they leverage highly optimized C functions
via ``NumPy`` to process entire arrays at once. This approach avoids the overhead of looping
through rows in Python and making separate function calls for each row, which is slow and
inefficient. Additionally, ``NumPy`` arrays benefit from memory efficiency and CPU-level
optimizations, making vectorized operations the preferred choice whenever possible.


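A comparison of this kind can be reproduced with a small timing harness such as
the sketch below. The data size (``n_rows``), the random values and the use of
``time.perf_counter`` are assumptions made for this illustration; they are not
the setup behind the timings quoted above, and absolute numbers depend on the
machine.

.. code-block:: python

    import time

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n_rows = 10_000  # assumed size, chosen only to keep the sketch quick
    df = pd.DataFrame({
        "one": rng.random(n_rows),
        "two": rng.random(n_rows) + 1.0,  # offset avoids division by zero
    })

    def calc_ratio(row):
        return 100 * (row["one"] / row["two"])

    # Row-wise UDF: one Python function call per row
    start = time.perf_counter()
    udf_result = df.apply(calc_ratio, axis=1)
    udf_secs = time.perf_counter() - start

    # Vectorized: a single operation over whole columns
    start = time.perf_counter()
    vec_result = 100 * (df["one"] / df["two"])
    vec_secs = time.perf_counter() - start

    print(f"User-defined function: {udf_secs:.4f} secs")
    print(f"Vectorized:            {vec_secs:.4f} secs")

Both approaches compute the same values; only the time taken differs.
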
Improving Performance with UDFs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In scenarios where UDFs are necessary, there are still ways to mitigate their performance drawbacks.
One approach is to use **Numba**, a Just-In-Time (JIT) compiler that can significantly speed up numerical
Python code by compiling Python functions to optimized machine code at runtime.

By annotating your UDFs with ``@numba.jit``, you can achieve performance closer to vectorized operations,
especially for computationally heavy tasks.

.. note::
    You may also refer to the user guide on `Enhancing performance <https://pandas.pydata.org/pandas-docs/dev/user_guide/enhancingperf.html#numba-jit-compilation>`_
    for a more detailed guide to using **Numba**.

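A minimal sketch of this approach is shown below. It assumes Numba may or may
not be installed: the ``try``/``except`` fallback is an addition for this
illustration so that the code still runs, without any speed-up, when Numba is
missing. The function name ``ratio_loop`` and the sample data are hypothetical.
The compiled function works on NumPy arrays rather than on pandas objects.

.. code-block:: python

    import numpy as np
    import pandas as pd

    try:
        import numba
        jit = numba.jit(nopython=True)
    except ImportError:
        # Fallback so the sketch still runs without Numba (no speed-up then)
        def jit(func):
            return func

    @jit
    def ratio_loop(one, two):
        # Explicit loop over NumPy arrays; Numba compiles it to machine code
        out = np.empty(one.shape[0])
        for i in range(one.shape[0]):
            out[i] = 100.0 * (one[i] / two[i])
        return out

    df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]})

    # Pass the underlying arrays, not the Series themselves
    df["ratio"] = ratio_loop(df["one"].to_numpy(), df["two"].to_numpy())

The first call includes compilation time, so any benefit typically shows up
only on larger inputs or repeated calls.
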
Using :meth:`DataFrame.pipe` for Composable Logic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Another useful pattern for improving readability and composability, especially when mixing
vectorized logic with UDFs, is to use the :meth:`DataFrame.pipe` method.

:meth:`DataFrame.pipe` doesn't improve performance directly, but it enables cleaner
method chaining by passing the entire object into a function. This is especially helpful
when chaining custom transformations:

.. code-block:: python

    # Assumes ``df`` is a DataFrame with numeric columns "one" and "two"

    def add_ratio_column(df):
        # Return a new DataFrame instead of modifying the input in place
        return df.assign(ratio=100 * (df["one"] / df["two"]))

    df = (
        df
        .query("one > 0")
        .pipe(add_ratio_column)
        .dropna()
    )

The ``.pipe(add_ratio_column)`` step is functionally equivalent to calling ``add_ratio_column``
on the intermediate result, but keeps your code clean and composable. The function you pass to
:meth:`DataFrame.pipe` can use vectorized operations, row-wise UDFs, or any other logic;
:meth:`DataFrame.pipe` is agnostic.

.. note::
    While :meth:`DataFrame.pipe` does not improve performance on its own,
    it promotes clean, modular design and allows both vectorized and UDF-based logic
    to be composed in method chains.
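
:meth:`DataFrame.pipe` also forwards additional positional and keyword arguments
to the function, which keeps parameterized steps inside the chain. A minimal
sketch, assuming hypothetical sample data and hypothetical parameter names
``numerator``, ``denominator`` and ``scale``:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"one": [1.0, 2.0, -1.0], "two": [4.0, 5.0, 2.0]})

    def add_ratio_column(frame, numerator, denominator, scale=100):
        # assign returns a new DataFrame and leaves the input untouched
        return frame.assign(ratio=scale * (frame[numerator] / frame[denominator]))

    # Chained form: extra arguments after the function are passed through
    result = (
        df
        .query("one > 0")
        .pipe(add_ratio_column, "one", "two", scale=100)
    )

    # Equivalent nested call, shown for comparison
    nested = add_ratio_column(df.query("one > 0"), "one", "two", scale=100)
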
