|
| 1 | +.. _user_defined_functions: |
| 2 | + |
| 3 | +{{ header }} |
| 4 | + |
| 5 | +***************************** |
| 6 | +User-Defined Functions (UDFs) |
| 7 | +***************************** |
| 8 | + |
| 9 | +In pandas, User-Defined Functions (UDFs) provide a way to extend the library’s |
| 10 | +functionality by allowing users to apply custom computations to their data. While |
| 11 | +pandas comes with a set of built-in functions for data manipulation, UDFs offer |
| 12 | +flexibility when built-in methods are not sufficient. These functions can be |
| 13 | +applied at different levels: element-wise, row-wise, column-wise, or group-wise, |
| 14 | +and behave differently, depending on the method used. |
| 15 | + |
| 16 | +Here’s a simple example to illustrate a UDF applied to a Series: |
| 17 | + |
| 18 | +.. ipython:: python |
| 19 | + |
| 20 | +    s = pd.Series([1, 2, 3]) |
| 21 | + |
| 22 | +    # Simple UDF that adds 1 to a value |
| 23 | +    def add_one(x): |
| 24 | +        return x + 1 |
| 25 | + |
| 26 | +    # Apply the function element-wise using .map |
| 27 | +    s.map(add_one) |
| 28 | + |
| 29 | +You can also apply UDFs to an entire DataFrame. For example: |
| 30 | + |
| 31 | +.. ipython:: python |
| 32 | + |
| 33 | +    df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]}) |
| 34 | + |
| 35 | +    # UDF that takes a row and returns the sum of columns A and B |
| 36 | +    def sum_row(row): |
| 37 | +        return row["A"] + row["B"] |
| 38 | + |
| 39 | +    # Apply the function row-wise (axis=1 means apply across columns per row) |
| 40 | +    df.apply(sum_row, axis=1) |
| 41 | + |
| 42 | + |
| 43 | +Why Not To Use User-Defined Functions |
| 44 | +------------------------------------- |
| 45 | + |
| 46 | +While UDFs provide flexibility, they come with significant drawbacks, primarily |
| 47 | +related to performance and behavior. When using UDFs, pandas must infer the type |
| 48 | +and shape of the result, and that inference can be incorrect. Furthermore, unlike vectorized operations, |
| 49 | +UDFs are slower because pandas can't optimize their computations, leading to |
| 50 | +inefficient processing. |
| 51 | + |
| 52 | +.. note:: |
| 53 | +    In general, most tasks can and should be accomplished using pandas’ built-in methods or vectorized operations. |
| 54 | + |
| 55 | +Despite their drawbacks, UDFs can be helpful when: |
| 56 | + |
| 57 | +* **Custom Computations Are Needed**: Implementing complex logic or domain-specific calculations that pandas' |
| 58 | + built-in methods cannot handle. |
| 59 | +* **Extending pandas' Functionality**: Applying external libraries or specialized algorithms unavailable in pandas. |
| 60 | +* **Handling Complex Grouped Operations**: Performing operations on grouped data that standard methods do not support. |
| 61 | + |
| 62 | +For example: |
| 63 | + |
| 64 | +.. code-block:: python |
| 65 | + |
| 66 | +    from sklearn.linear_model import LinearRegression |
| 67 | + |
| 68 | +    # Sample data |
| 69 | +    df = pd.DataFrame({ |
| 70 | +        'group': ['A', 'A', 'A', 'B', 'B', 'B'], |
| 71 | +        'x': [1, 2, 3, 1, 2, 3], |
| 72 | +        'y': [2, 4, 6, 1, 2, 1.5] |
| 73 | +    }) |
| 74 | + |
| 75 | +    # Function to fit a model to each group; returns a new frame instead of mutating the group |
| 76 | +    def fit_model(group): |
| 77 | +        model = LinearRegression() |
| 78 | +        model.fit(group[['x']], group['y']) |
| 79 | +        pred = model.predict(group[['x']]) |
| 80 | +        return group.assign(y_pred=pred) |
| 81 | + |
| 82 | +    result = df.groupby('group').apply(fit_model) |
| 83 | + |
| 84 | + |
| 85 | +Methods that support User-Defined Functions |
| 86 | +------------------------------------------- |
| 87 | + |
| 88 | +User-Defined Functions can be applied across various pandas methods: |
| 89 | + |
| 90 | ++----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+ |
| 91 | +| Method | Function Input | Function Output | Description | |
| 92 | ++============================+========================+==========================+==============================================================================================================================================+ |
| 93 | +| :meth:`map` | Scalar | Scalar | Apply a function to each element | |
| 94 | ++----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+ |
| 95 | +| :meth:`apply` (axis=0) | Column (Series) | Column (Series) | Apply a function to each column | |
| 96 | ++----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+ |
| 97 | +| :meth:`apply` (axis=1) | Row (Series) | Row (Series) | Apply a function to each row | |
| 98 | ++----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+ |
| 99 | +| :meth:`agg` | Series/DataFrame | Scalar or Series | Aggregate and summarize values, e.g., sum or custom reducer |
| 100 | ++----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+ |
| 101 | +| :meth:`transform` (axis=0) | Column (Series) | Column (Series) | Same as :meth:`apply` with (axis=0), but it raises an exception if the function changes the shape of the data |
| 102 | ++----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+ |
| 103 | +| :meth:`transform` (axis=1) | Row (Series) | Row (Series) | Same as :meth:`apply` with (axis=1), but it raises an exception if the function changes the shape of the data | |
| 104 | ++----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+ |
| 105 | +| :meth:`filter` | Series or DataFrame | Boolean | Only accepts UDFs in groupby. The function is called for each group, and the group is removed from the result if the function returns ``False`` |
| 106 | ++----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+ |
| 107 | +| :meth:`pipe` | Series/DataFrame | Series/DataFrame | Chain functions together to apply to Series or DataFrame |
| 108 | ++----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+ |
| 109 | + |
| 110 | +When applying UDFs in pandas, it is essential to select the appropriate method based |
| 111 | +on your specific task. Each method has its strengths and is designed for different use |
| 112 | +cases. Understanding the purpose and behavior of each method will help you make informed |
| 113 | +decisions, ensuring more efficient and maintainable code. |
| 114 | + |
| 115 | +.. note:: |
| 116 | +    Some of these methods can also be applied to groupby, resample, and various window objects. |
| 117 | +    See :ref:`groupby`, :ref:`resample()<timeseries>`, :ref:`rolling()<window>`, :ref:`expanding()<window>`, |
| 118 | +    and :ref:`ewm()<window>` for details. |
| 119 | + |
| 120 | + |
| 121 | +:meth:`DataFrame.apply` |
| 122 | +~~~~~~~~~~~~~~~~~~~~~~~ |
| 123 | + |
| 124 | +The :meth:`apply` method allows you to apply UDFs along either rows or columns. While flexible, |
| 125 | +it is slower than vectorized operations and should be used only when you need operations |
| 126 | +that cannot be achieved with built-in pandas functions. |
| 127 | + |
| 128 | +When to use: :meth:`apply` is suitable when no alternative vectorized method or UDF method is available, |
| 129 | +but consider optimizing performance with vectorized operations wherever possible. |
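
As a minimal sketch of column-wise use, the example below passes a hypothetical helper, ``center``, to :meth:`DataFrame.apply` with the default ``axis=0``. Each column arrives as a Series, and a Series of the same length is returned. The data and the helper name are illustrative assumptions, and this particular computation could also be written without a UDF as ``df - df.mean()``.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})


def center(column):
    """Subtract the column mean from every value in the column (a Series)."""
    return column - column.mean()


# axis=0 is the default: the UDF is called once per column
centered = df.apply(center)
print(centered)
```

Because the UDF is called once per column rather than once per element, this form is usually cheaper than a row-wise ``axis=1`` call on the same data.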
| 130 | + |
| 131 | +:meth:`DataFrame.agg` |
| 132 | +~~~~~~~~~~~~~~~~~~~~~ |
| 133 | + |
| 134 | +If you need to aggregate data, :meth:`agg` is a better choice than :meth:`apply` because it is |
| 135 | +specifically designed for aggregation operations. |
| 136 | + |
| 137 | +When to use: Use :meth:`agg` for performing custom aggregations, where the operation returns |
| 138 | +a scalar value for each input. |
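
A custom aggregation can be sketched as follows. The reducer ``value_range`` is a hypothetical name introduced for illustration; it receives each column as a Series and returns one scalar, so the result has one entry per column.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 40]})


def value_range(column):
    """Reduce a column (Series) to a single scalar: max minus min."""
    return column.max() - column.min()


# The UDF is called once per column and each call returns a scalar
ranges = df.agg(value_range)
print(ranges)
```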
| 139 | + |
| 140 | +:meth:`DataFrame.transform` |
| 141 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 142 | + |
| 143 | +The :meth:`transform` method is ideal for performing element-wise transformations while preserving the shape of the original DataFrame. |
| 144 | +It is generally faster than apply because it can take advantage of pandas' internal optimizations. |
| 145 | + |
| 146 | +When to use: When you need to perform element-wise transformations that retain the original structure of the DataFrame. |
| 147 | + |
| 148 | +.. code-block:: python |
| 149 | + |
| 150 | +    from sklearn.linear_model import LinearRegression |
| 151 | + |
| 152 | +    df = pd.DataFrame({ |
| 153 | +        'group': ['A', 'A', 'A', 'B', 'B', 'B'], |
| 154 | +        'x': [1, 2, 3, 1, 2, 3], |
| 155 | +        'y': [2, 4, 6, 1, 2, 1.5] |
| 156 | +    }).set_index("x") |
| 157 | + |
| 158 | +    # Function to fit a model to each group |
| 159 | +    def fit_model(group): |
| 160 | +        x = group.index.to_frame() |
| 161 | +        y = group |
| 162 | +        model = LinearRegression() |
| 163 | +        model.fit(x, y) |
| 164 | +        pred = model.predict(x) |
| 165 | +        return pred |
| 166 | + |
| 167 | +    result = df.groupby('group').transform(fit_model) |
| 168 | + |
| 169 | +:meth:`DataFrame.filter` |
| 170 | +~~~~~~~~~~~~~~~~~~~~~~~~ |
| 171 | + |
| 172 | +The :meth:`filter` method is used to select a subset of the DataFrame’s |
| 173 | +columns or rows by label. It is useful when you want to extract specific columns or rows |
| 174 | +whose labels match particular conditions. |
| 175 | + |
| 176 | +When to use: Use :meth:`filter` when you want to use a UDF to create a subset of a DataFrame or Series. |
| 177 | + |
| 178 | +.. note:: |
| 179 | +    :meth:`DataFrame.filter` does not accept UDFs, but can accept |
| 180 | +    list comprehensions that have UDFs applied to them. |
| 181 | + |
| 182 | +.. ipython:: python |
| 183 | + |
| 184 | +    # Sample DataFrame |
| 185 | +    df = pd.DataFrame({ |
| 186 | +        'AA': [1, 2, 3], |
| 187 | +        'BB': [4, 5, 6], |
| 188 | +        'C': [7, 8, 9], |
| 189 | +        'D': [10, 11, 12] |
| 190 | +    }) |
| 191 | + |
| 192 | +    # Function that returns True for column names longer than 1 character (these columns are kept) |
| 193 | +    def is_long_name(column_name): |
| 194 | +        return len(column_name) > 1 |
| 195 | + |
| 196 | +    df_filtered = df.filter(items=[col for col in df.columns if is_long_name(col)]) |
| 197 | +    print(df_filtered) |
| 198 | + |
| 199 | +Since filter does not directly accept a UDF, you have to apply the UDF indirectly, |
| 200 | +for example, by using list comprehensions. |
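
By contrast, the groupby form of ``filter`` listed in the table above does take a UDF directly. The sketch below assumes a small illustrative frame and a hypothetical predicate, ``has_large_total``; the function is called once per group and must return a single boolean.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "value": [1, 2, 10, 20],
})


def has_large_total(group):
    """Return True if the group's values sum to more than 5."""
    return group["value"].sum() > 5


# Rows of groups for which the UDF returns False are dropped
kept = df.groupby("group").filter(has_large_total)
print(kept)
```

Only the rows belonging to group ``B`` remain, because the values in group ``A`` sum to 3.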
| 201 | + |
| 202 | +:meth:`DataFrame.map` |
| 203 | +~~~~~~~~~~~~~~~~~~~~~ |
| 204 | + |
| 205 | +The :meth:`map` method is used specifically to apply element-wise UDFs. |
| 206 | + |
| 207 | +When to use: Use :meth:`map` for applying element-wise UDFs to DataFrames or Series. |
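
A short sketch of element-wise use on both a DataFrame and a Series follows. The helper ``parity`` and the data are illustrative assumptions. The fallback to ``applymap`` is only there in case an older pandas release without :meth:`DataFrame.map` is installed.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})


def parity(value):
    """Label a single scalar as 'even' or 'odd'."""
    return "even" if value % 2 == 0 else "odd"


# Assumption: DataFrame.map may be missing on older pandas, so fall back to applymap
element_wise = df.map if hasattr(df, "map") else df.applymap

labels = element_wise(parity)          # UDF called once per cell
series_labels = df["A"].map(parity)    # UDF called once per element
print(labels)
```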
| 208 | + |
| 209 | +:meth:`DataFrame.pipe` |
| 210 | +~~~~~~~~~~~~~~~~~~~~~~ |
| 211 | + |
| 212 | +The :meth:`pipe` method is useful for chaining operations together into a clean and readable pipeline. |
| 213 | +It is a helpful tool for organizing complex data processing workflows. |
| 214 | + |
| 215 | +When to use: Use :meth:`pipe` when you need to create a pipeline of operations and want to keep the code readable and maintainable. |
| 216 | + |
| 217 | + |
| 218 | +Performance |
| 219 | +----------- |
| 220 | + |
| 221 | +While UDFs provide flexibility, their use is generally discouraged as they can introduce |
| 222 | +performance issues, especially when written in pure Python. To improve efficiency, |
| 223 | +consider using built-in ``NumPy`` or ``pandas`` functions instead of UDFs |
| 224 | +for common operations. |
| 225 | + |
| 226 | +.. note:: |
| 227 | +    If performance is critical, explore **vectorized operations** before resorting |
| 228 | +    to UDFs. |
| 229 | + |
| 230 | +Vectorized Operations |
| 231 | +~~~~~~~~~~~~~~~~~~~~~ |
| 232 | + |
| 233 | +Below is a comparison of using UDFs versus using Vectorized Operations: |
| 234 | + |
| 235 | +.. code-block:: python |
| 236 | + |
| 237 | +    # User-defined function |
| 238 | +    def calc_ratio(row): |
| 239 | +        return 100 * (row["one"] / row["two"]) |
| 240 | + |
| 241 | +    df["new_col"] = df.apply(calc_ratio, axis=1) |
| 242 | + |
| 243 | +    # Vectorized Operation |
| 244 | +    df["new_col2"] = 100 * (df["one"] / df["two"]) |
| 245 | + |
| 246 | +Measuring how long each operation takes: |
| 247 | + |
| 248 | +.. code-block:: text |
| 249 | + |
| 250 | +    User-defined function: 5.6435 secs |
| 251 | +    Vectorized: 0.0043 secs |
| 252 | + |
| 253 | +Vectorized operations in pandas are significantly faster than using :meth:`DataFrame.apply` |
| 254 | +with UDFs because they leverage highly optimized C functions |
| 255 | +via ``NumPy`` to process entire arrays at once. This approach avoids the overhead of looping |
| 256 | +through rows in Python and making separate function calls for each row, which is slow and |
| 257 | +inefficient. Additionally, ``NumPy`` arrays benefit from memory efficiency and CPU-level |
| 258 | +optimizations, making vectorized operations the preferred choice whenever possible. |
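
One way such a comparison could be reproduced is sketched below with the standard-library ``timeit`` module. The frame size, the random data, and the single repetition are assumptions chosen for illustration, not the setup behind the figures quoted above, so the absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 10,000 rows; "two" is shifted to avoid division by zero
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "one": rng.random(10_000),
    "two": rng.random(10_000) + 1.0,
})


def calc_ratio(row):
    return 100 * (row["one"] / row["two"])


# Time one run of each approach
udf_secs = timeit.timeit(lambda: df.apply(calc_ratio, axis=1), number=1)
vec_secs = timeit.timeit(lambda: 100 * (df["one"] / df["two"]), number=1)

print(f"User-defined function: {udf_secs:.4f} secs")
print(f"Vectorized: {vec_secs:.4f} secs")
```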
| 259 | + |
| 260 | + |
| 261 | +Improving Performance with UDFs |
| 262 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 263 | + |
| 264 | +In scenarios where UDFs are necessary, there are still ways to mitigate their performance drawbacks. |
| 265 | +One approach is to use **Numba**, a Just-In-Time (JIT) compiler that can significantly speed up numerical |
| 266 | +Python code by compiling Python functions to optimized machine code at runtime. |
| 267 | + |
| 268 | +By annotating your UDFs with ``@numba.jit``, you can achieve performance closer to vectorized operations, |
| 269 | +especially for computationally heavy tasks. |
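
A minimal sketch of this approach is shown below. It assumes Numba is installed and falls back to the plain Python function when it is not. The function ``ratio_loop`` and the data are hypothetical. The compiled function works on NumPy arrays, so the columns are converted with ``to_numpy()`` before the call.

```python
import numpy as np
import pandas as pd

try:
    import numba

    jit = numba.jit(nopython=True)
except ImportError:
    # Assumption for this sketch: without Numba, run the undecorated function
    def jit(func):
        return func


@jit
def ratio_loop(one, two):
    """Explicit loop over two 1-D float arrays; compiled when Numba is available."""
    out = np.empty(one.shape[0])
    for i in range(one.shape[0]):
        out[i] = 100.0 * one[i] / two[i]
    return out


df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [2.0, 4.0, 12.0]})
df["ratio"] = ratio_loop(df["one"].to_numpy(), df["two"].to_numpy())
print(df)
```

The first call includes compilation time, so any benefit shows up only on repeated calls or on large inputs.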
| 270 | + |
| 271 | +.. note:: |
| 272 | +    You may also refer to the user guide on `Enhancing performance <https://pandas.pydata.org/pandas-docs/dev/user_guide/enhancingperf.html#numba-jit-compilation>`_ |
| 273 | +    for a more detailed guide to using **Numba**. |
| 274 | + |
| 275 | +Using :meth:`DataFrame.pipe` for Composable Logic |
| 276 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 277 | + |
| 278 | +Another useful pattern for improving readability and composability, especially when mixing |
| 279 | +vectorized logic with UDFs, is to use the :meth:`DataFrame.pipe` method. |
| 280 | + |
| 281 | +:meth:`DataFrame.pipe` doesn't improve performance directly, but it enables cleaner |
| 282 | +method chaining by passing the entire object into a function. This is especially helpful |
| 283 | +when chaining custom transformations: |
| 284 | + |
| 285 | +.. code-block:: python |
| 286 | + |
| 287 | +    def add_ratio_column(df): |
| 288 | +        # Return a new DataFrame rather than mutating the one passed in |
| 289 | +        return df.assign(ratio=100 * (df["one"] / df["two"])) |
| 290 | + |
| 291 | +    df = ( |
| 292 | +        df |
| 293 | +        .query("one > 0") |
| 294 | +        .pipe(add_ratio_column) |
| 295 | +        .dropna() |
| 296 | +    ) |
| 297 | + |
| 298 | +This is functionally equivalent to calling ``add_ratio_column(df)``, but keeps your code |
| 299 | +clean and composable. The function you pass to :meth:`DataFrame.pipe` can use vectorized operations, |
| 300 | +row-wise UDFs, or any other logic; :meth:`DataFrame.pipe` is agnostic. |
| 301 | + |
| 302 | +.. note:: |
| 303 | +    While :meth:`DataFrame.pipe` does not improve performance on its own, |
| 304 | +    it promotes clean, modular design and allows both vectorized and UDF-based logic |
| 305 | +    to be composed in method chains. |