Commit 5bd1c9c

Merge pull request #68 from thomasjpfan/pandas_out
SLEP018 Pandas output for transformers with set_output
2 parents c86f619 + 68ada33 commit 5bd1c9c

2 files changed: +137 -0 lines changed

index.rst

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@
 
     slep012/proposal
     slep013/proposal
+    slep018/proposal
 
 .. toctree::
     :maxdepth: 1

slep018/proposal.rst

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
.. _slep_018:

=======================================================
SLEP018: Pandas Output for Transformers with set_output
=======================================================

:Author: Thomas J. Fan
:Status: Draft
:Type: Standards Track
:Created: 2022-06-22

Abstract
--------

This SLEP proposes a ``set_output`` method to configure the output data container of
scikit-learn transformers.

Detailed description
--------------------

Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse
matrices. This SLEP proposes adding a ``set_output`` method to configure a
transformer to output pandas DataFrames::

    scaler = StandardScaler().set_output(transform="pandas")
    scaler.fit(X_df)

    # X_trans_df is a pandas DataFrame
    X_trans_df = scaler.transform(X_df)

The index of the output DataFrame must match the index of the input. If the
transformer does not support ``transform="pandas"``, then it must raise a
``ValueError`` stating that it does not support the feature.

This SLEP focuses only on dense data for ``set_output``. If a transformer returns
sparse data, e.g. ``OneHotEncoder(sparse=True)``, then ``transform`` will raise a
``ValueError`` if ``set_output(transform="pandas")`` has been called. Dealing with
sparse output may be addressed by a future SLEP.

For a pipeline, calling ``set_output`` on the pipeline will configure all steps
in the pipeline::

    num_preprocessor = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
    num_preprocessor.set_output(transform="pandas")

    # X_trans_df is a pandas DataFrame
    X_trans_df = num_preprocessor.fit_transform(X_df)

    # X_trans_df is again a pandas DataFrame
    X_trans_df = num_preprocessor[0].transform(X_df)

Meta-estimators that support ``set_output`` are required to configure all inner
transformers by calling ``set_output``. Specifically, all fitted and non-fitted
inner transformers must be configured with ``set_output``. This enables
``transform``'s output to be a DataFrame before and after the meta-estimator is
fitted. If an inner transformer does not define ``set_output``, then an error is
raised.


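The forwarding rule for meta-estimators can be sketched in plain Python. Every class below is hypothetical and exists only for illustration; it is not scikit-learn code. The proposal does not name the error type, so ``ValueError`` is an assumption.

```python
class SketchMetaEstimator:
    """Hypothetical meta-estimator holding a list of inner transformers."""

    def __init__(self, transformers):
        self.transformers = transformers

    def set_output(self, *, transform=None):
        # Validate every inner transformer first so that a failure leaves
        # none of them partially configured.
        for inner in self.transformers:
            if not hasattr(inner, "set_output"):
                # Assumption: the SLEP only says "an error is raised".
                raise ValueError(
                    f"{type(inner).__name__} does not define set_output"
                )
        # Forward the setting to all inner transformers, fitted or not.
        for inner in self.transformers:
            inner.set_output(transform=transform)
        return self


class WithSetOutput:
    """Hypothetical transformer that records the requested container."""

    def __init__(self):
        self.output_container = "default"

    def set_output(self, *, transform=None):
        self.output_container = transform
        return self


class WithoutSetOutput:
    """Hypothetical transformer that has not opted in to the API."""


inner = [WithSetOutput(), WithSetOutput()]
SketchMetaEstimator(inner).set_output(transform="pandas")

try:
    SketchMetaEstimator([WithSetOutput(), WithoutSetOutput()]).set_output(
        transform="pandas"
    )
    failed = False
except ValueError:
    failed = True
```
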
Global Configuration
....................

For ease of use, this SLEP proposes a global configuration flag that sets the output for all
transformers::

    import sklearn
    sklearn.set_config(transform_output="pandas")

The global default configuration is ``"default"`` where the transformer
determines the output container.

The configuration can also be set locally using the ``config_context`` context
manager::

    from sklearn import config_context
    with config_context(transform_output="pandas"):
        num_preprocessor = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
        num_preprocessor.fit_transform(X_df)

The following specifies the precedence levels, from highest to lowest, for the
three ways to configure the output container:

1. Locally configure a transformer: ``transformer.set_output``
2. Context manager: ``config_context``
3. Global configuration: ``set_config``

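The precedence order can be expressed as a small resolution function. This is a sketch of the rule only: ``resolve_transform_output`` and its parameters are hypothetical names, and ``None`` is assumed to mean "not set at this level".

```python
def resolve_transform_output(local=None, context=None, global_default="default"):
    """Return the container name from the highest-precedence setting.

    Hypothetical helper: ``local`` stands for ``transformer.set_output``,
    ``context`` for ``config_context``, and ``global_default`` for
    ``set_config``.
    """
    for setting in (local, context):
        if setting is not None:
            return setting
    return global_default


# A local setting wins over both the context manager and the global flag.
local_wins = resolve_transform_output(
    local="default", context="pandas", global_default="pandas"
)
# With no local setting, the context manager wins over the global flag.
context_wins = resolve_transform_output(context="pandas", global_default="default")
# With nothing else set, the global configuration applies.
global_applies = resolve_transform_output(global_default="pandas")
```
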
Implementation
--------------

A possible implementation of this SLEP is worked out in :pr:`23734`.

Backward compatibility
----------------------

There are no backward compatibility concerns, because the ``set_output`` method
is a new API. Third-party transformers can opt in to the API by defining
``set_output``.

Alternatives
------------

Alternatives to this SLEP include:

1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__
   proposes that if the input is a DataFrame then the output is a DataFrame.
2. Prototype `#20100
   <https://github.com/scikit-learn/scikit-learn/pull/20100>`__ showcases
   ``array_out="pandas"`` in ``transform``. This API is limited because it does not
   directly support fitting on a pipeline where the steps require DataFrame
   input.

Discussion
----------

Issues and pull requests discussing pandas output include: `#14315
<https://github.com/scikit-learn/scikit-learn/pull/14315>`__, `#20100
<https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and `#23001
<https://github.com/scikit-learn/scikit-learn/issues/23001>`__. This SLEP
proposes configuring the output to be pandas because it is the DataFrame library
that is most widely used and requested by users. The ``set_output`` API can be
extended to support additional DataFrame libraries in the future.

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open Publication
   License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_
