SLEP018 Pandas output for transformers with set_output #68
@@ -21,6 +21,7 @@

   slep012/proposal
   slep013/proposal
   slep018/proposal

.. toctree::
   :maxdepth: 1

@@ -0,0 +1,138 @@
.. _slep_018:

=======================================================
SLEP018: Pandas Output for Transformers with set_output
=======================================================

:Author: Thomas J. Fan
:Status: Draft
:Type: Standards Track
:Created: 2022-06-22

Abstract
--------

This SLEP proposes a ``set_output`` method to configure the output container of
scikit-learn transformers.

Detailed description
--------------------

Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse
matrices. This SLEP proposes adding a ``set_output`` method to configure a
transformer to output pandas DataFrames::

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().set_output(transform="pandas")
    scaler.fit(X_df)

    # X_trans_df is a pandas DataFrame
    X_trans_df = scaler.transform(X_df)

The index of the output DataFrame must match the index of the input. If the
transformer does not support ``transform="pandas"``, then it must raise a
``ValueError`` stating that it does not support the feature.

Review comment: How is sparse data treated? If I do

Review comment: Error in both cases. I think it's strange not to error. If
From a workflow point of view, the OHE will be in a pipeline, so it will end up
to be one extra configuration ( ::

    preprocessor = ColumnTransformer([("cat", OneHotEncoder(sparse=True), ...])
    pipeline = make_pipeline(preprocessor, ...)
    pipeline.set_output(transform="pandas")

Review comment: You're right, in a pipeline it's not so bad. It might be worth
calling out in the OHE documentation or in the error message? I agree it's
better to be explicit, but it might also be surprising to users who don't know
that the default is

For this SLEP, ``set_output`` will only configure the output for dense data. If
the transformer returns sparse data, then ``transform`` will raise a
``ValueError`` when ``set_output(transform="pandas")`` has been called.

For a pipeline, calling ``set_output`` on the pipeline will configure all steps
in the pipeline::

    num_preprocessor = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
    num_preprocessor.set_output(transform="pandas")

    # X_trans_df is a pandas DataFrame
    X_trans_df = num_preprocessor.fit_transform(X_df)

Meta-estimators that support ``set_output`` are required to configure all inner
transformers by calling ``set_output``. If an inner transformer does not define
``set_output``, then an error is raised.

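The meta-estimator requirement can be sketched in plain Python. Every class
below is a hypothetical stand-in, not scikit-learn code. The exception type
(``ValueError``) is also an assumption, since the text only says that an error
is raised.

```python
class TinyPipeline:
    """Hypothetical stand-in for a meta-estimator; not scikit-learn's Pipeline."""

    def __init__(self, steps):
        self.steps = list(steps)

    def set_output(self, *, transform=None):
        # Check every inner transformer first, so a failure leaves no step
        # half-configured.
        for step in self.steps:
            if not hasattr(step, "set_output"):
                raise ValueError(
                    f"{type(step).__name__} does not define set_output"
                )
        for step in self.steps:
            step.set_output(transform=transform)
        return self


class Configurable:
    """Hypothetical transformer that records its output setting."""

    def __init__(self):
        self.output = "default"

    def set_output(self, *, transform=None):
        if transform is not None:
            self.output = transform
        return self


class Legacy:
    """Hypothetical transformer without set_output."""


first, second = Configurable(), Configurable()
TinyPipeline([first, second]).set_output(transform="pandas")

try:
    TinyPipeline([Configurable(), Legacy()]).set_output(transform="pandas")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Checking all steps before configuring any of them is a design choice of this
sketch, not something the SLEP prescribes.
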
Global Configuration
....................

This SLEP proposes a global configuration flag that sets the output for all
transformers::

    import sklearn
    sklearn.set_config(transform_output="pandas")

The global default configuration is ``"default"``, where the transformer
determines the output container.

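One way to picture how the global flag and a per-estimator setting could
interact is the sketch below. The helper names (``_global_config``,
``resolve_output``) are hypothetical. The precedence shown, where a value set
on the estimator wins over the global flag, is an assumption that the SLEP
text does not spell out.

```python
# Hypothetical module-level configuration; scikit-learn's real storage differs.
_global_config = {"transform_output": "default"}


def set_config(*, transform_output=None):
    """Update the hypothetical global flag."""
    if transform_output is not None:
        _global_config["transform_output"] = transform_output


def resolve_output(estimator_setting=None):
    """Return the container a transformer should use.

    Assumption: a value set through ``set_output`` on the estimator wins
    over the global flag.
    """
    if estimator_setting is not None:
        return estimator_setting
    return _global_config["transform_output"]


before = resolve_output()  # nothing configured anywhere
set_config(transform_output="pandas")
after = resolve_output()  # the global flag applies
overridden = resolve_output("default")  # the estimator-level setting wins
```
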
Implementation
--------------

The implementation of this SLEP is in :pr:`23734`.

Backward compatibility
----------------------

There are no backward compatibility concerns, because the ``set_output`` method
is a new API. Third party transformers can opt in to the API by defining
``set_output``.

Review comment: Do they have to opt-in to respect the global flag?

Review comment: There is no backward compatibility concern as long as we don't
change the default of

Review comment: There isn't a backward compatibility concern, but there is an
issue around whether a third party transformer should respect the global flag.
Concretely::

    sklearn.set_config(transform_output="pandas")
    # Should we require this to be a dataframe?
    third_party_transformer.transform(X_df)

I'm leaning toward letting the library decide if it wants to respect the global
configuration.

Review comment: Exactly that was the case that seemed underspecified to me. I'm
ok to leave it up to the library, which means we won't add it to the common
tests that this errors out if it's not supported. Not sure if it's worth adding
that to the doc as a sentence/half-sentence? Otherwise the doc doesn't really
tell the third party estimator authors what they should be doing.

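A third-party transformer could opt in along the following lines. This is a
minimal sketch: ``Doubler`` and its private attribute are hypothetical, and it
writes the method by hand rather than relying on any scikit-learn base class.

```python
import pandas as pd


class Doubler:
    """Hypothetical third-party transformer that multiplies values by two."""

    def __init__(self):
        self._transform_output = "default"

    def set_output(self, *, transform=None):
        # Mirror the proposed signature; None leaves the setting unchanged.
        if transform not in (None, "default", "pandas"):
            raise ValueError(f"Unsupported transform output: {transform!r}")
        if transform is not None:
            self._transform_output = transform
        return self

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        doubled = X.to_numpy() * 2
        if self._transform_output == "pandas":
            # Keep the input index and column names on the way out.
            return pd.DataFrame(doubled, index=X.index, columns=X.columns)
        return doubled


X_df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])
default_out = Doubler().fit(X_df).transform(X_df)
pandas_out = Doubler().set_output(transform="pandas").fit(X_df).transform(X_df)
```

The sketch does not consult the global flag. That matches the direction of the
review discussion, which leaves the decision to each library.
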
Alternatives
------------

Alternatives to this SLEP include:

1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__
   proposes that if the input is a DataFrame then the output is a DataFrame.
2. :ref:`SLEP012 <slep_012>` proposes a custom scikit-learn container for dense
   and sparse data that contains feature names. This SLEP also proposes a custom
   container for sparse data, but pandas for dense data.
3. Prototype `#20100
   <https://github.com/scikit-learn/scikit-learn/pull/20100>`__ showcases
   ``array_out="pandas"`` in ``transform``. This API is limited because it does
   not directly support fitting on a pipeline where the steps require DataFrame
   input.

Review comment: The sparse data is now in future work, right?

Discussion
----------

Issues and pull requests discussing pandas output include `#14315
<https://github.com/scikit-learn/scikit-learn/pull/14315>`__, `#20100
<https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and `#23001
<https://github.com/scikit-learn/scikit-learn/issues/23001>`__.

Future Extensions
-----------------

Sparse Data
...........

The pandas DataFrame is not suitable to provide column names for sparse data
because it has performance issues as shown in `#16772
<https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097>`__.
A future extension to this SLEP is to have a ``"pandas_or_namedsparse"`` option.
This option will use a scikit-learn specific sparse container that subclasses
SciPy's sparse matrices. This sparse container includes the sparse data, feature
names and index. This enables pipelines with Vectorizers without performance
issues::

    pipe = make_pipeline(
        CountVectorizer(),
        TfidfTransformer(),
        LogisticRegression(solver="liblinear")
    )
    pipe.set_output(transform="pandas_or_namedsparse")

    # feature names for logistic regression
    pipe[-1].feature_names_in_

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open Publication
   License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/

Copyright
---------

This document has been placed in the public domain. [1]_