SLEP018 Pandas output for transformers with set_output #68
Changes from 7 commits
f140b8a
e708c2a
7d1528e
1186280
ac21785
326fcbb
d91fb3a
218b76a
2d23470
eac9f9f
add7a8a
0580391
4370736
1d57415
68ada33
@@ -21,6 +21,7 @@

    slep012/proposal
    slep013/proposal
    slep018/proposal

.. toctree::
    :maxdepth: 1

@@ -0,0 +1,140 @@
.. _slep_018:

=======================================================
SLEP018: Pandas Output for Transformers with set_output
=======================================================

:Author: Thomas J. Fan
:Status: Draft
:Type: Standards Track
:Created: 2022-05-23

Abstract
--------

This SLEP proposes a ``set_output`` method to configure the output container of
scikit-learn transformers.

Detailed description
--------------------

Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse matrices.
This SLEP proposes adding a ``set_output`` method to configure a transformer to output
pandas DataFrames::

    from sklearn.preprocessing import StandardScaler

    # X_df is assumed to be a pandas DataFrame
    scaler = StandardScaler().set_output(transform="pandas")
    scaler.fit(X_df)

    # X_trans_df is a pandas DataFrame
    X_trans_df = scaler.transform(X_df)

The index of the output DataFrame must match the index of the input. For this
SLEP, ``set_output`` will only configure the output for dense data. If a
transformer returns sparse data, then ``transform`` will raise an error if
``set_output`` is set to ``"pandas"``. If a transformer always returns sparse
data, then calling ``set_output(transform="pandas")`` may raise an error.

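The two rules above (the output keeps the input index, and sparse output is rejected when ``"pandas"`` is requested) can be sketched with a toy model. This is not scikit-learn's implementation: ``ToyFrame``, ``ToySparse`` and ``ToyCenterer`` are hypothetical stand-ins introduced only for illustration, and raising ``ValueError`` is an assumption, since the SLEP only says that ``transform`` will error.

```python
from dataclasses import dataclass


@dataclass
class ToyFrame:
    """Hypothetical stand-in for a pandas DataFrame."""
    index: list
    columns: list
    rows: list


class ToySparse(list):
    """Hypothetical marker type standing in for a SciPy sparse matrix."""


class ToyCenterer:
    """Toy transformer that subtracts the per-column mean seen during fit."""

    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
        self._container = "default"

    def set_output(self, *, transform):
        # Record the requested container and return self so calls can chain.
        self._container = transform
        return self

    def fit(self, frame):
        n_rows = len(frame.rows)
        self.means_ = [sum(col) / n_rows for col in zip(*frame.rows)]
        self.columns_ = list(frame.columns)
        return self

    def transform(self, frame):
        rows = [[v - m for v, m in zip(row, self.means_)] for row in frame.rows]
        data = ToySparse(rows) if self.sparse_output else rows
        if self._container != "pandas":
            return data
        if isinstance(data, ToySparse):
            # Assumed exception type; the SLEP does not name one.
            raise ValueError("pandas output is only supported for dense data")
        # The output reuses the input index, as the SLEP requires.
        return ToyFrame(index=list(frame.index), columns=self.columns_, rows=data)


X_df = ToyFrame(index=["a", "b"], columns=["x", "y"], rows=[[1.0, 10.0], [3.0, 30.0]])
out = ToyCenterer().set_output(transform="pandas").fit(X_df).transform(X_df)
```

In this sketch ``out.index`` equals the input index ``["a", "b"]``, while a ``ToyCenterer(sparse_output=True)`` configured the same way raises at ``transform`` time rather than at ``set_output`` time.
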
For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the
pipeline::

    from sklearn.decomposition import PCA
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    num_preprocessor = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
    num_preprocessor.set_output(transform="pandas")

    # X_trans_df is a pandas DataFrame
    X_trans_df = num_preprocessor.fit_transform(X_df)

Review comment: Are there meaningful cases with an intermediate sparse representation (which would not really call for sparse data with column names)???

Review comment: In the context of For If we want to allow for intermediate sparse representations without names, there is a future extension to this SLEP with

Meta-estimators that support ``set_output`` are required to configure all of their
inner estimators by calling ``set_output`` on each of them.

Review comment: All estimators? All transformers?

Review comment: In the context of this SLEP, it does make sense to reduce the scope to "all transformers". I was thinking about a future where we had

Review comment: What is the procedure when this attribute is unavailable on a child estimator? Presumably the child should raise a

Review comment: If In both cases, the parent is not properly configured because one of the children failed the

Review comment: Should this case be in the slep?

Review comment: Part of this is in the SLEP here: https://github.com/thomasjpfan/enhancement_proposals/blob/218b76ad5612663ab9408dd4d824df357dda4d05/slep018/proposal.rst?plain=1#L76-L78 but it makes sense to move it up to this section.

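A minimal sketch of how a meta-estimator might forward ``set_output`` to its children, and fail when a child has not opted in, is shown here. ``ToyStep``, ``LegacyStep`` and ``ToyPipeline`` are hypothetical classes, not scikit-learn's ``Pipeline``. Checking every child before configuring any of them, and raising ``ValueError``, are assumptions made for the example; the SLEP only states that the meta-estimator should error.

```python
class ToyStep:
    """Toy inner transformer that records its configured output container."""

    def __init__(self):
        self.container = "default"

    def set_output(self, *, transform):
        self.container = transform
        return self


class LegacyStep:
    """Toy third-party estimator that has not opted in to set_output."""


class ToyPipeline:
    """Toy meta-estimator holding a list of steps."""

    def __init__(self, *steps):
        self.steps = list(steps)

    def set_output(self, *, transform):
        # Assumption: validate all children first so that a failure does not
        # leave the pipeline half-configured.
        missing = [type(s).__name__ for s in self.steps if not hasattr(s, "set_output")]
        if missing:
            raise ValueError(f"steps without set_output: {missing}")
        for step in self.steps:
            step.set_output(transform=transform)
        return self


# Every step opted in: the setting reaches each child.
pipe = ToyPipeline(ToyStep(), ToyStep()).set_output(transform="pandas")
containers = [step.container for step in pipe.steps]

# One step did not opt in: the call fails and no child is modified.
first = ToyStep()
try:
    ToyPipeline(first, LegacyStep()).set_output(transform="pandas")
    failed = False
except ValueError:
    failed = True
```

Validating up front is one possible design. Letting the first missing attribute raise on its own would be simpler, but could leave earlier steps configured and later steps untouched.
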
Global Configuration
....................

This SLEP proposes a global configuration flag that sets the output for
all transformers::

    import sklearn
    sklearn.set_config(transform_output="pandas")

The global default configuration is ``"default"`` where the estimator determines
the output container.

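One way to picture how a global flag and a per-estimator setting could interact is the toy model below. The precedence rule used here, where an estimator that has called ``set_output`` ignores the global flag, is an assumption made for illustration and is not stated in the text above. ``_GLOBAL_CONFIG``, ``ToyTransformer`` and ``resolved_output`` are hypothetical names, and this ``set_config`` is a local stand-in rather than ``sklearn.set_config``.

```python
# Hypothetical module-level configuration with the proposed default value.
_GLOBAL_CONFIG = {"transform_output": "default"}


def set_config(*, transform_output):
    """Local stand-in for a global configuration setter."""
    _GLOBAL_CONFIG["transform_output"] = transform_output


class ToyTransformer:
    """Toy transformer that resolves its output container lazily."""

    def __init__(self):
        self._local_output = None  # None means "never configured locally"

    def set_output(self, *, transform):
        self._local_output = transform
        return self

    def resolved_output(self):
        # Assumed precedence: a local setting wins over the global flag.
        if self._local_output is not None:
            return self._local_output
        return _GLOBAL_CONFIG["transform_output"]


plain = ToyTransformer()
pinned = ToyTransformer().set_output(transform="default")

before = plain.resolved_output()
set_config(transform_output="pandas")
after = plain.resolved_output()
pinned_after = pinned.resolved_output()
```

Under these assumptions ``plain`` follows the global flag as it changes, while ``pinned`` keeps the container it was given explicitly.
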
Implementation
--------------

A prototype implementation was created to showcase different use cases for this SLEP.
It can be seen in
`this rendered notebook <https://nbviewer.org/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__
and
`this interactive notebook <https://colab.research.google.com/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__.


Backward compatibility
----------------------

There are no backward compatibility concerns, because the ``set_output`` method
is a new API. Third party estimators can opt in to the API by defining
``set_output``. Meta-estimators that define ``set_output`` to configure
their inner estimators with ``set_output`` should error if any of the inner
estimators do not define ``set_output``.

Alternatives
------------

Alternatives to this SLEP include:

1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__
   proposes that if the input is a DataFrame, then the output is a DataFrame.
2. :ref:`SLEP012 <slep_012>` proposes a custom scikit-learn container
   for dense and sparse data that contains feature names. This SLEP
   also proposes a custom container for sparse data, but pandas for dense data.
3. Prototype `#20100 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__
   showcases ``array_out="pandas"`` in ``transform``. This API
   is limited because it does not directly support fitting a pipeline where the
   steps require DataFrame input.

Discussion
----------

Issues and pull requests discussing pandas output include
`#14315 <https://github.com/scikit-learn/scikit-learn/pull/14315>`__,
`#20100 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and
`#23001 <https://github.com/scikit-learn/scikit-learn/issues/23001>`__.

Future Extensions
-----------------

Sparse Data
...........

The pandas DataFrame is not suitable for providing column names for sparse data
because it has performance issues, as shown in
`#16772 <https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097>`__.
A future extension to this SLEP is to have a ``"pandas_or_namedsparse"`` option.
This option will use a scikit-learn specific sparse container that subclasses SciPy's
sparse matrices. This sparse container includes the sparse data, feature names and
index. This enables pipelines with Vectorizers without performance issues::

    pipe = make_pipeline(
        CountVectorizer(),
        TfidfTransformer(),
        LogisticRegression(solver="liblinear")
    )
    pipe.set_output(transform="pandas_or_namedsparse")

    # feature names for logistic regression
    pipe[-1].feature_names_in_

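The idea behind a ``"pandas_or_namedsparse"`` option, choosing a named container based on whether the data is dense or sparse, can be sketched as a simple dispatch. Everything here is hypothetical: ``ToyFrame`` stands in for a pandas DataFrame, ``ToyNamedSparse`` stands in for the proposed scikit-learn sparse container (the real one is described as subclassing SciPy's sparse matrices, which this toy does not do), and ``wrap_output`` is an illustrative helper rather than a proposed API.

```python
from dataclasses import dataclass


@dataclass
class ToyFrame:
    """Hypothetical stand-in for a pandas DataFrame (dense case)."""
    index: list
    columns: list
    rows: list


@dataclass
class ToyNamedSparse:
    """Hypothetical named sparse container: nonzero entries, names and index."""
    index: list
    feature_names: list
    entries: dict  # maps (row, column) to a nonzero value


def wrap_output(rows, *, index, feature_names, sparse):
    """Pick the container by sparsity; both variants keep names and index."""
    if sparse:
        entries = {
            (i, j): value
            for i, row in enumerate(rows)
            for j, value in enumerate(row)
            if value != 0
        }
        return ToyNamedSparse(
            index=list(index), feature_names=list(feature_names), entries=entries
        )
    return ToyFrame(index=list(index), columns=list(feature_names), rows=rows)


counts = [[0, 2, 0], [1, 0, 0]]
names = ["apple", "banana", "cherry"]
docs = ["doc0", "doc1"]

named = wrap_output(counts, index=docs, feature_names=names, sparse=True)
dense = wrap_output(counts, index=docs, feature_names=names, sparse=False)
```

Either container carries the feature names forward, which is what would let a downstream estimator see them without densifying the data.
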
References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_