SLEP018 Pandas output for transformers with set_output #68

Merged
merged 15 commits into from
Jul 17, 2022
1 change: 1 addition & 0 deletions index.rst
@@ -21,6 +21,7 @@

   slep012/proposal
   slep013/proposal
   slep018/proposal

.. toctree::
   :maxdepth: 1
140 changes: 140 additions & 0 deletions slep018/proposal.rst
@@ -0,0 +1,140 @@
.. _slep_018:

=======================================================
SLEP018: Pandas Output for Transformers with set_output
=======================================================

:Author: Thomas J. Fan
:Status: Draft
:Type: Standards Track
:Created: 2022-05-23

Abstract
--------

This SLEP proposes a ``set_output`` method to configure the output container of
scikit-learn transformers.

Detailed description
--------------------

Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse matrices.
This SLEP proposes adding a ``set_output`` method to configure a transformer to output
pandas DataFrames::

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().set_output(transform="pandas")
    scaler.fit(X_df)

    # X_trans_df is a pandas DataFrame
    X_trans_df = scaler.transform(X_df)

The index of the output DataFrame must match the index of the input. For this
SLEP, ``set_output`` will only configure the output for dense data. If a
transformer returns sparse data, then ``transform`` will error if ``set_output``
is set to ``"pandas"``. If a transformer always returns sparse data, then calling
``set_output(transform="pandas")`` may raise an error.

For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the
Member

Are there meaningful cases with an intermediate sparse representation (which would not really call for sparse data with column names)???

Member Author

In the context of Pipeline, the only use case I can think of is when one has no use for feature names after the step that returns the sparse matrix without feature names.

For ColumnTransformer with a OneHotEncoder that outputs sparse data, OneHotEncoder is not required to output a "named sparse matrix" because ColumnTransformer can figure it out by calling get_feature_names_out.

If we want to allow for intermediate sparse representations without names, there is a future extension to this SLEP with set_output="pandas_or_sparse". This configures dense container to be dataframe, sparse container to be SciPy sparse.

pipeline::

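    from sklearn.decomposition import PCA
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    num_preprocessor = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
    num_preprocessor.set_output(transform="pandas")

    # X_trans_df is a pandas DataFrame
    X_trans_df = num_preprocessor.fit_transform(X_df)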

Meta-estimators that support ``set_output`` are required to configure all estimators
Member

All estimators? All transformers?

Member Author

In the context of this SLEP, it does make sense to reduce the scope to "all transformers". I was thinking about a future where we had set_output(predict="pandas") for non-transformers.

by calling ``set_output``.
Member

What is the procedure when this attribute is unavailable on a child estimator? Presumably the child should raise a ValueError if the value "pandas" is not supported. What should the parent do then?

Member Author

If set_output is not defined for the child estimator, then the parent errors. If the child does not support "pandas", then the child should error.

In both cases, the parent is not properly configured because one of the children failed the set_output call. From a user's point of view, they cannot use set_output for their meta-estimator.

Member

Should this case be in the slep?

Member Author

@thomasjpfan Jun 21, 2022


Global Configuration
....................

This SLEP proposes a global configuration flag that sets the output for
all transformers::

    import sklearn
    sklearn.set_config(transform_output="pandas")

The global default configuration is ``"default"`` where the estimator determines
the output container.

Implementation
--------------

A prototype implementation was created to showcase different use cases for this SLEP.
It can be seen in
`this rendered notebook <https://nbviewer.org/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__
and
`this interactive notebook <https://colab.research.google.com/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__.


Backward compatibility
----------------------

There are no backward compatibility concerns, because the ``set_output`` method
is a new API. Third-party estimators can opt in to the API by defining
``set_output``. Meta-estimators that define ``set_output`` to configure
their inner estimators should raise an error if any of the inner
estimators does not define ``set_output``.
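The meta-estimator rule can be sketched in plain Python. Everything below is
hypothetical: ``ToyTransformer``, ``LegacyTransformer``, and
``ToyMetaEstimator`` are illustrative names and are not scikit-learn classes.
Raising ``ValueError`` and validating all children before configuring any of
them are assumptions of this sketch, since the SLEP does not fix either choice.

```python
class ToyTransformer:
    """Hypothetical inner transformer that opts in by defining set_output."""

    def __init__(self):
        self.output_config = "default"

    def set_output(self, *, transform=None):
        # None means "leave the current configuration unchanged".
        if transform is not None:
            self.output_config = transform
        return self


class LegacyTransformer:
    """Hypothetical third-party transformer that does not define set_output."""


class ToyMetaEstimator:
    """Hypothetical meta-estimator that forwards set_output to its children."""

    def __init__(self, steps):
        self.steps = steps

    def set_output(self, *, transform=None):
        # Check every child first so that a failure leaves no child
        # half-configured (an assumption of this sketch).
        for step in self.steps:
            if not hasattr(step, "set_output"):
                raise ValueError(
                    f"{type(step).__name__} does not define set_output"
                )
        for step in self.steps:
            step.set_output(transform=transform)
        return self


# All children support the API, so the configuration is forwarded.
ok = ToyMetaEstimator([ToyTransformer(), ToyTransformer()])
ok.set_output(transform="pandas")

# One child lacks set_output, so the meta-estimator raises.
try:
    ToyMetaEstimator([ToyTransformer(), LegacyTransformer()]).set_output(
        transform="pandas"
    )
    raised = False
except ValueError:
    raised = True
```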

Alternatives
------------

Alternatives to this SLEP include:

1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__
   proposes that if the input is a DataFrame, then the output is a DataFrame.
2. :ref:`SLEP012 <slep_012>` proposes a custom scikit-learn container
   for dense and sparse data that contains feature names. This SLEP
   also proposes a custom container for sparse data, but pandas for dense data.
3. Prototype `#20100 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__
   showcases ``array_out="pandas"`` in ``transform``. This API
   is limited because it does not directly support fitting a pipeline whose
   steps require DataFrame input.

Discussion
----------

Issues and pull requests discussing pandas output include:
`#14315 <https://github.com/scikit-learn/scikit-learn/pull/14315>`__,
`#20100 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and
`#23001 <https://github.com/scikit-learn/scikit-learn/issues/23001>`__.

Future Extensions
-----------------

Sparse Data
...........

The Pandas DataFrame is not suitable to provide column names because it has
Member

Suggested change
The Pandas DataFrame is not suitable to provide column names because it has
The Pandas DataFrame is not suitable to provide column names for sparse data because it has

performance issues as shown in
`#16772 <https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097>`__.
A future extension to this SLEP is to have a ``"pandas_or_namedsparse"`` option.
This option will use a scikit-learn specific sparse container that subclasses SciPy's
sparse matrices. This sparse container includes the sparse data, feature names and
index. This enables pipelines with Vectorizers without performance issues::

    pipe = make_pipeline(
        CountVectorizer(),
        TfidfTransformer(),
        LogisticRegression(solver="liblinear")
    )
    pipe.set_output(transform="pandas_or_namedsparse")

    # feature names for logistic regression
    pipe[-1].feature_names_in_

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_