scikit-learn · amueller · Jul 17, 2022 · May 26, 2022 · Jun 9, 2022 · Jun 9, 2022
diff --git a/index.rst b/index.rst
@@ -21,6 +21,7 @@
 
     slep012/proposal
     slep013/proposal
+    slep018/proposal
 
 .. toctree::
     :maxdepth: 1

diff --git a/slep018/proposal.rst b/slep018/proposal.rst
@@ -0,0 +1,140 @@
+.. _slep_018:
+
+=======================================================
+SLEP018: Pandas Output for Transformers with set_output
+=======================================================
+
+:Author: Thomas J. Fan
+:Status: Draft
+:Type: Standards Track
+:Created: 2022-05-23
+
+Abstract
+--------
+
+This SLEP proposes a ``set_output`` method to configure the output container of
+scikit-learn transformers.
+
+Detailed description
+--------------------
+
+Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse matrices.
+This SLEP proposes adding a ``set_output`` method to configure a transformer to output
+pandas DataFrames::
+
+   scaler = StandardScaler().set_output(transform="pandas")
+   scaler.fit(X_df)
+
+   # X_trans_df is a pandas DataFrame
+   X_trans_df = scaler.transform(X_df)
+
+The index of the output DataFrame must match the index of the input. For this
+SLEP, ``set_output`` will only configure the output for dense data. If a
+transformer returns sparse data, then ``transform`` will raise an error if
+``set_output`` is set to "pandas". If a transformer always returns sparse data,
+then calling ``set_output(transform="pandas")`` may raise an error.
+
+For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the
+pipeline::
+
+   num_preprocessor = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
+   num_preprocessor.set_output(transform="pandas")
+
+   # X_trans_df is a pandas DataFrame
+   X_trans_df = num_preprocessor.fit_transform(X_df)
+
+Meta-estimators that support ``set_output`` are required to configure all estimators
+by calling ``set_output``.
+
+Global Configuration
+....................
+
+This SLEP proposes a global configuration flag that sets the output for
+all transformers::
+
+   import sklearn
+   sklearn.set_config(transform_output="pandas")
+
+The global default configuration is ``"default"`` where the estimator determines
+the output container.
+
+Implementation
+--------------
+
+A prototype implementation was created to showcase different use cases for this SLEP.
+It can be seen in
+`this rendered notebook <https://nbviewer.org/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__
+and
+`this interactive notebook <https://colab.research.google.com/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__.
+
+
+Backward compatibility
+----------------------
+
+There are no backward compatibility concerns, because the ``set_output`` method
+is a new API. Third party estimators can opt-in to the API by defining
+``set_output``. Meta-estimators that define ``set_output`` to configure
+their inner estimators with ``set_output`` should raise an error if any of the
+inner estimators do not define ``set_output``.
+
+Alternatives
+------------
+
+Alternatives to this SLEP include:
+
+1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__
+   proposes that if the input is a DataFrame then the output is a DataFrame.
+2. :ref:`SLEP012 <slep_012>` proposes a custom scikit-learn container
+   for dense and sparse data that contains feature names. This SLEP
+   also proposes a custom container for sparse data, but pandas for dense data.
+3. Prototype `#20100 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__
+   showcases ``array_out="pandas"`` in ``transform``. This API
+   is limited because it does not directly support fitting a pipeline where the
+   steps require DataFrame input.
+
+Discussion
+----------
+
+Issues and pull requests discussing pandas output include:
+`#14315 <https://github.com/scikit-learn/scikit-learn/pull/14315>`__,
+`#20100 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and
+`#23001 <https://github.com/scikit-learn/scikit-learn/issues/23001>`__.
+
+Future Extensions
+-----------------
+
+Sparse Data
+...........
+
+The Pandas DataFrame is not suitable to provide column names for sparse data because it has
+performance issues as shown in
+`#16772 <https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097>`__.
+A future extension to this SLEP is to have a ``"pandas_or_namedsparse"`` option.
+This option will use a scikit-learn specific sparse container that subclasses SciPy's
+sparse matrices. This sparse container includes the sparse data, feature names and
+index. This enables pipelines with Vectorizers without performance issues::
+
+   pipe = make_pipeline(
+      CountVectorizer(),
+      TfidfTransformer(),
+      LogisticRegression(solver="liblinear")
+   )
+   pipe.set_output(transform="pandas_or_namedsparse")
+
+   # feature names for logistic regression
+   pipe[-1].feature_names_in_
+
+References and Footnotes
+------------------------
+
+.. [1] Each SLEP must either be explicitly labeled as placed in the public
+   domain (see this SLEP as an example) or licensed under the `Open
+   Publication License`_.
+
+.. _Open Publication License: https://www.opencontent.org/openpub/
+
+
+Copyright
+---------
+
+This document has been placed in the public domain. [1]_
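
A note on the meta-estimator rule from the "Detailed description" and "Backward compatibility" sections: the forwarding behaviour can be sketched in plain Python. This is only an illustration under stated assumptions, not the scikit-learn implementation. The classes `ToyScaler`, `LegacyTransformer`, and `ToyPipeline`, the `output_container` attribute, and the choice of `ValueError` are all hypothetical names picked for this sketch; the SLEP itself only says that the meta-estimator configures all inner estimators and should error when one of them does not define `set_output`.

```python
class ToyScaler:
    """Hypothetical transformer that opts in by defining set_output."""

    def __init__(self):
        # "default" mirrors the SLEP's default: the estimator picks the container.
        self.output_container = "default"

    def set_output(self, transform=None):
        # Assumption of this sketch: None leaves the configuration unchanged.
        if transform is not None:
            self.output_container = transform
        return self


class LegacyTransformer:
    """Hypothetical third-party transformer that does not define set_output."""


class ToyPipeline:
    """Hypothetical meta-estimator that forwards set_output to every step."""

    def __init__(self, *steps):
        self.steps = list(steps)

    def set_output(self, transform=None):
        # Validate every step first so a failure leaves no step half-configured.
        for step in self.steps:
            if not hasattr(step, "set_output"):
                raise ValueError(
                    f"{type(step).__name__} does not define set_output"
                )
        for step in self.steps:
            step.set_output(transform=transform)
        return self


# All steps define set_output, so every step is configured.
pipe = ToyPipeline(ToyScaler(), ToyScaler()).set_output(transform="pandas")
configured = [step.output_container for step in pipe.steps]

# One step lacks set_output, so the meta-estimator raises.
try:
    ToyPipeline(ToyScaler(), LegacyTransformer()).set_output(transform="pandas")
    raised = False
except ValueError:
    raised = True
```

Checking all steps before configuring any of them is a design choice of the sketch; the SLEP does not specify whether a failing call may leave earlier steps already configured.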