PDEP-9: Allow third-party projects to register pandas connectors with a standard API #51799

Merged (21 commits, Jun 13, 2023). Diff shown: changes from 3 of 21 commits.

Commits:
- `5dbdde9` PDEP-9: pandas I/O connectors as extensions (datapythonista, Feb 27, 2023)
- `730df18` Merge remote-tracking branch 'upstream/main' into pdep9 (datapythonista, Mar 5, 2023)
- `23b934f` Final draft to be proposed (datapythonista, Mar 5, 2023)
- `de3a17b` Merge remote-tracking branch 'upstream/main' into pdep9 (datapythonista, Mar 7, 2023)
- `da784ec` Address comments from code reviews, mostly by extending the proposal … (datapythonista, Mar 7, 2023)
- `f475350` Merge remote-tracking branch 'upstream/main' into pdep9 (datapythonista, Apr 6, 2023)
- `4a8ba96` Keep current I/O API and allow pandas as an interface (datapythonista, Apr 6, 2023)
- `6ad6a9d` Merge remote-tracking branch 'upstream/main' into pdep9 (datapythonista, Apr 7, 2023)
- `5cb47d9` Rejecting (datapythonista, Apr 7, 2023)
- `68ca3de` Reorder interfaces (datapythonista, Apr 7, 2023)
- `150d1d1` Update web/pandas/pdeps/0009-io-extensions.md (datapythonista, Apr 29, 2023)
- `6eea8a8` Use dataframe interchange protocol (datapythonista, May 30, 2023)
- `5665dc7` Merge branch 'pdep9' of github.com:datapythonista/pandas into pdep9 (datapythonista, May 30, 2023)
- `40ebacc` typo (datapythonista, May 30, 2023)
- `aed569f` Merge branch 'main' into pdep9 (datapythonista, May 30, 2023)
- `eb7c6f0` Make users load modules explicitly (datapythonista, May 30, 2023)
- `14a2f4a` Merge branch 'pdep9' of github.com:datapythonista/pandas into pdep9 (datapythonista, May 30, 2023)
- `8050853` Update web/pandas/pdeps/0009-io-extensions.md (datapythonista, Jun 7, 2023)
- `5cb23dd` Add limitations section (datapythonista, Jun 7, 2023)
- `2af8577` Merge remote-tracking branch 'upstream/main' into pdep9 (datapythonista, Jun 13, 2023)
- `ccb9674` Rejecting PDEP (datapythonista, Jun 13, 2023)
270 changes: 270 additions & 0 deletions in `web/pandas/pdeps/0009-io-extensions.md`
# PDEP-9: Allow third-party projects to register pandas connectors with a standard API

- Created: 5 March 2023
- Status: Draft
- Discussion: [#XXXX](https://github.com/pandas-dev/pandas/pull/XXXX)
- Author: [Marc Garcia](https://github.com/datapythonista)
- Revision: 1

## PDEP Summary

This document proposes that third-party projects implementing I/O or memory
connectors can register them using Python's entrypoint system, and make them
available to pandas users with a standard interface in a dedicated namespace
`DataFrame.io`. For example:

```python
import pandas

df = pandas.DataFrame.io.from_duckdb("SELECT * FROM 'dataset.parquet';")

df.io.to_hive(hive_conn, "hive_table")
```

## Current state

pandas supports importing and exporting data from different formats using
I/O connectors, currently implemented in `pandas/io`, as well as connectors
to in-memory structures, like Python structures or other library formats.

In many cases, those connectors wrap an existing Python library, while in
some others, pandas implements the logic to read and write to a particular
format.

In some cases, different engines exist for the same format. The API to use
those connectors is `pandas.read_<format>(engine='<engine-name>', ...)` to
import data, and `DataFrame.to_<format>(engine='<engine-name>', ...)` to
export data.

For objects exported to memory (like a Python dict), the API is the same as
for I/O: `DataFrame.to_<format>(...)`. For formats imported from objects in
memory, the API is different, using the `from_` prefix instead of `read_`:
`DataFrame.from_<format>(...)`.

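For illustration, the conventions above side by side (file names here are
placeholders):

```python
import pandas as pd

# I/O: `read_<format>` to import, `to_<format>` to export, with an
# optional engine selection.
df = pd.read_parquet("dataset.parquet", engine="pyarrow")
df.to_csv("dataset.csv")

# In-memory export uses the same `to_` prefix as I/O...
records = df.to_dict(orient="records")

# ...but in-memory import uses `from_` instead of `read_`.
df2 = pd.DataFrame.from_records(records)
```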

In some cases, the pandas API provides `DataFrame.to_*` methods that are not
used to export the data to a disk or memory object, but instead to transform
the index of a `DataFrame`: `DataFrame.to_period` and `DataFrame.to_timestamp`.
> **Review (Member):** Related: I've recently been thinking that DataFrame/Series methods that only operate on the index/columns might make sense to put in an accessor/namespace.


Dependencies of the connectors are not loaded by default, and will be
imported when the connector is used. If the dependencies are not installed,
an `ImportError` is raised.

```python
>>> pandas.read_gbq(query)
Traceback (most recent call last):
...
ImportError: Missing optional dependency 'pandas-gbq'.
pandas-gbq is required to load data from Google BigQuery.
See the docs: https://pandas-gbq.readthedocs.io.
Use pip or conda to install pandas-gbq.
```

### Supported formats

The list of formats can be found in the
[IO guide](https://pandas.pydata.org/docs/dev/user_guide/io.html).
A more detailed table, including in-memory objects and the I/O connectors in
the DataFrame styler, is presented next:

| Format | Reader | Writer |
|--------------|--------|--------|
| CSV | X | X |
| FWF | X | |
| JSON | X | X |
| HTML | X | X |
| LaTeX | | X |
| XML | X | X |
| Clipboard | X | X |
| Excel | X | X |
| HDF5 | X | X |
| Feather | X | X |
| Parquet | X | X |
| ORC | X | X |
| Stata | X | X |
| SAS | X | |
| SPSS | X | |
| Pickle | X | X |
| SQL | X | X |
| BigQuery | | |
| dict | X | X |
| records | X | X |
| string | | X |
| markdown | | X |
| xarray | | X |

At the time of writing this document, the `io/` module contains
close to 100,000 lines of Python, C and Cython code.

There are no objective criteria for when a format is included
in pandas; the list above is mostly the result of a developer
being interested in implementing the connectors for a certain
format in pandas.

The number of existing formats for data that can be processed with
pandas is constantly increasing, and it is difficult for pandas to keep up to
date even with popular formats. It could possibly make sense to have connectors
to PyArrow, PySpark, Iceberg, DuckDB, Hive, Polars, and many others.

At the same time, some of the formats are not frequently used, as shown in the
[2019 user survey](https://pandas.pydata.org/community/blog/2019-user-survey.html).
Those less popular formats include SPSS, SAS, Google BigQuery and
Stata. Note that only I/O formats (and not memory formats like records or xarray)
were included in the survey.

The maintenance cost of supporting all formats is not only in maintaining the
code and reviewing pull requests; it also has a significant cost in time
spent on CI systems installing dependencies, compiling code, running tests, etc.

In some cases, the main maintainers of some of the connectors are not part of
the pandas core development team, but people specialized in one of the formats
without commit rights.

## Proposal

While the current pandas approach has worked reasonably well, it is difficult
to find a stable solution that keeps the maintenance burden on pandas
manageable, while at the same time letting users interact with all the
formats and representations they are interested in, in an easy and intuitive way.

Third-party packages are already able to implement connectors to pandas, but
there are some limitations:

- Given the large number of formats supported by pandas itself, third-party
  connectors are likely seen as second-class citizens, not important enough
  to be used, or not well supported.
- There is no standard API for I/O connectors, so users need to learn
  each of them individually.
- Method chaining is not possible with third-party I/O connectors to export
  data, unless authors monkey-patch the `DataFrame` class, which should not
  be encouraged.

This document proposes to open the development of pandas I/O connectors to
third-party libraries in a standard way that overcomes those limitations.

### Proposal implementation

Implementing this proposal would not require major changes to pandas, and
the API defined next would be used.

A new `.io` accessor would be created for the `DataFrame` class, where all
I/O connector methods from third parties would be loaded. Nothing else would
live under that namespace.

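For context, pandas already provides a public mechanism for registering
accessors, so the namespace itself could be sketched as follows; the accessor
body is a hypothetical illustration, not the proposal's implementation:

```python
import pandas as pd
import pyarrow

@pd.api.extensions.register_dataframe_accessor("io")
class IOAccessor:
    """Hypothetical namespace where third-party connectors would be attached."""

    def __init__(self, pandas_obj: pd.DataFrame):
        self._df = pandas_obj

    # Example of what a registered exporter could look like once bound:
    def to_pyarrow(self) -> pyarrow.Table:
        return pyarrow.Table.from_pandas(self._df)
```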

Third-party packages would implement a
[setuptools entrypoint](https://setuptools.pypa.io/en/latest/userguide/entry_point.html#entry-points-for-plugins)
to define the connectors that they implement, under a group `dataframe.io`.

> **Review (Member):** Small nit: entrypoints are not (or no longer) setuptools-specific; a possible general reference: https://packaging.python.org/en/latest/specifications/entry-points/ or https://docs.python.org/3/library/importlib.metadata.html
>
> **Reply (datapythonista):** Good point, I'll refer to it simply as "entrypoint" or "Python entrypoint".

> **Review (Member):** Maybe `pandas.dataframe.io`, so it's scoped specifically to pandas?
>
> **Reply (datapythonista):** I was intentionally not making it specific to pandas. :) One of the nice things about this interface is that connectors could be reused by any software: other dataframe libraries, databases, plotting libraries, or your own system that works directly with Arrow/pandas. Imagine a connector is implemented for SAS files and returns an Arrow table. This could be reused even by a spreadsheet that wants to support users loading data from SAS.
>
> **Review (Member):** what you're describing is an extension API for pyarrow


For example, a hypothetical project `pandas_duckdb` implementing a `from_duckdb`
function could use `pyproject.toml` to define the following entry point:

```toml
[project.entry-points."dataframe.io"]
from_duckdb = "pandas_duckdb:from_duckdb"
```
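For illustration, the `pandas_duckdb` module referenced by that entrypoint
might be as small as the following sketch; the project and function are
hypothetical, and the Arrow return type anticipates the interface described
later in the proposal:

```python
# Hypothetical contents of pandas_duckdb/__init__.py
import duckdb
import pyarrow

def from_duckdb(query: str) -> pyarrow.Table:
    """Run a query on DuckDB's in-memory engine and return an Arrow table."""
    return duckdb.sql(query).arrow()
```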

On import of the pandas module, it would read the entrypoint registry for the
`dataframe.io` group, and would dynamically create methods in the `DataFrame.io`
namespace for them. Method names would only be allowed to start with `from_` or
`to_`; any other prefix would make pandas raise an exception. This would
guarantee a reasonably consistent API among third-party I/O connectors.

> **Author note (datapythonista):** This would make `import pandas` slower, and I'm not sure it can be done in a lazy way. For people using autocomplete when coding, for example, I think `DataFrame.io` should be populated before `__getattr__` is called. But maybe it can just be populated when either `__dir__` or `__getattr__` is called, and we get the best of both worlds? Or is that too complex?
>
> **Review (Member):** any idea how big this slowdown would be?
>
> **Reply (datapythonista):** I guess it can be minimal if the user has just a couple of connectors with lightweight dependencies installed. But there could be a situation where dozens of connectors exist and some of them are slow to load (maybe their dependencies have slow imports, or other reasons). Unless connectors implement lazy imports, the loading of all connectors, including their dependencies, would happen at `import pandas`.

> **Review (Member):** How would conflicts work if there are two packages that want to register `to_widget`?
>
> **Reply (datapythonista):** Very good point. A conflict would only happen if a user has both packages installed, so I guess in most cases it wouldn't be a problem, even if two authors decide to use the same name. I guess entrypoints don't require names to be unique, so I think it's up to us to decide. I'd personally raise an exception and let the user know that they installed two packages providing the same method, and that they should uninstall one to disambiguate.

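A minimal sketch of how this registration could work using the standard
library's entrypoint discovery. The accessor class and function name are
hypothetical, the `group=` keyword assumes Python 3.10 or later, and binding
`table` as `self` for `to_` methods (described below) is omitted:

```python
from importlib.metadata import entry_points

def register_io_connectors(io_accessor_cls):
    """Attach every function registered in the ``dataframe.io`` entrypoint
    group as a method of the (hypothetical) ``.io`` accessor class."""
    for ep in entry_points(group="dataframe.io"):
        # Enforce the naming convention described above.
        if not ep.name.startswith(("from_", "to_")):
            raise ValueError(
                f"connector {ep.name!r} must start with 'from_' or 'to_'"
            )
        # Fail loudly on duplicates, as suggested in the discussion above.
        if hasattr(io_accessor_cls, ep.name):
            raise ValueError(f"duplicate connector name: {ep.name!r}")
        setattr(io_accessor_cls, ep.name, ep.load())
```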


Connectors would use Apache Arrow as the only interface to load data from and
to pandas. This would prevent changes to the pandas API from affecting
connectors in any way. It would also simplify the development of connectors,
make testing them much more reliable and resistant to changes in pandas, and
allow connectors to be reused by other projects of the ecosystem.

> **Review (Member):** I get that you're excited about PyArrow support, but I am not on board with getting rid of NumPy support.
>
> **Reply (datapythonista):** I didn't phrase this properly; I'll rephrase, but let me elaborate here for now. The idea is that third-party connectors produce an Arrow object, so we don't need to worry about them using internal pandas APIs to build their data directly as pandas dataframes, APIs that we may later change, breaking their code. But the fact that they communicate with us by providing data as Arrow doesn't mean that when we then create the dataframe from it we need to use Arrow as the backend. I think for now it makes more sense to build the dataframes with the stable types as we've been doing (unless the user sets the option `dtype_backend` to `pyarrow`, as discussed in #51433). Sorry this isn't clear in the proposal. Does this make sense?

In case a `from_` method returned something other than a PyArrow table,
pandas would raise an exception. pandas would expect all `to_` methods to have
`table: pyarrow.Table` as the first parameter, and it would raise an exception
otherwise. The `table` parameter would be exposed as the `self` parameter in
pandas when the original function is registered as a method of the `.io`
accessor.

> **Author note (datapythonista):** @ritchie46, it would be great to get your feedback on the whole proposal, but in particular on this part. Assuming Polars wants to benefit from generic readers and writers that can work with any dataframe project, what do you think about a PyArrow object? Since Polars uses Arrow2 instead, would it make sense to make the readers/writers just expose the information needed to access the lower-level structs that the Arrow standard provides to exchange data? Not sure what the best option would be here.
>
> **ritchie46:** PyArrow objects are perfect. We adhere to the Arrow memory specification; the actual implementation of that specification doesn't really matter. So yeah, `pyarrow.Table` is good. 👍
>
> **Reply (datapythonista):** Thanks @ritchie46, that sounds great. Just to be clear, my concern wasn't that you couldn't read Arrow objects, but that you may not want to have PyArrow as a dependency to be able to use this API, and I think returning a PyArrow table forces that. PyArrow has more than 30 dependencies (at least in conda-forge), some of them not so small, like libarrow itself or libgoogle-cloud. I guess PyArrow may be the best way to start, but I wonder if a minimal Python Arrow library that can hold the metadata to access the Arrow data, and then be converted to the PyArrow or Arrow2/Polars structure, would be better for the long term.
>
> **paleolimbot:** Just a small plug for nanoarrow, which just had a 0.1 release, has R bindings, and could have Python bindings (e.g., apache/arrow-nanoarrow#117).
>
> **Reply (datapythonista):** This is amazing, thanks for sharing @paleolimbot. I think this is exactly what we need, way better than using PyArrow and requiring it as a dependency for this connector subsystem! Just in time; let me know if at some point you need feedback or some testing of the Python bindings when they're ready. :)

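As an illustration of that convention, a hypothetical DuckDB writer might look
like the sketch below; the function and its parameters are assumptions for the
example, not part of the proposal text:

```python
import duckdb
import pyarrow

def to_duckdb(table: pyarrow.Table, con: duckdb.DuckDBPyConnection, name: str) -> None:
    """Expose the Arrow table to a DuckDB connection as a queryable view.

    pandas would bind ``table`` like ``self``, so users would call
    ``df.io.to_duckdb(con, "my_table")``.
    """
    # DuckDB can register a pyarrow.Table directly as a named view.
    con.register(name, table)
```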

### Connector examples

This section lists specific examples of connectors that could immediately
benefit from this proposal.

**PyArrow** currently provides `Table.from_pandas` and `Table.to_pandas`.
With the new interface, it could also register `DataFrame.io.from_pyarrow`
and `DataFrame.io.to_pyarrow`, so pandas users can use the converters with
the interface they are used to, when PyArrow is installed in the environment.

> **Review (Member):** Need to clarify an earlier comment about allowing `DataFrame` as the interchange object both ways for this to make sense.

_Current API_:

```python
pyarrow.Table.from_pandas(table.to_pandas()
                               .query('my_col > 0'))
```

_Proposed API_:

```python
(pandas.DataFrame.io.from_pyarrow(table)
       .query('my_col > 0')
       .io.to_pyarrow())
```

> **Suggested change (@simonjayhawkins, Apr 12, 2023):**
>
> ```python
> (pandas.from_pyarrow(table)
>        .query('my_col > 0')
>        .to_pyarrow())
> ```

**Polars**, **Vaex** and other dataframe frameworks could benefit from
third-party projects that give their interoperability with pandas a more
explicit API.

_Current API_:

```python
polars.DataFrame(df.to_pandas()
                   .query('my_col > 0'))
```

> **Review (Member):** This example is so similar to the one above that I'm not sure it adds value. Can they be combined?

_Proposed API_:

```python
(pandas.DataFrame.io.from_polars(df)
       .query('my_col > 0')
       .io.to_polars())
```

> **Suggested change (@simonjayhawkins, Apr 12, 2023):**
>
> ```python
> (pandas.from_polars(df)
>        .query('my_col > 0')
>        .to_polars())
> ```

**DuckDB** provides an out-of-core engine able to push down predicates before
the data is loaded, making much better use of memory and significantly
decreasing loading time. pandas, because of its eager nature, is not able
to easily implement this itself, but it could benefit from a DuckDB loader.
Such a loader could already be implemented inside pandas, or as a third-party
extension with an arbitrary API, but this proposal would allow the creation
of a third-party extension with a standard and intuitive API:

```python
pandas.DataFrame.io.from_duckdb("""
    SELECT *
    FROM 'dataset.parquet'
    WHERE my_col > 0
""")
```

**Big data** systems, such as Hive, Iceberg, Presto, etc., could benefit
from a standard way to load data into pandas. Also, regular **SQL databases**
that can return their query results as Arrow would benefit from better
and faster connectors than the existing ones based on SQLAlchemy and
Python structures.

Any other format, including **domain-specific formats**, could easily
implement pandas connectors with a clear and intuitive API.

## Proposal extensions

The scope of the current proposal is limited to the addition of the
`DataFrame.io` namespace, and the automatic registration of functions defined
by third-party projects, if an entrypoint is defined.
> **Review (Member):** I agree on making this out of scope for the proposal itself, but I think it's also natural to discuss what the next steps may be when reasoning about this change. So I hope it's okay that the steps beyond this are still discussed here.
>
> **Reply (datapythonista):** Absolutely. I think it makes more sense to make a final decision on possible migrations at a later point, when we have more information about this new system. But I fully agree that it's worth discussing and having an idea of other people's expectations. To me, some of the current functions, such as `read_sas`, `read_spss`, `read_gbq`, `to_xarray` and maybe others, could be moved to this API as soon as the new system is in place, making the current API raise a `FutureWarning` to let users know about the change. For the more commonly used ones, I don't have a strong opinion. Maybe make them available via the new API fairly soon too, for users who want a more consistent API, but wait before adding any `FutureWarning`?


Any changes to the current I/O of pandas are out of scope for this proposal.

> **Review (Contributor):** Maybe be more explicit here that it is not just "current I/O", but includes methods like `to_json()` and `from_json()`, etc. So you could say that migrations of the current API to the `DataFrame.io` namespace are not part of this PDEP.
>
> **Reply (datapythonista):** Rephrased, I think it should be clearer now.

However, the next tasks can be considered for future work and proposals:

- Migrate I/O connectors currently implemented in pandas to the new interface.
  This would require a transition period where users would be warned that
  existing `pandas.read_*` functions may have been moved to `DataFrame.io.from_*`,
  and that the old API will stop working in a future version (a possible
  transition shim is sketched after this list).
- Move some of the existing I/O connectors out of the pandas repository and
  into their own third-party projects.
- Implement with the new interface some of the data structures that the
  `DataFrame` constructor accepts.

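A minimal sketch of what the transition period for a migrated function could
look like, assuming the first item above; the shim itself and `from_gbq` are
hypothetical:

```python
import warnings

import pandas

def read_gbq(query, *args, **kwargs):
    """Deprecated shim: forward the old top-level API to the new namespace."""
    warnings.warn(
        "pandas.read_gbq is deprecated; use DataFrame.io.from_gbq instead",
        FutureWarning,
        stacklevel=2,
    )
    return pandas.DataFrame.io.from_gbq(query, *args, **kwargs)
```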

## PDEP-9 History

- 5 March 2023: Initial version