Compatibility with the new Arrow FileSystem implementations

**Background**

@martindurant as you know, last year we started developing new FileSystem implementations in the Apache Arrow project (https://issues.apache.org/jira/browse/ARROW-767; https://github.com/apache/arrow/pull/4225 is the PR with the initial abstract API, on which you gave feedback). Those developments have implications for users of fsspec-compatible filesystems, so as promised (with some delay) I am opening an issue to discuss how to handle them. Since fsspec currently holds the pyarrow compatibility layer, this repository seems the appropriate place.

To summarize:

- The "old" filesystems are available under `pyarrow.filesystem` ([docs](https://arrow.apache.org/docs/python/filesystems_deprecated.html)). We basically only have a `LocalFileSystem` and `pa.hdfs.HadoopFileSystem` as concrete implementations. In addition, there is the `DaskFileSystem`, which fsspec uses as a base class (more on that below).
- The "new" filesystems are available in the `pyarrow.fs` submodule ([docs](https://arrow.apache.org/docs/python/filesystems.html)). Those are Python wrappers for the C++ implementations, and there are currently concrete implementations for local, Hadoop and S3.

So an important difference is that the new filesystems are actual implementations in C++, and `pyarrow.fs` only provides wrappers for them. This is done for good reasons: those filesystems are a shared implementation used by many different users of the Arrow project (from C, C++, Python, R, Ruby, ...). They are also used in the Arrow Datasets project, which enables a number of new features in ParquetDataset reading (and makes it possible to query a Parquet dataset from R). The new filesystems have been an important part of moving the Arrow project forward.

But this also means that the filesystem that pyarrow functions expect is no longer an "interface" you can implement: it needs to be a filesystem that wraps a C++ filesystem.
(To be clear: all functionality that existed before still accepts the old filesystems for now; only the new `pyarrow.dataset` module already requires the new filesystems. Long term, though, we want to move consistently to the new filesystems.)

Concretely, this means that the `fsspec` feature of automatically providing compatibility with pyarrow will no longer work in the future:

> if installed, all file-system classes also subclass from `pyarrow.filesystem.FileSystem`, so can work with any arrow function expecting such an instance

This current compatibility means that, e.g., pyarrow's `parquet.read_table` and `ParquetDataset` work with any fsspec filesystem.
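For illustration, the compatibility described above looks roughly like this from the user's side. This is a minimal sketch: the helper name `read_parquet_via_fsspec` is hypothetical, and it assumes a pyarrow version whose `parquet.read_table` accepts a `filesystem` argument.

```python
def read_parquet_via_fsspec(path, fs):
    """Read a Parquet file through an fsspec-compatible filesystem.

    `fs` is any fsspec filesystem instance. Today this works because
    fsspec classes subclass the legacy pyarrow filesystem base class.
    """
    # Imported lazily so this module can be loaded without pyarrow installed.
    import pyarrow.parquet as pq

    # pyarrow opens `path` through the filesystem object it is given.
    return pq.read_table(path, filesystem=fs)
```

A user would call it as, for example, `read_parquet_via_fsspec("some-bucket/data.parquet", s3fs.S3FileSystem())`, where the bucket path is a placeholder.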

---

**Concrete issues** 

Ideally, we want to keep compatibility for the existing user base that uses fsspec-based filesystems with pyarrow functionality, while at the same time moving pyarrow's internals completely to our new filesystem implementation.
To achieve this, I currently see two options (not necessarily mutually exclusive, to be clear):

* Implement a "conversion" for all important fsspec-based filesystems to a `pyarrow.fs` filesystem object (e.g. convert a `s3fs.S3FileSystem` instance with all its configuration into an equivalent `pyarrow.fs.S3FileSystem`).
  * I suppose this is what we will do for a LocalFileSystem. For other filesystems, I don't know how faithful such conversions can always be (e.g. this might be tricky for S3: can an S3 filesystem be fully encoded/roundtripped in a URI?).
  * This option of course has the precondition that pyarrow actually supports the filesystem in question (support is currently limited, although we plan to expand it).
* Implement a "`pyarrow.fs` wrapper for `fsspec`": a C++ FileSystem that calls back into a Python object for each of its methods (where this Python object could then be any fsspec-compatible filesystem).
  * Such a "PythonCallBackFilesystem" would let pyarrow use fsspec-based filesystems without converting them. It would provide easy compatibility (easy for the user, to be clear ;)), at the cost of performance compared to pyarrow's native filesystems.
  * We could wrap incoming fsspec filesystems in pyarrow, or fsspec could use such a class as a base class when pyarrow is installed, similar to what is done now.
  * I have not yet investigated the feasibility of this option, but opened [ARROW-8766](https://issues.apache.org/jira/browse/ARROW-8766) for it.
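
To make the callback idea more concrete, here is a rough pure-Python sketch of the Python side of such a wrapper. Everything in it is hypothetical: `FsspecCallbackHandler` and its method names are illustrative and not an existing pyarrow API, and `InMemoryFS` is only a stand-in that mimics fsspec's `info` and `open` methods so the example runs on its own.

```python
import io


class FsspecCallbackHandler:
    """Hypothetical adapter that forwards filesystem calls to a Python object.

    `fs` can be any object with fsspec-style `info(path)` and
    `open(path, mode)` methods.
    """

    def __init__(self, fs):
        self.fs = fs

    def get_file_info(self, path):
        # Report a missing path as a "not_found" record instead of raising.
        try:
            info = self.fs.info(path)
        except FileNotFoundError:
            return {"path": path, "type": "not_found", "size": None}
        return {"path": path, "type": info["type"], "size": info.get("size")}

    def open_input_stream(self, path):
        # Every call crosses into Python, which is where the
        # performance cost relative to native filesystems comes from.
        return self.fs.open(path, "rb")


class InMemoryFS:
    """Tiny stand-in for an fsspec filesystem, used only for this example."""

    def __init__(self, files):
        self.files = dict(files)

    def info(self, path):
        if path not in self.files:
            raise FileNotFoundError(path)
        return {"name": path, "type": "file", "size": len(self.files[path])}

    def open(self, path, mode="rb"):
        if path not in self.files:
            raise FileNotFoundError(path)
        return io.BytesIO(self.files[path])


handler = FsspecCallbackHandler(InMemoryFS({"data/a.bin": b"abc"}))
```

In an actual implementation, the C++ FileSystem would hold a reference to such a handler and invoke these methods for each operation.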

There is also a third option: some concrete fsspec implementations could start using one of the new pyarrow filesystems as their base, which would make them directly usable in pyarrow. That is not my call to make, though; it is up to the individual projects. For HDFS in `fsspec`, that is probably what we want to do, since its implementation already depends on pyarrow.

As mentioned above, those options are not necessarily mutually exclusive. Which option is desirable or possible may depend on the specific filesystem, and the second option could also serve as a fallback for the first when pyarrow does not support the filesystem in question.
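
The "convert when possible, otherwise fall back to a callback wrapper" combination could be sketched as follows. This is only an illustration under stated assumptions: `NATIVE_CONVERTERS`, `register_converter`, `PythonCallbackWrapper` and `to_arrow_filesystem` are all hypothetical names, and the converter returns a placeholder string where a real one would construct a `pyarrow.fs` filesystem. The only fsspec detail relied on is that filesystem classes expose a `protocol` attribute (a string or a tuple of strings).

```python
# Hypothetical registry: fsspec protocol name -> function building a native filesystem.
NATIVE_CONVERTERS = {}


def register_converter(protocol):
    """Decorator that registers a conversion function for one protocol."""
    def decorator(func):
        NATIVE_CONVERTERS[protocol] = func
        return func
    return decorator


class PythonCallbackWrapper:
    """Placeholder for the callback option: keeps the Python filesystem as-is."""

    def __init__(self, fs):
        self.fs = fs


def to_arrow_filesystem(fs):
    """Prefer a native conversion; otherwise wrap the Python object."""
    protocols = fs.protocol
    if isinstance(protocols, str):
        protocols = (protocols,)
    for protocol in protocols:
        converter = NATIVE_CONVERTERS.get(protocol)
        if converter is not None:
            return converter(fs)  # first option: faithful conversion
    return PythonCallbackWrapper(fs)  # second option: generic fallback


@register_converter("file")
def _convert_local(fs):
    # A real converter would build a native local filesystem here.
    return "native-local"


class FakeLocal:
    """Stand-in for a local fsspec filesystem."""
    protocol = ("file", "local")


class FakeCustom:
    """Stand-in for a filesystem pyarrow has no native support for."""
    protocol = "myproto"
```

With this layout, adding native support for another filesystem only means registering one more converter; everything else keeps working through the wrapper.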

Thoughts on this? About the feasibility of the specific options? Other options?

---

Note: the above is about the use case of *"users can pass an fsspec filesystem to pyarrow"*. There is also the reverse use case of *"using pyarrow filesystems where fsspec-compliant filesystems are expected"*. For this, an `fsspec`-compliant wrapper around a `pyarrow.fs` filesystem is probably useful (and I suppose it could live in pyarrow); see https://issues.apache.org/jira/browse/ARROW-7102.
Such a wrapper could also provide a richer API for users who want one on top of the pyarrow filesystems, since the current `pyarrow.fs` filesystems are rather bare-bones.

cc @martindurant @TomAugspurger @pitrou

Compatibility with the new Arrow FileSystem implementations #295
