Skip to content

Compatibility with the new Arrow FileSystem implementations #295

Closed
@jorisvandenbossche

Description

@jorisvandenbossche

Background

@martindurant as you know, last year we started developing new FileSystem implementations in the Apache Arrow project (https://issues.apache.org/jira/browse/ARROW-767, apache/arrow#4225 is the PR with the initial abstract API, on which you gave feedback). Those developments have some implications for users using fsspec-compatible filesystems, and so as promised, with some delay, opening an issue here to discuss how to handle those implications (and since fsspec currently holds the pyarrow-compatibiliy layer, opening an issue here seems appropriate).

To summarize:

  • The "old" filesystems are available under pyarrow.filesystems (docs). We basically only have a LocalFileSystem and pa.hdfs.HadoopFileSystem as concrete implementations. And in addition, there is the DaskFileSystem which is used by fsspec as base class, see more on that below.
  • The "new" filesystems are available in the pyarrow.fs submodule (docs). Those are python wrappers for the C++ implementations, and currently there are already concrete implementations for local, Hadoop and S3.

So an important difference is that the new filesystems are actual implementations in C++, and pyarrow.fs is only providing wrappers for those. This is done for good reasons: those filesystem are a shared implementation and are used by many different users of the Arrow project (and from C, C++, Python, R, Ruby, ..). Further, those filesystems are for example used in the Arrow Datasets project, which enables a bunch of new features in the ParquetDataset reading (and also enabled that you can now actually query a Parquet dataset from R). Those new filesystems have been an important part in moving the Arrow project forward.

But this also means that the filesystem that pyarrow functions expect is no longer an "interface" you can implement, but it actually needs a filesystem that wraps a C++ filesystem.
(to be clear: all functionality that already existed before is right now still accepting the old filesystems, only the new pyarrow.dataset module already requires the new filesystems. But long term, we want to move consistently to the new filesystems).

Concretely, this means that the feature of fsspec to automatically provide compatibility with payrrow will no longer work in the future:

if installed, all file-system classes also subclass from pyarrow.filesystem.FileSystem, so can work with any arrow function expecting such an instance

This current compatibility means that eg pyarrow's parquet.read_table/ParquetDataset work with any fsspec filesystem.


Concrete issues

Ideally, we want to keep compatibility for the existing user base that is using fsspec-based filesystems with pyarrow functionality, while at the same time internally in pyarrow moving completely to our new filesytem implementation.
To achieve this, I currently see two (not necessarily mutually exclusive, to be clear) options:

  • Implement a "conversion" for all important fsspec-based filesystems to a pyarrow.fs filesystem object (eg convert a s3fs.S3FileSystem instance with all its configuration into an equivalent pyarrow.fs.S3FileSystem).
    • I suppose this is what we will do for a LocalFileSystem. But for other filesystems, I don't know how faithful such conversions always can be (eg this might be tricky for things like S3? Can an S3 filesytem be fully encoded/roundtripped in an URI?)
    • This option of course has the pre-condition that we actually support the filesystem in question in pyarrow (which is currently limited, although we plan to expand this).
  • Implement a "pyarrow.fs wrapper for fsspec", a C++ FileSystem that calls back into a python object for each of its methods (where this python object then could be any fsspec-compatible filesystem).
    • Such a "PythonCallBackFilesystem" would allow that pyarrow can actually use the fsspec-based filesystems without converting them. It would provide easy compatibility (easy for the user, to be clear ;)), at the cost of performance (compared to pyarrow's native filesystems)
    • We could wrap incoming fsspec-filesystems in pyarrow, or fsspec could use such a class as baseclass when pyarrow is installed similarly as is done now.
    • I didn't yet investigate the feasibility of this option, but opened ARROW-8766 for it.

There is actually also a third option, and that is that some concrete fsspec implementations start to use one of the new pyarrow filesystems as its base, and then it would also be directly usable in pyarrow (but that's not my call to make, but up to those individual projects to be clear. For HDFS in fsspec that's probably what we want to do, though, since its implementation already depends on pyarrow).

As mentioned above, those options are not necessarily mutually exclusive. It might depend on the specific filesystem which option is desirable / possible (and the second option could also be a fallback for the first option if pyarrow doesn't support the specific file system).

Thoughts on this? About the feasibility of the specific options? Other options?


Note: the above is about the use case of "users can pass an fsspec filesystem to pyarrow". There is also the use case the other way around of "using pyarrow filesystems where fsspec-compliant filesystems are expected". For this, an fsspec-compliant wrapper around a pyarrow.fs filesystem is probably useful (and I suppose this is something that could live in pyarrow). For this there is https://issues.apache.org/jira/browse/ARROW-7102
Such a wrapper could also provide a richer API for those users who want that with the pyarrow filesystems (since the current pyarrow.fs filesystems are rather bare-bones)

cc @martindurant @TomAugspurger @pitrou

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions