ENH: add fsspec support #34266
Changes from 18 commits
```diff
@@ -267,8 +267,9 @@ SQLAlchemy 1.1.4 SQL support for databases other tha
 SciPy 0.19.0 Miscellaneous statistical functions
 XLsxWriter 0.9.8 Excel writing
 blosc Compression for HDF5
+fsspec 0.7.4 File operations handling
 fastparquet 0.3.2 Parquet reading / writing
-gcsfs 0.2.2 Google Cloud Storage access
+gcsfs 0.6.0 Google Cloud Storage access
 html5lib HTML parser for read_html (see :ref:`note <optional_html>`)
 lxml 3.8.0 HTML parser for read_html (see :ref:`note <optional_html>`)
 matplotlib 2.2.2 Visualization
```

Review thread on the `fsspec 0.7.4` line:

Reviewer: I would clarify the note a bit further, as now it sounds like you need this for any kind of file operations; something like "for remote filesystems"?

Reply: Hm, it handles other things too, just not "local" or "http(s)". That would make for a long comment.

Reviewer: Then something like "filesystems other than local or http(s)" isn't too long, I would say.
```diff
@@ -282,7 +283,7 @@ pyreadstat SPSS files (.sav) reading
 pytables 3.4.3 HDF5 reading / writing
 pyxlsb 1.0.6 Reading for xlsb files
 qtpy Clipboard I/O
-s3fs 0.3.0 Amazon S3 access
+s3fs 0.4.0 Amazon S3 access
 tabulate 0.8.3 Printing in Markdown-friendly format (see `tabulate`_)
 xarray 0.8.2 pandas-like API for N-dimensional data
 xclip Clipboard I/O on linux
```
```diff
@@ -31,6 +31,7 @@

 from pandas._typing import FilePathOrBuffer
 from pandas.compat import _get_lzma_file, _import_lzma
+from pandas.compat._optional import import_optional_dependency

 from pandas.core.dtypes.common import is_file_like
```
```diff
@@ -126,20 +127,6 @@ def stringify_path(
     return _expand_user(filepath_or_buffer)


-def is_s3_url(url) -> bool:
-    """Check for an s3, s3n, or s3a url"""
-    if not isinstance(url, str):
-        return False
-    return parse_url(url).scheme in ["s3", "s3n", "s3a"]
-
-
-def is_gcs_url(url) -> bool:
-    """Check for a gcs url"""
-    if not isinstance(url, str):
-        return False
-    return parse_url(url).scheme in ["gcs", "gs"]
-
-
 def urlopen(*args, **kwargs):
     """
     Lazy-import wrapper for stdlib urlopen, as that imports a big chunk of
```
```diff
@@ -150,38 +137,20 @@ def urlopen(*args, **kwargs):
     return urllib.request.urlopen(*args, **kwargs)


-def get_fs_for_path(filepath: str):
+def is_fsspec_url(url: FilePathOrBuffer) -> bool:
     """
-    Get appropriate filesystem given a filepath.
-    Supports s3fs, gcs and local file system.
-
-    Parameters
-    ----------
-    filepath : str
-        File path. e.g s3://bucket/object, /local/path, gcs://pandas/obj
-
-    Returns
-    -------
-    s3fs.S3FileSystem, gcsfs.GCSFileSystem, None
-        Appropriate FileSystem to use. None for local filesystem.
+    Returns true if fsspec is installed and the given URL looks like
+    something fsspec can handle
     """
-    if is_s3_url(filepath):
-        from pandas.io import s3
-
-        return s3.get_fs()
-    elif is_gcs_url(filepath):
-        from pandas.io import gcs
-
-        return gcs.get_fs()
-    else:
-        return None
+    return isinstance(url, str) and ("::" in url or "://" in url)


 def get_filepath_or_buffer(
     filepath_or_buffer: FilePathOrBuffer,
     encoding: Optional[str] = None,
     compression: Optional[str] = None,
     mode: Optional[str] = None,
+    **storage_options: Dict[str, Any],
 ):
     """
     If the filepath_or_buffer is a url, translate and return the buffer.
```

Review thread on the `is_fsspec_url` docstring: jreback marked this conversation as resolved.

Review thread on the `"::"` check:

Reviewer: What's the `"::"` for?

Reply: That's for compound URLs, e.g., to enable local caching like `"simplecache::s3://bucket/path"` (or indeed via dask workers).

Reviewer: Is there ever one of those that doesn't also include a `"://"`?

Reply: We do special-case to assume `"file://"` where there is no protocol, but happy to drop that possibility in this use case.

Reviewer: You're saying that something like [a compound URL without `"://"`] would work? I think for now I'd prefer to avoid that.

Reply: ok

Review thread on `**storage_options`:

Reviewer: Do we want callers to pass a dict or collect additional kwargs here? The docstring implies the former.

Reply: We are not passing anything at all yet, so I don't mind whether it's kwargs or a dict keyword. I imagine in a user function like read_csv, there would be a … (martindurant marked this conversation as resolved.)
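The `is_fsspec_url` check above is a plain string test, so its behavior is easy to demonstrate in isolation. The sketch below copies the one-line predicate from the diff (it is not imported from pandas) to show which inputs pass:

```python
from typing import Any


def is_fsspec_url(url: Any) -> bool:
    """True if the URL looks like something fsspec can handle: either a
    protocol URL ("s3://...") or a chained/compound URL
    ("simplecache::s3://..."). Copied from the PR diff for illustration."""
    return isinstance(url, str) and ("::" in url or "://" in url)


print(is_fsspec_url("s3://bucket/key.csv"))           # True
print(is_fsspec_url("simplecache::s3://bucket/key"))  # True (compound URL)
print(is_fsspec_url("/local/path.csv"))               # False: no protocol
print(is_fsspec_url(b"s3://not-a-str"))               # False: not a str
```

Note that the predicate deliberately does not check that fsspec is installed; the import is deferred to `import_optional_dependency` at the point of use.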
```diff
@@ -194,6 +163,8 @@ def get_filepath_or_buffer(
     compression : {{'gzip', 'bz2', 'zip', 'xz', None}}, optional
     encoding : the encoding to use to decode bytes, default is 'utf-8'
     mode : str, optional
+    storage_options: dict
+        passed on to fsspec, if using it; this is not yet accessed by the public API

     Returns
     -------
```
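The kwargs-vs-dict question from the review thread comes down to how `**storage_options` behaves: extra keyword arguments are collected into a single dict that can be forwarded wholesale. A minimal sketch, assuming a hypothetical helper (the name and return shape are mine, not pandas internals):

```python
def collect_storage_options(filepath_or_buffer, mode=None, **storage_options):
    """Collect extra keyword arguments into a dict, the way the new
    get_filepath_or_buffer signature does. In the real code the dict is
    forwarded as fsspec.open(..., **storage_options)."""
    return {
        "path": filepath_or_buffer,
        "mode": mode or "rb",
        "storage_options": storage_options,
    }


opts = collect_storage_options("s3://bucket/key.csv", anon=True, use_ssl=False)
print(opts["storage_options"])  # {'anon': True, 'use_ssl': False}
```

Either calling convention (kwargs or an explicit `storage_options=` dict) ends up as the same dict at the `fsspec.open` call site, which is why the thread concludes it makes little difference until the public API exposes it.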
```diff
@@ -204,6 +175,7 @@ def get_filepath_or_buffer(
     filepath_or_buffer = stringify_path(filepath_or_buffer)

     if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer):
+        # TODO: fsspec can also handle HTTP via requests, but leaving this unchanged
         req = urlopen(filepath_or_buffer)
         content_encoding = req.headers.get("Content-Encoding", None)
         if content_encoding == "gzip":
```
```diff
@@ -213,19 +185,23 @@ def get_filepath_or_buffer(
             req.close()
         return reader, encoding, compression, True

-    if is_s3_url(filepath_or_buffer):
-        from pandas.io import s3
-
-        return s3.get_filepath_or_buffer(
-            filepath_or_buffer, encoding=encoding, compression=compression, mode=mode
-        )
-
-    if is_gcs_url(filepath_or_buffer):
-        from pandas.io import gcs
-
-        return gcs.get_filepath_or_buffer(
-            filepath_or_buffer, encoding=encoding, compression=compression, mode=mode
-        )
+    if is_fsspec_url(filepath_or_buffer):
+        assert isinstance(
+            filepath_or_buffer, str
+        )  # just to appease mypy for this branch
+        # two special-case s3-like protocols; these have special meaning in Hadoop,
+        # but are equivalent to just "s3" from fsspec's point of view
+        # cc #11071
+        if filepath_or_buffer.startswith("s3a://"):
+            filepath_or_buffer = filepath_or_buffer.replace("s3a://", "s3://")
+        if filepath_or_buffer.startswith("s3n://"):
+            filepath_or_buffer = filepath_or_buffer.replace("s3n://", "s3://")
+        fsspec = import_optional_dependency("fsspec")
+
+        file_obj = fsspec.open(
+            filepath_or_buffer, mode=mode or "rb", **storage_options
+        ).open()
+        return file_obj, encoding, compression, True

     if isinstance(filepath_or_buffer, (str, bytes, mmap.mmap)):
         return _expand_user(filepath_or_buffer), None, compression, False
```
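The s3a/s3n special-casing in the branch above is a simple prefix rewrite. A standalone sketch (the function name is mine; the rewriting rule matches the diff, which treats Hadoop's `s3a://` and `s3n://` schemes as plain `s3://` for fsspec):

```python
def normalize_s3_scheme(url: str) -> str:
    """Rewrite Hadoop-specific s3a:// and s3n:// prefixes to s3://,
    since fsspec treats all three as the same S3 protocol."""
    for hadoop_scheme in ("s3a://", "s3n://"):
        if url.startswith(hadoop_scheme):
            return "s3://" + url[len(hadoop_scheme):]
    return url


print(normalize_s3_scheme("s3a://bucket/data.parquet"))  # s3://bucket/data.parquet
print(normalize_s3_scheme("s3n://bucket/data.parquet"))  # s3://bucket/data.parquet
print(normalize_s3_scheme("gcs://bucket/obj"))           # gcs://bucket/obj (unchanged)
```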
This file was deleted.