to_parquet fails when S3 is the destination #19134

Closed
@maximveksler

Description

import pandas as pd

df = pd.DataFrame({'field': [1,2,3]})
df.to_parquet("s3://pandas-test/test.parquet", engine='pyarrow')

Problem description

pandas uses s3fs to write files to S3, but the S3File objects are opened in rb (read-binary) mode, which cannot be used for writing.
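To illustrate why the open mode matters, here is a minimal local-filesystem analogy. It uses Python's built-in `open()` rather than s3fs, so it is only a sketch of the general behaviour and not a reproduction of the s3fs code path; the temporary file path and the placeholder bytes are assumptions made for the example.

```python
import os
import tempfile

# Local-filesystem analogy only: built-in open() stands in for s3fs here.
target = os.path.join(tempfile.mkdtemp(), "test.parquet")

# Read mode requires the file to exist already, so a fresh destination fails.
try:
    with open(target, "rb"):
        pass
    read_mode_error = None
except FileNotFoundError as exc:
    read_mode_error = exc

# Write mode creates the destination, which is what a writer needs.
with open(target, "wb") as handle:
    handle.write(b"PAR1")  # placeholder bytes, not a valid Parquet file

created = os.path.exists(target)
```

In this analogy the read-mode attempt fails because the destination does not exist yet, while the write-mode attempt creates it.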

This leads to several possible failures, each surfacing as an exception somewhere in the call chain.

Three different components are involved:

  • S3Filesystem,
  • pyarrow writer,
  • fastparquet reader & writer.

pyarrow - write attempt

FileNotFoundError or ValueError, depending on whether the file already exists in S3.

fastparquet - read attempt

Exception when attempting to concatenate a str and an S3File.

fastparquet - write attempt

Exception when attempting to open the path using default_open.

The pandas snippet at the top of this report produces:

C:\Users\maxim.veksler\source\DTank\venv\Scripts\python.exe C:/Users/maxim.veksler/source/DTank/DTank/par.py
Traceback (most recent call last):
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 396, in info
    kwargs, Bucket=bucket, Key=key, **self.req_kw)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 170, in _call_s3
    return method(**additional_kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\botocore\client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\botocore\client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\maxim.veksler\source\pandas\pandas\io\s3.py", line 25, in get_filepath_or_buffer
    filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer))
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 293, in open
    fill_cache=fill_cache, s3_additional_kwargs=kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 931, in __init__
    self.size = self.info()['Size']
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 940, in info
    return self.s3.info(self.path, **kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 402, in info
    raise FileNotFoundError(path)
FileNotFoundError: pandas-test/test.parquet

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 396, in info
    kwargs, Bucket=bucket, Key=key, **self.req_kw)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 170, in _call_s3
    return method(**additional_kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\botocore\client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\botocore\client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/maxim.veksler/source/DTank/DTank/par.py", line 4, in <module>
    df.to_parquet("s3://pandas-test/test.parquet", engine='pyarrow')
  File "c:\users\maxim.veksler\source\pandas\pandas\core\frame.py", line 1649, in to_parquet
    compression=compression, **kwargs)
  File "c:\users\maxim.veksler\source\pandas\pandas\io\parquet.py", line 227, in to_parquet
    return impl.write(df, path, compression=compression, **kwargs)
  File "c:\users\maxim.veksler\source\pandas\pandas\io\parquet.py", line 110, in write
    path, _, _ = get_filepath_or_buffer(path)
  File "c:\users\maxim.veksler\source\pandas\pandas\io\common.py", line 202, in get_filepath_or_buffer
    compression=compression)
  File "c:\users\maxim.veksler\source\pandas\pandas\io\s3.py", line 34, in get_filepath_or_buffer
    filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer))
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 293, in open
    fill_cache=fill_cache, s3_additional_kwargs=kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 931, in __init__
    self.size = self.info()['Size']
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 940, in info
    return self.s3.info(self.path, **kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 402, in info
    raise FileNotFoundError(path)
FileNotFoundError: pandas-test/test.parquet

Process finished with exit code 1
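The last frames of the traceback show `fs.open(...)` being called without a mode, after which S3File looks up the size of an object that does not exist yet. One possible workaround, sketched here under stated assumptions, is to write the payload through a handle that is opened explicitly in "wb" mode. `write_bytes` and `InMemoryFS` are hypothetical names introduced for this example; `InMemoryFS` is only a stand-in so the sketch runs without AWS credentials, and it is not how s3fs is implemented.

```python
import io


class InMemoryFS:
    """Hypothetical stand-in for a filesystem object such as s3fs.S3FileSystem."""

    def __init__(self):
        self.files = {}

    def open(self, path, mode="rb"):
        # Read mode only works for paths that already exist.
        if mode == "rb":
            if path not in self.files:
                raise FileNotFoundError(path)
            return io.BytesIO(self.files[path])
        # Write mode returns a buffer that is stored when it is closed.
        if mode == "wb":
            return _WriteHandle(self, path)
        raise ValueError("unsupported mode: " + mode)


class _WriteHandle(io.BytesIO):
    """Buffer that stores its contents in the parent filesystem on close."""

    def __init__(self, fs, path):
        super().__init__()
        self._fs = fs
        self._path = path

    def close(self):
        if not self.closed:
            self._fs.files[self._path] = self.getvalue()
        super().close()


def write_bytes(fs, path, payload):
    """Write payload through a handle opened explicitly in write-binary mode."""
    with fs.open(path, "wb") as handle:
        handle.write(payload)


fs = InMemoryFS()
write_bytes(fs, "pandas-test/test.parquet", b"placeholder parquet bytes")
with fs.open("pandas-test/test.parquet", "rb") as handle:
    round_trip = handle.read()
```

With s3fs installed and credentials configured, passing an `s3fs.S3FileSystem()` instance as `fs` should follow the same pattern, since its `open` method accepts a mode argument. How the Parquet bytes are produced (for example by serializing into an `io.BytesIO` buffer first) is left open here and has not been verified against pandas 0.22.0.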

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 2.8.0
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.13.3
scipy: None
pyarrow: 0.7.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: IO Data, IO Parquet