Description
Code Sample, a copy-pastable example if possible
Using Python
import pandas as pd
inputdf = pd.read_json(path_or_buf="s3://path/to/python-lines/file.json", lines=True)
The file is similar to:
{"url": "blah", "other": "blah"}
{"url": "blah", "other": "blah"}
{"url": "blah", "other": "blah"}
Problem description
When reading a line-delimited JSON file (lines=True) into a DataFrame over the s3 protocol, the code above fails with:
2017-08-08 11:06:14,225 - image_rank_csv - ERROR - initial_value must be str or None, not bytes
Traceback (most recent call last):
File "image_rank_csv.py", line 62, in run
inputdf = pd.read_json(path_or_buf="s3://path/to/python-lines/file.json", lines=True)
File "...env/lib/python3.6/site-packages/pandas/io/json/json.py", line 347, in read_json
lines = list(StringIO(json.strip()))
TypeError: initial_value must be str or None, not bytes
This works fine if the file is local, e.g.:
import pandas as pd
inputdf = pd.read_json(path_or_buf="/local/path/to/python-lines/file.json", lines=True)
Expected Output
I expect the file to be read into a DataFrame without the error above.
My current thinking is that when we get the file handle (https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/io/json/json.py#L333), you delegate to s3fs, which documents that it only operates in binary mode. Therefore, when you read() from it (https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/io/json/json.py#L335), the result is bytes rather than str, so passing it to StringIO fails here: https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/io/json/json.py#L347 . Maybe it needs a different handler, such as BytesIO?
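To illustrate the suspected cause without needing S3 access, here is a minimal sketch using only the standard library. The name fake_s3_handle is hypothetical and simply stands in for the binary-mode file object that s3fs returns. The UTF-8 decoding step is my assumption about one possible fix, not what pandas currently does.

```python
import io
import json

# Hypothetical stand-in for the binary-mode file object returned for an s3:// path.
fake_s3_handle = io.BytesIO(
    b'{"url": "blah", "other": "blah"}\n'
    b'{"url": "blah", "other": "blah"}\n'
)

payload = fake_s3_handle.read()  # bytes, because the handle is binary

# Passing bytes straight to StringIO raises TypeError on Python 3,
# which matches the traceback above.
try:
    io.StringIO(payload)
    string_io_failed = False
except TypeError:
    string_io_failed = True

# Decoding first (UTF-8 is an assumption) lets the line split proceed.
text = payload.decode("utf-8")
lines = list(io.StringIO(text.strip()))
records = [json.loads(line) for line in lines]

print(string_io_failed)
print(len(records))
```

If this is right, decoding the bytes before the StringIO call (or splitting lines from a BytesIO and decoding each one) would probably be enough to make the s3 path behave like the local one.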
Output of pd.show_versions()
<details>
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.12.0
scipy: 0.19.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: None
s3fs: 0.1.2
pandas_gbq: None
pandas_datareader: None
</details>