read_json(lines=True) broken for s3 urls in Python 3 (v0.20.3) #17200

Closed
@alph486

Code Sample, a copy-pastable example if possible

Using Python

import pandas as pd
inputdf = pd.read_json(path_or_buf="s3://path/to/python-lines/file.json", lines=True)

The file is similar to:

{"url": "blah", "other": "blah"}
{"url": "blah", "other": "blah"}
{"url": "blah", "other": "blah"}

Problem description

When attempting to read a JSON Lines file into a DataFrame from an s3:// URL, the above code fails with:

2017-08-08 11:06:14,225 - image_rank_csv - ERROR - initial_value must be str or None, not bytes
Traceback (most recent call last):
  File "image_rank_csv.py", line 62, in run
    inputdf = pd.read_json(path_or_buf="s3://path/to/python-lines/file.json", lines=True)
  File "...env/lib/python3.6/site-packages/pandas/io/json/json.py", line 347, in read_json
    lines = list(StringIO(json.strip()))
TypeError: initial_value must be str or None, not bytes
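
The failing call in the traceback can be reproduced without S3 access, assuming the S3 handle returns bytes from read(). The following is a minimal sketch, not pandas code: the payload and the helper name `try_split_lines` are made up for illustration.

```python
from io import StringIO

# Illustrative payload: what a binary-mode read() of the file might return.
payload = b'{"url": "blah", "other": "blah"}\n{"url": "blah", "other": "blah"}\n'


def try_split_lines(json_data):
    """Mimic `list(StringIO(json.strip()))` from the traceback.

    Hypothetical helper: returns the list of lines on success,
    or the TypeError message as a string on failure.
    """
    try:
        return list(StringIO(json_data.strip()))
    except TypeError as exc:
        return str(exc)


bytes_result = try_split_lines(payload)                # StringIO rejects bytes
str_result = try_split_lines(payload.decode("utf-8"))  # str input splits into lines
```

With bytes input the helper returns the error message; with decoded text it returns one entry per line of the file.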

This works fine if the file is local, e.g.:

import pandas as pd
inputdf = pd.read_json(path_or_buf="/local/path/to/python-lines/file.json", lines=True)

Expected Output

The file should be read successfully, without the error above.

My current thinking is that when we get the file handle ( https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/io/json/json.py#L333 ), pandas delegates to s3fs, which documents that it only operates in binary mode. Therefore read() ( https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/io/json/json.py#L335 ) returns bytes rather than str, and passing those bytes to StringIO fails here: https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/io/json/json.py#L347 . Maybe it needs a different handler, such as BytesIO?
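
One possible way to handle this, sketched below, is to decode the bytes before wrapping them in StringIO. This is only an illustration of the idea, not the pandas implementation or an agreed fix: `as_text_buffer` is a hypothetical helper, and UTF-8 is an assumed encoding.

```python
import json
from io import StringIO


def as_text_buffer(raw, encoding="utf-8"):
    """Return a StringIO over `raw`, decoding first if it is bytes.

    Hypothetical helper, not part of pandas; the UTF-8 default is an assumption.
    """
    if isinstance(raw, bytes):
        raw = raw.decode(encoding)
    return StringIO(raw.strip())


# Illustrative payload standing in for the bytes read from S3.
raw = b'{"url": "blah", "other": "blah"}\n{"url": "blah", "other": "blah"}\n'

# Parse one JSON document per line, as lines=True is meant to do.
records = [json.loads(line) for line in as_text_buffer(raw)]
```

As a user-side workaround, the same decoded buffer could presumably be passed to pd.read_json(..., lines=True) in place of the s3:// URL, since read_json also accepts file-like objects.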

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.12.0
scipy: 0.19.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: None
s3fs: 0.1.2
pandas_gbq: None
pandas_datareader: None
