read_json different behaviour when file is os path or file url #27135

Closed
@gtmaskall

Description

Code Sample, a copy-pastable example if possible

$ pwd
/home/guy/tmp
$ echo -e '{"a": 1, "b": 2}\n{"a": 3, "b": 4}' > test.json
$ cat test.json
{"a": 1, "b": 2}
{"a": 3, "b": 4}
[ins] In [74]: ospath = '/home/guy/tmp/test.json'                                                                       

[ins] In [75]: fileurl = 'file://localhost/home/guy/tmp/test.json'                                                      

[nav] In [76]: import pandas as pd                                                                                      

[ins] In [77]: pd.read_json(ospath, lines=True)                                                                         
Out[77]: 
   a  b
0  1  2
1  3  4

[ins] In [78]: pd.read_json(fileurl, lines=True)                                                                        
Out[78]: 
   a  b
0  1  2
1  3  4

[ins] In [79]: reader = pd.read_json(ospath, lines=True, chunksize=1)                                                   

[ins] In [80]: for chunk in reader: 
          ...:     print(chunk) 
          ...:                                                                                                          
   a  b
0  1  2
   a  b
1  3  4

[ins] In [81]: reader = pd.read_json(fileurl, lines=True, chunksize=1)

Problem description

Create a very simple two-line JSON file and specify its location in two ways: as an OS path and as a file URL. Both work when read_json() reads the file in one go. When chunksize is used to create a reader, only the OS path works. Iterating over the reader created from the file URL raises TypeError: sequence item 0: expected str instance, bytes found

The read_json doc says:
path_or_buf : a valid JSON string or file-like, default: None
The string could be a URL. Valid URL schemes include http, ftp, s3,
gcs, and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/table.json

This leads to the expectation that a file URL should behave the same as a simple OS path in all use cases, including chunked reading.

I think this is the root cause of the problem described under issue #27022
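
Until the two inputs behave the same, one possible workaround is to convert the file URL to an OS path before calling read_json, since the OS path works with chunksize in the session above. The following is a minimal sketch using only the standard library. The helper name file_url_to_path is hypothetical (it is not part of pandas), and the sketch assumes a POSIX system and a URL whose host is empty or localhost:

```python
from urllib.parse import urlparse
from urllib.request import url2pathname


def file_url_to_path(url):
    """Convert a local file:// URL to an OS path (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError(f"not a file URL: {url!r}")
    # Assumption: only local files are supported, so reject other hosts.
    if parsed.netloc not in ("", "localhost"):
        raise ValueError(f"non-local host in file URL: {parsed.netloc!r}")
    # url2pathname also decodes percent-escapes such as %20.
    return url2pathname(parsed.path)


fileurl = "file://localhost/home/guy/tmp/test.json"
ospath = file_url_to_path(fileurl)
print(ospath)
```

The resulting path can then be passed exactly as in In [79], i.e. pd.read_json(ospath, lines=True, chunksize=1).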

# output when using fileurl:
TypeError                                 Traceback (most recent call last)
<ipython-input-82-605ab8a466fd> in <module>
----> 1 for chunk in reader:
      2     print(chunk)
      3 

~/anaconda3/envs/test_latest_pandas_json/lib/python3.7/site-packages/pandas/io/json/json.py in __next__(self)
    579         lines = list(islice(self.data, self.chunksize))
    580         if lines:
--> 581             lines_json = self._combine_lines(lines)
    582             obj = self._get_object_parser(lines_json)
    583 

~/anaconda3/envs/test_latest_pandas_json/lib/python3.7/site-packages/pandas/io/json/json.py in _combine_lines(self, lines)
    520         """
    521         lines = filter(None, map(lambda x: x.strip(), lines))
--> 522         return '[' + ','.join(lines) + ']'
    523 
    524     def read(self):

TypeError: sequence item 0: expected str instance, bytes found
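
The traceback suggests that, for the file URL, the lines reaching _combine_lines are bytes rather than str, so ','.join(lines) fails. The sketch below reproduces that failure in plain Python, without pandas. The function combine_lines is a standalone copy of the two lines shown in the traceback, and the sample data and the UTF-8 decoding step are assumptions made for illustration, not the pandas fix:

```python
import json


def combine_lines(lines):
    # Same logic as the two lines of _combine_lines shown in the traceback.
    lines = filter(None, map(lambda x: x.strip(), lines))
    return '[' + ','.join(lines) + ']'


# Assumption: the file URL code path yields undecoded bytes lines.
raw_lines = [b'{"a": 1, "b": 2}\n', b'{"a": 3, "b": 4}\n']

try:
    combine_lines(raw_lines)
except TypeError as exc:
    error_message = str(exc)
    print("bytes input fails:", error_message)

# Decoding each line to str first (UTF-8 assumed) lets the join succeed.
decoded_lines = [line.decode("utf-8") for line in raw_lines]
combined = combine_lines(decoded_lines)
records = json.loads(combined)
print(records)
```

If the assumption holds, decoding the lines before they are combined would make the file URL case behave like the OS path case.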

[ins] In [83]: pd.__version__
Out[83]: '0.24.2'

Expected Output

read_json should behave the same whether the file is given as an OS path or as a file URL, including when chunksize is set.

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-20-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Labels: Bug, IO JSON (read_json, to_json, json_normalize)
