Description
Code Sample, a copy-pastable example if possible
pwd
/home/guy/tmp
echo -e '{"a": 1, "b": 2}\n{"a": 3, "b": 4}' > test.json
cat test.json
{"a": 1, "b": 2}
{"a": 3, "b": 4}
[ins] In [74]: ospath = '/home/guy/tmp/test.json'
[ins] In [75]: fileurl = 'file://localhost/home/guy/tmp/test.json'
[nav] In [76]: import pandas as pd
[ins] In [77]: pd.read_json(ospath, lines=True)
Out[77]:
   a  b
0  1  2
1  3  4
[ins] In [78]: pd.read_json(fileurl, lines=True)
Out[78]:
   a  b
0  1  2
1  3  4
[ins] In [79]: reader = pd.read_json(ospath, lines=True, chunksize=1)
[ins] In [80]: for chunk in reader:
          ...:     print(chunk)
          ...:
   a  b
0  1  2
   a  b
1  3  4
[ins] In [81]: reader = pd.read_json(fileurl, lines=True, chunksize=1)
Problem description
Create a very simple two-line JSON file and specify its location in two ways: as an OS path and as a file URL. Both let read_json() read the JSON when the file is read in one go. When chunksize is used to create a reader, only the OS path works. Iterating over a reader created from the file URL raises TypeError: sequence item 0: expected str instance, bytes found
The read_json doc says:
path_or_buf : a valid JSON string or file-like, default: None
The string could be a URL. Valid URL schemes include http, ftp, s3,
gcs, and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/table.json
This leads to the expectation that a file URL should behave the same as a simple OS path under all use cases.
I think this is the root cause of the problem described in issue #27022.
# output when using fileurl:
TypeError Traceback (most recent call last)
<ipython-input-82-605ab8a466fd> in <module>
----> 1 for chunk in reader:
2 print(chunk)
3
~/anaconda3/envs/test_latest_pandas_json/lib/python3.7/site-packages/pandas/io/json/json.py in __next__(self)
579 lines = list(islice(self.data, self.chunksize))
580 if lines:
--> 581 lines_json = self._combine_lines(lines)
582 obj = self._get_object_parser(lines_json)
583
~/anaconda3/envs/test_latest_pandas_json/lib/python3.7/site-packages/pandas/io/json/json.py in _combine_lines(self, lines)
520 """
521 lines = filter(None, map(lambda x: x.strip(), lines))
--> 522 return '[' + ','.join(lines) + ']'
523
524 def read(self):
TypeError: sequence item 0: expected str instance, bytes found
[ins] In [83]: pd.__version__
Out[83]: '0.24.2'
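The TypeError in the traceback above can be reproduced without pandas. The sketch below is not the pandas source: combine_lines is a hypothetical stand-in that mirrors the two lines of _combine_lines shown in the traceback, and combine_lines_decoded is a hypothetical variant that decodes bytes first. It assumes, as the error message suggests, that the reader ends up iterating over bytes lines when given a file URL.

```python
def combine_lines(lines):
    """Mirror the join from the traceback: strip lines, drop empties, wrap in a JSON array."""
    stripped = filter(None, (line.strip() for line in lines))
    return '[' + ','.join(stripped) + ']'


def combine_lines_decoded(lines, encoding='utf-8'):
    """Hypothetical variant: decode bytes lines to str before joining."""
    decoded = (
        line.decode(encoding) if isinstance(line, bytes) else line
        for line in lines
    )
    return combine_lines(decoded)


str_lines = ['{"a": 1, "b": 2}\n']     # what a text-mode file yields
bytes_lines = [b'{"a": 1, "b": 2}\n']  # what a binary stream yields (assumed for file URLs)

print(combine_lines(str_lines))

error_message = None
try:
    combine_lines(bytes_lines)
except TypeError as exc:
    # str.join() refuses bytes items, which matches the error in the report
    error_message = str(exc)
    print(error_message)

print(combine_lines_decoded(bytes_lines))
```

If that assumption holds, decoding the stream before it reaches the join would be one possible direction for a fix; whether that is the right place to do it in pandas is not something this sketch can settle.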
Expected Output
read_json should behave the same whether path_or_buf is an OS path or a file URL, including when chunksize is set.
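Until the behaviour is consistent, a possible workaround is to convert the file URL into an OS path before calling read_json, since the OS path works with chunksize in the session above. This is a minimal sketch using only the standard library; file_url_to_path is a hypothetical helper name, and it only accepts URLs with an empty or localhost host.

```python
from urllib.parse import urlparse
from urllib.request import url2pathname


def file_url_to_path(url):
    """Convert a local file:// URL into an OS path (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.scheme != 'file':
        raise ValueError('not a file URL: {!r}'.format(url))
    if parsed.netloc not in ('', 'localhost'):
        raise ValueError('non-local host in file URL: {!r}'.format(parsed.netloc))
    # url2pathname undoes percent-encoding and applies platform path rules
    return url2pathname(parsed.path)


fileurl = 'file://localhost/home/guy/tmp/test.json'
ospath = file_url_to_path(fileurl)
print(ospath)
```

The resulting ospath can then be passed as pd.read_json(ospath, lines=True, chunksize=1), which is the call that iterates without error in the example above.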
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-20-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None