
Unexpected ValueError when using a json reader to read file from disk using chunksize #27022

Closed
@vlbrown


Originally sent by email on June 23; posted here as recommended by Marc Garcia.

Code Sample, a copy-pastable example if possible

from io import StringIO

import pandas as pd

path = 'file://localhost/Users/vlb/Learn/DSC_Intro/'
filename = path + 'yelp_dataset/review_test.json'

# read the entire file -- this works
reviews = pd.read_json(filename, lines=True)
reviews.info()

# create a reader to read in chunks -- this part seems to work
review_reader = pd.read_json(StringIO(filename), lines=True, chunksize=1)
type(review_reader)

# But trying to read from the reader throws an error
# ValueError: Unexpected character found when decoding 'false'
for chunk in review_reader:
    print(chunk)

Data Samples

Either or both of the following records can be used:

{"review_id":"rEITo90tpyKmEfNDp3Ou3A","user_id":"6Fz_nus_OG4gar721OKgZA","business_id":"6lj2BJ4tJeu7db5asGHQ4w","stars":5.0,"useful":0,"funny":0,"cool":0,"text":"We've been a huge Slim's fan since they opened one up in Texas about two years ago when we used to live there. This place never disappoints. They even have great salads and grilled chicken. Plus they have fresh brewed sweet tea, it's the best!","date":"2017-05-26 01:23:19"}
{"review_id":"Amo5gZBvCuPc_tZNpHwtsA","user_id":"DzZ7piLBF-WsJxqosfJgtA","business_id":"qx6WhZ42eDKmBchZDax4dQ","stars":5.0,"useful":1,"funny":0,"cool":0,"text":"Our family LOVES the food here. Quick, friendly, delicious, and a great restaurant to take kids to. 5 stars!","date":"2017-03-27 01:14:37"}

Problem description


I am working through a tutorial that uses a JSON data file from Yelp. The file is huge, so it needs to be read in chunks.

I get an unexpected error: ValueError: Unexpected character found when decoding 'false'

For testing purposes, I have reduced the dataset to a much smaller file with only 3 lines. I can reproduce the error with that file as well as with a file containing only one (any one) of the three lines.

Note that if I simply read in the entire (test) data set in one go, that works. It's only when I create a reader and try to iterate over the chunks that I get the error.
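
A possible explanation, offered as an assumption rather than a confirmed diagnosis: StringIO(filename) wraps the text of the path itself, not the contents of the file, so the chunked reader would be trying to parse the path string as JSON. A path starting with "f" could plausibly produce the complaint about decoding 'false'. The sketch below shows the alternative of passing the path directly together with chunksize. The temporary file and the shortened records are made up for illustration and are not the Yelp data.

```python
import json
import os
import tempfile
from io import StringIO

import pandas as pd

# Hypothetical stand-in records, one JSON object per line.
records = [
    {"review_id": "a1", "stars": 5.0, "useful": 0},
    {"review_id": "b2", "stars": 4.0, "useful": 1},
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "review_test.json")
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

    # StringIO(path) holds the path text, not the file contents.
    wrapped = StringIO(path).read()

    # Passing the path itself lets pandas open the file and read it lazily.
    reader = pd.read_json(path, lines=True, chunksize=1)
    chunks = list(reader)  # each chunk is a DataFrame with one row

for chunk in chunks:
    print(chunk)
```

If the data really is already in memory as a string, StringIO would need to wrap that JSON text rather than the filename.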

Expected Output

No errors. A chunk should print.

If there is an error, it should be less opaque than "Unexpected character found when decoding 'false'".

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0
gcsfs: None
