How to use nrows along with chunksize in read_json() #36791

Closed
@madolmo

Description

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.


Question about pandas

Why do I need to use nrows when reading large JSON Lines files with the chunksize option?
Since version 1.1 I have been having trouble with read_json(): even when I pass chunksize with the value that used to work on pandas 1.0.5, the file seems to be read all at once, which in my case ends in a MemoryError. If I also pass the nrows option this doesn't happen, but why? And what value do you have to give nrows to load the entire file? Do you have to know the maximum number of rows in advance? Is there a special value for "all rows", like -1 or 0?

Thanks

import pandas as pd

# this raises a MemoryError (with a 4 GB file) - it worked on version 1.0.5
reader = pd.read_json(f"{path}map_records.json", orient='records', lines=True, chunksize=100000)
chunks = [chunk[(chunk.bidbasket == "BSKGEOALL00000000001") & (chunk.tipomappa == "AULTIPMPS_GIT")][['bidsubzona', 'idoriginale', 'bidciv', 'bidbasket', 'tipomappa']] for chunk in reader]

# this works, but it loads at most <nrows> rows, so I have to know the maximum number of rows in advance
reader = pd.read_json(f"{path}map_records.json", orient='records', lines=True, chunksize=100000, nrows=20000000)
chunks = [chunk[(chunk.bidbasket == "BSKGEOALL00000000001") & (chunk.tipomappa == "AULTIPMPS_GIT")][['bidsubzona', 'idoriginale', 'bidciv', 'bidbasket', 'tipomappa']] for chunk in reader]
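
For reference, one way to avoid hard-coding nrows: with lines=True each record occupies exactly one line, so the total row count can be obtained with a cheap pass over the file and passed as nrows. A minimal sketch, assuming the same file, filter values, and column list as above (and that the explicit nrows workaround is still needed on the installed version):

import pandas as pd

# count the records without parsing any JSON (one record per line)
with open(f"{path}map_records.json", "rb") as f:
    total_rows = sum(1 for _ in f)

reader = pd.read_json(f"{path}map_records.json", orient='records', lines=True,
                      chunksize=100000, nrows=total_rows)
cols = ['bidsubzona', 'idoriginale', 'bidciv', 'bidbasket', 'tipomappa']
chunks = [
    chunk.loc[(chunk.bidbasket == "BSKGEOALL00000000001")
              & (chunk.tipomappa == "AULTIPMPS_GIT"), cols]
    for chunk in reader
]
result = pd.concat(chunks, ignore_index=True)  # combine the filtered chunks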

Labels

IO JSON (read_json, to_json, json_normalize), Regression (Functionality that used to work in a prior pandas version)
