
ENH: Automate reading of data in chunks #61110

Open
@acampove

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I have a file with 20 GB of data that I need to process. When I use a pandas DataFrame, the full 20 GB has to be loaded, which makes the computer slow or can even crash it. Could this process be made more efficient by automatically (it is very important that the user does not have to do anything here) loading a chunk, processing it, writing it, loading the second chunk, and so on?

This sort of thing is possible; ROOT does it, for instance.
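
For context, pandas can already do this with an explicit loop: pd.read_csv accepts a chunksize argument and returns an iterator of DataFrames. A minimal sketch of the manual version that this request would automate, where the file names and the filtering step are placeholder assumptions:

import pandas as pd

# Manual chunked processing: read ~1 million rows at a time,
# process each piece, and append the result to an output file.
reader = pd.read_csv("input.csv", chunksize=1_000_000)
for i, chunk in enumerate(reader):
    processed = chunk[chunk["value"] > 0]  # placeholder transformation
    processed.to_csv(
        "output.csv",
        mode="w" if i == 0 else "a",  # overwrite first, then append
        header=(i == 0),              # write the header only once
        index=False,
    )

The request is for pandas to hide this loop from the user entirely.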

Feature Description

This would just work with normal DataFrames; there could be an option like

pd.chunk_size = 100

which would process 100 MB at a time, so that no more than 100 MB would ever be in memory.
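
From the user's side, the earlier loop could then collapse to something like the sketch below. Note that pd.chunk_size is the hypothetical option proposed here and does not exist in pandas today; the file and column names are placeholders:

import pandas as pd

pd.chunk_size = 100  # hypothetical option: hold at most ~100 MB in memory

# With automatic chunking, this script would stream the 20 GB file
# through memory in ~100 MB pieces instead of loading it all at once.
df = pd.read_csv("input.csv")
df = df[df["value"] > 0]
df.to_csv("output.csv", index=False)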

Alternative Solutions

Alternatively, we can use ROOT's RDataFrame:

import ROOT

rdf = ROOT.RDataFrame('tree', 'path_to_file.root')
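
RDataFrame works lazily: operations are registered first and the data is then streamed through them in a single event loop, so the full file never has to sit in memory. A short continuation of the snippet above, where the column and output file names are assumptions:

import ROOT

# Operations are registered lazily; the event loop only runs
# when a result (here, the Snapshot) is actually requested.
rdf = ROOT.RDataFrame('tree', 'path_to_file.root')
rdf = rdf.Filter('value > 0')        # placeholder selection
rdf.Snapshot('tree', 'output.root')  # triggers the loop and writes out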

Additional Context

No response

Metadata

Assignees

No one assigned

Labels

Enhancement · Needs Info (Clarification about behavior needed to assess issue)

Milestone

No milestone