Skip to content

StataReader processes whole file before reading in chunks #48700

Closed
@sterlinm

Description

@sterlinm

I've noticed that when reading large Stata files using the chunksize parameter the time it takes to create the StataReader object is affected by the size of the file. This is a bit surprising since all of the metadata it needs is contained in the file header so it seems like it should take the same time regardless of the total file size.

I took a look at the code and it seems like the culprit is this line that reads the entire file into a BytesIO object before parsing the header. I'm not entirely sure what this accomplishes. Ideally it would be nice to be able to create the StataReader object after processing just the header portion of the file.

pandas/pandas/io/stata.py

Lines 1167 to 1175 in 71fc89c

with get_handle(
path_or_buf,
"rb",
storage_options=storage_options,
is_text=False,
compression=compression,
) as handles:
# Copy to BytesIO, and ensure no encoding
self.path_or_buf = BytesIO(handles.handle.read())

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions