ENH: preview_csv() for Fast First-N-Line Preview of Very Large (>100GB) CSV Files #61281

Open
@visheshrwl

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The current pandas.read_csv() implementation is designed for robust and complete CSV parsing. However, even when users request only a few rows via nrows=X, the function:

  • Initializes the full parsing engine
  • Performs column-wise type inference
  • Scans for delimiter/header consistency
  • May read a large portion or all of the file, even for small previews

For large datasets (10–100GB CSVs), this results in significant I/O, CPU, and memory overhead — all when the user likely just wants a quick preview of the data.

This is a common pattern in:

  • Exploratory Data Analysis (EDA)
  • Data cataloging and profiling
  • Schema validation or column sniffing
  • Dashboards and notebook tooling

Currently, users resort to workarounds like:

reader = pd.read_csv("large_file.csv", chunksize=5)
preview = next(reader)  # first chunk as a 5-row DataFrame

or shell-level hacks like:

head -n 5 large_file.csv

These workarounds are non-intuitive, return unstructured output, or live outside the pandas ecosystem.

Feature Description

Introduce a new function:

pandas.preview_csv(filepath_or_buffer, nrows=5, ...)

Goals

  • Read only the first n rows plus the header line
  • Avoid loading, or inferring types from, the full dataset
  • Skip full column validation
  • Fall back to object dtype unless dtype_infer=True
  • Support basic options such as delimiter, encoding, and header presence

Proposed API:

def preview_csv(
    filepath_or_buffer,
    nrows: int = 5,
    delimiter: str = ",",
    encoding: str = "utf-8",
    has_header: bool = True,
    dtype_infer: bool = False,
    as_generator: bool = False
) -> pd.DataFrame:
    ...
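
A minimal sketch of how this could be implemented, using only the standard library's csv module plus itertools.islice. This is illustrative, not the proposed implementation: it handles only path inputs, omits the as_generator option, and the numeric-coercion loop is one assumed way to honor dtype_infer.

import csv
import itertools

import pandas as pd

def preview_csv(
    filepath_or_buffer,
    nrows: int = 5,
    delimiter: str = ",",
    encoding: str = "utf-8",
    has_header: bool = True,
    dtype_infer: bool = False,
) -> pd.DataFrame:
    # Stream only the header (if any) plus the first nrows records;
    # csv.reader is lazy, so the rest of the file is never read.
    with open(filepath_or_buffer, newline="", encoding=encoding) as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader, None) if has_header else None
        rows = list(itertools.islice(reader, nrows))
    df = pd.DataFrame(rows, columns=header, dtype=object)
    if dtype_infer:
        # Opt-in inference runs on the small preview only, never the full file.
        for col in df.columns:
            try:
                df[col] = pd.to_numeric(df[col])
            except (ValueError, TypeError):
                pass
    return df

Because csv.reader yields complete records, quoted fields that contain embedded newlines are still parsed correctly, which a naive readlines() slice gets wrong.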

Alternative Solutions

Tool / method comparison:

  • pd.read_csv(nrows=X): still initializes the full parsing engine and performs dtype inference and column validation on the rows it reads. Limitation: not optimized for quick previews; incurs overhead even for small nrows.
  • pd.read_csv(chunksize=X): returns an iterator of chunks (DataFrames of size X). Limitation: requires non-intuitive iterator handling; users often want a DataFrame directly.
  • csv.reader + slicing: Python's built-in CSV reader is lightweight and fast. Limitation: returns raw lists, not a DataFrame; lacks header handling and column inference.
  • subprocess.run(["head", "-n", ...]): OS-level utility that returns the first N lines. Limitation: not portable across platforms and doesn't integrate with the DataFrame workflow.
  • Polars, pl.read_csv(..., n_rows=X): Rust-based, very fast CSV reader. Limitation: requires installing a new library; pandas users may not want to switch ecosystems.
  • Dask, dd.read_csv(...).head(): lazy, out-of-core loading with chunked processing. Limitation: the overhead of a distributed engine is unnecessary for simple previews.
  • open(...).readlines(N): naive Python read of the first N lines. Limitation: doesn't handle delimiters, quoting, or schema properly.
  • pyarrow.csv.read_csv(...) plus slicing: efficient Arrow-based preview. Limitation: requires Apache Arrow APIs and returns Arrow tables unless converted.
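
For comparison, the closest low-overhead workaround today is probably pyarrow's streaming reader, which avoids materializing the whole file but still leaves the user outside plain pandas (sketch assuming pyarrow is installed):

import pyarrow.csv as pv

# open_csv streams the file in batches instead of reading it whole
reader = pv.open_csv("large_file.csv")
batch = reader.read_next_batch()     # first batch only
preview = batch.to_pandas().head(5)  # convert just that batch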

While workarounds exist, none provide a clean, idiomatic, native pandas function to:

  • Efficiently load the first N rows
  • Return a DataFrame immediately
  • Avoid dtype inference
  • Skip full file validation
  • Avoid requiring third-party dependencies

A dedicated pandas.preview_csv() would fill this gap and offer an elegant, performant solution for quick data previews.
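
Hypothetical usage, assuming the function lands under the top-level pandas namespace as proposed:

import pandas as pd

df = pd.preview_csv("large_file.csv", nrows=5)  # proposed API, not yet in pandas
print(df.dtypes)  # all columns are object unless dtype_infer=True is passed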

Additional Context

No response

Labels

Enhancement, IO CSV (read_csv, to_csv), Needs Discussion
