Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
The current `pandas.read_csv()` implementation is designed for robust and complete CSV parsing. However, even when users request only a few lines using `nrows=X`, the function:
- Initializes the full parsing engine
- Performs column-wise type inference
- Scans for delimiter/header consistency
- May read a large portion or all of the file, even for small previews
For large datasets (10–100 GB CSVs), this results in significant I/O, CPU, and memory overhead, all when the user likely just wants a quick preview of the data.
This is a common pattern in:
- Exploratory Data Analysis (EDA)
- Data cataloging and profiling
- Schema validation or column sniffing
- Dashboards and notebook tooling
Currently, users resort to workarounds like:

```python
next(pd.read_csv(..., chunksize=5))
```

or shell-level hacks like:

```shell
head -n 5 large_file.csv
```

These are non-intuitive, unstructured, or live outside the pandas ecosystem.
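The chunked-read workaround above, spelled out in full (the file contents here are illustrative, using an in-memory buffer in place of a large file on disk):

```python
import io

import pandas as pd

# A small in-memory CSV standing in for a large file on disk.
csv_data = io.StringIO(
    "id,name,score\n"
    "1,alice,0.9\n"
    "2,bob,0.8\n"
    "3,carol,0.7\n"
    "4,dave,0.6\n"
    "5,erin,0.5\n"
    "6,frank,0.4\n"
)

# Workaround: request an iterator of 5-row chunks, then take only the first chunk.
preview = next(pd.read_csv(csv_data, chunksize=5))
print(preview.shape)  # (5, 3)
```

This works, but the user must know that `chunksize` changes the return type from a `DataFrame` to an iterator, which is exactly the non-intuitive step this proposal aims to remove.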
Feature Description
Introduce a new function:

```python
pandas.preview_csv(filepath_or_buffer, nrows=5, ...)
```
Goals
- Read only the first `nrows` rows plus the header line
- Avoid loading or inferring types from the full dataset
- No full column validation
- Fall back to `object` dtype unless `dtype_infer=True`
- Support basic options like delimiter, encoding, and header presence
Proposed API:

```python
def preview_csv(
    filepath_or_buffer,
    nrows: int = 5,
    delimiter: str = ",",
    encoding: str = "utf-8",
    has_header: bool = True,
    dtype_infer: bool = False,
    as_generator: bool = False,
) -> pd.DataFrame:
    ...
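A minimal sketch of what such a function could do internally, using only the standard-library `csv` module so that no full parsing engine is spun up. This is an illustration of the idea, not the proposed implementation, and it omits the `as_generator` path:

```python
import csv
import io
from itertools import islice

import pandas as pd


def preview_csv(filepath_or_buffer, nrows=5, delimiter=",",
                encoding="utf-8", has_header=True, dtype_infer=False):
    """Read only the header plus the first `nrows` data rows."""
    # Accept either a path or an already-open text buffer.
    if isinstance(filepath_or_buffer, str):
        handle = open(filepath_or_buffer, newline="", encoding=encoding)
        close = True
    else:
        handle, close = filepath_or_buffer, False
    try:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader) if has_header else None
        rows = list(islice(reader, nrows))  # I/O stops after nrows lines
    finally:
        if close:
            handle.close()
    if header is None:
        header = [f"col{i}" for i in range(len(rows[0]))] if rows else []
    # Everything stays as strings (object dtype) unless inference is requested.
    df = pd.DataFrame(rows, columns=header, dtype=object)
    return df.infer_objects() if dtype_infer else df


sample = io.StringIO("a,b\n1,x\n2,y\n3,z\n")
print(preview_csv(sample, nrows=2))
```

The key design choice is that `islice` bounds the I/O to `nrows` lines, and the default `object` dtype skips inference entirely, matching the goals above.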
Alternative Solutions
| Tool / Method | Behavior | Limitation |
|---|---|---|
| `pd.read_csv(nrows=X)` | Initializes the full parsing engine, performs dtype inference and column validation | Not optimized for quick previews; incurs overhead even for small `nrows` |
| `pd.read_csv(chunksize=X)` | Returns an iterator of chunks (DataFrames of size X) | Requires non-intuitive iterator handling; users often want a DataFrame directly |
| `csv.reader` + slicing | Python's built-in CSV reader is lightweight and fast | Returns raw lists, not a DataFrame; lacks header handling and column inference |
| `subprocess.run(["head", "-n"])` | OS-level utility that returns the first N lines | Not portable across platforms; doesn't integrate with the DataFrame workflow |
| Polars: `pl.read_csv(..., n_rows)` | Rust-based, very fast CSV reader | Requires installing a new library; pandas users may not want to switch ecosystems |
| Dask: `dd.read_csv(...).head()` | Lazy, out-of-core loading with chunked processing | Overhead of a distributed engine is unnecessary for simple previews |
| `open(...).readlines(N)` | Naive Python read of the first N lines | Doesn't handle parsing, delimiters, or schema properly |
| `pyarrow.csv.read_csv(...)[0:X]` | Efficient Arrow-based preview | Requires using Apache Arrow APIs; returns Arrow tables unless converted |
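For comparison, the `csv.reader` + slicing row above looks like this in practice; it avoids the parsing engine but leaves all DataFrame assembly to the caller (file contents are illustrative):

```python
import csv
import io
from itertools import islice

import pandas as pd

buf = io.StringIO("city,temp\nOslo,3\nLima,22\nCairo,30\n")

reader = csv.reader(buf)
header = next(reader)           # header handling is manual
rows = list(islice(reader, 2))  # raw lists of strings, not a DataFrame

# The caller must build the DataFrame (and decide on dtypes) by hand.
preview = pd.DataFrame(rows, columns=header)
print(preview)
```

This is essentially the workaround that a built-in `preview_csv` would package into a single call.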
While workarounds exist, none provide a clean, idiomatic, native pandas function to:
- Efficiently load the first N rows
- Return a `DataFrame` immediately
- Avoid dtype inference
- Skip full file validation
- Avoid requiring third-party dependencies
A dedicated `pandas.preview_csv()` would fill this gap and offer an elegant, performant solution for quick data previews.
Additional Context
No response