Description
Feature Type
-
Adding new functionality to pandas
-
Changing existing functionality in pandas
-
Removing existing functionality in pandas
Problem Description
When using pd.read_sql_query
with chunksize=
set, the result is an iterator of data frames.
Often the underlying database library exposes a total row count for the query, which can be used to estimate task completion time, allocate resources, etc.
However this is not exposed anywhere in the pd.read_sql
/pd.read_sql_query
/pd.read_sql_table
interface. It would be very useful if users could access this, perhaps as an attribute on the iterator object itself.
One hypothetical usage among many:
df_iter = pd.read_sql_query(query, engine, chunksize=k)
for chunk in tqdm(df_iter, total=df_iter.rowcount):
...
Ideally .rowcount
would also be exposed as __len__
, so len(df_iter) == df_iter.rowcount
.
Feature Description
This can be facilitated by pulling out the implementation of pandas.io.sql.SQLDatabase._query_iterator
into a helper class:
from collections.abc import Sequence
from typing import Any
import sqlalchemy as sqla
class QueryIterator:
def __init__(self, result: attribute sqla.engine.CursorResult) -> None:
self.result = result
def __iter__(self) -> 'QueryIterator':
return self
@property
def rowcount(self) -> int:
return self.result.rowcount
def __len__(self) -> int:
return self.result.rowcount
def __next__(self) -> Sequence[sqla.engine.Row[Any]]:
... # implementation here
Alternative Solutions
Pre-computing SELECT count(*)
is not always an option, and generally is not a great design, partly because of TOCTOU inconsistentcy and partly because it's just not necessary in most cases.
Additional Context
I'm happy to block out some time to work on this, if the feature is of interest and it's not already WIP somewhere.