REF: make coordinates not a state variable in io.pytables #29805

Closed
wants to merge 1 commit into from
12 changes: 4 additions & 8 deletions pandas/io/pytables.py
@@ -1634,7 +1634,6 @@ def __init__(
         self.start = start
         self.stop = stop

-        self.coordinates = None
         if iterator or chunksize is not None:
             if chunksize is None:
                 chunksize = 100000
@@ -1644,14 +1643,12 @@ def __init__(

         self.auto_close = auto_close

-    def __iter__(self):
Member

Hmm on board with the goal, but is this really what we want to do? Wouldn't this mean that the TableIterator class is actually no longer an iterator by definition?

Member Author

I guess. In the case that doesn't go through 1674-1675 below, TableIterator is already not really an iterator, so in that sense this makes the name consistently inaccurate.

Member

Ha sounds good. I'll defer to others more familiar with this code

Contributor

There is an open issue to have this just subclass BaseIterator, which should preserve the correct iterator behavior. I would rather fix it that way.

Member Author

I don't see the issue you're referring to, but before long I'll do a dedicated look through the HDF5 issues to see what can be closed.

It isn't clear that subclassing BaseIterator would affect the statefulness that this PR is addressing

Member Author

maybe #12953 or #10310

Contributor

Umm, not those. But in any event, I think if you subclass BaseIterator this will just work.

Member Author

Do you mean collections.abc.Iterator? That would mean we'd have to define __next__, and __iter__ would return self. But for __next__ to work, the relevant coordinates variable would have to be stateful, which is exactly what this PR is trying to avoid.
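To make the trade-off in this comment concrete, here is a minimal sketch contrasting the two shapes. The class names `StatefulChunks` and `ChunkReader` and the method `iter_chunks` are hypothetical stand-ins, not pandas code; they only assume that chunks are slices of a coordinates sequence, as in the diff.

```python
from collections.abc import Iterator


class StatefulChunks(Iterator):
    """Hypothetical: the object is its own iterator, so it must store its position."""

    def __init__(self, coordinates, chunksize):
        self.coordinates = coordinates
        self.chunksize = chunksize
        self.current = 0  # mutable position kept on the instance

    def __next__(self):
        if self.current >= len(self.coordinates):
            raise StopIteration
        chunk = self.coordinates[self.current : self.current + self.chunksize]
        self.current += self.chunksize
        return chunk


class ChunkReader:
    """Hypothetical: coordinates are passed in, and position lives in the generator."""

    def __init__(self, chunksize):
        self.chunksize = chunksize

    def iter_chunks(self, coordinates):
        # Nothing is written to self here; each call gets a fresh generator.
        for start in range(0, len(coordinates), self.chunksize):
            yield coordinates[start : start + self.chunksize]


stateful = StatefulChunks(list(range(5)), 2)
first_pass = list(stateful)
second_pass = list(stateful)  # the instance is already exhausted

reader = ChunkReader(2)
pass_a = list(reader.iter_chunks(list(range(5))))
pass_b = list(reader.iter_chunks(list(range(5))))
```

In this sketch `StatefulChunks` satisfies `collections.abc.Iterator` but can be consumed only once, while `ChunkReader` is not itself an iterator but can hand out independent generators.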

Contributor

This is a non-standard way of doing things; it needs to inherit from https://github.com/pandas-dev/pandas/blob/master/pandas/io/common.py#L81

Member Author

Thanks for the link.

If I'm reading you right, you are adamant that TableIterator itself should be an iterator. Is that accurate?


-        # iterate
+    def iter_result(self, coordinates):
         current = self.start
         while current < self.stop:

             stop = min(current + self.chunksize, self.stop)
-            value = self.func(None, None, self.coordinates[current:stop])
+            value = self.func(None, None, coordinates[current:stop])
             current = stop
             if value is None or not len(value):
                 continue
@@ -1671,9 +1668,8 @@ def get_result(self, coordinates: bool = False):
             if not self.s.is_table:
                 raise TypeError("can only use an iterator or chunksize on a table")

-            self.coordinates = self.s.read_coordinates(where=self.where)
-
-            return self
+            coordinates = self.s.read_coordinates(where=self.where)
+            return self.iter_result(coordinates)

         # if specified read via coordinates (necessary for multiple selections
         if coordinates:
3 changes: 0 additions & 3 deletions pandas/tests/io/pytables/test_store.py
@@ -52,7 +52,6 @@
 )

 from pandas.io import pytables as pytables  # noqa: E402 isort:skip
-from pandas.io.pytables import TableIterator  # noqa: E402 isort:skip


 _default_compressor = "blosc"
@@ -4528,10 +4527,8 @@ def test_read_hdf_iterator(self, setup_path):
             df.to_hdf(path, "df", mode="w", format="t")
             direct = read_hdf(path, "df")
             iterator = read_hdf(path, "df", iterator=True)
-            assert isinstance(iterator, TableIterator)
             indirect = next(iterator.__iter__())
             tm.assert_frame_equal(direct, indirect)
-            iterator.store.close()
Contributor

I think it's a problem not closing the iterator; the fact that you had to change this is suspicious.

Member Author

iter_result does close the store
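Whether a generator's cleanup runs after a single `next()` call depends on where the cleanup sits. The diff above does not show the tail of `iter_result`, so the following is only a generic illustration of Python generator semantics, not a statement about what pandas does. The function names and the `closed` list are hypothetical.

```python
closed = []  # records which hypothetical "stores" were closed


def chunks_close_at_end(values, name):
    # Cleanup after the loop runs only once the generator is fully exhausted.
    for value in values:
        yield value
    closed.append(name)


def chunks_close_in_finally(values, name):
    # try/finally also runs when the generator is closed before exhaustion.
    try:
        for value in values:
            yield value
    finally:
        closed.append(name)


gen_a = chunks_close_at_end([1, 2], "a")
next(gen_a)    # take one chunk only
gen_a.close()  # cleanup after the loop is skipped
after_partial_a = list(closed)

gen_b = chunks_close_in_finally([1, 2], "b")
next(gen_b)
gen_b.close()  # the finally block runs here
after_partial_b = list(closed)

list(chunks_close_at_end([1, 2], "c"))  # full exhaustion reaches the cleanup
after_full_c = list(closed)
```

A test that takes only the first chunk, as `test_read_hdf_iterator` does, would exercise the partial-consumption path in a sketch like this.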


     def test_read_hdf_errors(self, setup_path):
         df = DataFrame(np.random.rand(4, 5), index=list("abcd"), columns=list("ABCDE"))