Skip to content

CI: test for file leakage #30162

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Dec 11, 2019
1 change: 1 addition & 0 deletions ci/deps/azure-36-minimum_versions.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ dependencies:
- pytest-xdist>=1.21
- hypothesis>=3.58.0
- pytest-azurepipelines
- psutil

# pandas dependencies
- beautifulsoup4=4.6.0
Expand Down
12 changes: 11 additions & 1 deletion pandas/io/excel/_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -896,7 +896,7 @@ def close(self):
# https://stackoverflow.com/questions/31416842/
# openpyxl-does-not-close-excel-workbook-in-read-only-mode
wb = self.book
wb._archive.close()
wb.close()

if hasattr(self.io, "close"):
self.io.close()
Expand All @@ -910,3 +910,13 @@ def __exit__(self, exc_type, exc_value, traceback):
def __del__(self):
# Ensure we don't leak file descriptors
self.close()
if self.engine == "openpyxl":
# openpyxl 2.5.5 and earlier could leave open files behind
# https://bitbucket.org/openpyxl/openpyxl/issues/832
import openpyxl
from distutils.version import LooseVersion

if LooseVersion(openpyxl.__version__) <= LooseVersion("2.5.5"):
import gc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really this works? this is very weird to do inside a __del__.

reading the link I would rather so something like (and only in the openpyxl engine)

if LooseVersion(openpyxl.__version__) <= LooseVersion("2.5.5"):
     for s in self.sheets.values:
            s.xml_source.close()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe need a try/except around the close

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does work, agreed that calling gc is ugly, will try the xml_source.close thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no dice, no sheets attribute

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right i am not sure of the correct attribute here
but shoulda be clear if step into it

just really want to avoid the gc call in del

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd any thoughts on how to make this work without gc call?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this cleanup to a test fixture used for the relevant tests? And just ensure that it's called for the relevant openpyxl version before check_for_file_leaks is run?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess as a general comment can we move this to the _openpyxl instead of checking the engine in the base class? Or does that not work?

I think you can just use the get_sheet_by_index property to iterate over the sheets and close the related xml_source

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm ignore comment before on moving to _openpyxl - I see this is for the ExcelFile class and not the Reader

I think you could do something like:

sheet_count = len(self._reader.sheet_names)
for i in range(sheet_count):
    sheet = self._reader.get_sheet_by_index(i)
    sheet.xml_source.close()

IIUC

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sheet.xml_source.close()

still no dice. I'm sure there is some amount of digging around that can effectively re-implement the fix that openpyxl did, but im not wild about digging into openpyxl internals, especially for a known-fixed bug


gc.collect()
24 changes: 24 additions & 0 deletions pandas/tests/io/excel/conftest.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
import pytest

import pandas.util._test_decorators as td

import pandas.util.testing as tm

from pandas.io.parsers import read_csv
Expand Down Expand Up @@ -39,3 +41,25 @@ def read_ext(request):
Valid extensions for reading Excel files.
"""
return request.param


@pytest.fixture(autouse=True)
def check_for_file_leaks():
"""
Fixture to run around every test to ensure that we are not leaking files.
See also
--------
_test_decorators.check_file_leaks
"""
# GH#30162
psutil = td.safe_import("psutil")
if not psutil:
yield

else:
proc = psutil.Process()
flist = proc.open_files()
yield
flist2 = proc.open_files()
assert flist == flist2