ENH: Add calamine engine to read_excel
#50581
Changes from 43 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -55,4 +55,5 @@ dependencies: | |
- zstandard>=0.15.2 | ||
|
||
- pip: | ||
- python-calamine | ||
- tzdata>=2022.1 |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -55,4 +55,5 @@ dependencies: | |
- zstandard>=0.15.2 | ||
|
||
- pip: | ||
- python-calamine>=0.1.0 | ||
- tzdata>=2022.1 |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -70,4 +70,5 @@ dependencies: | |
- py | ||
|
||
- pip: | ||
- python-calamine | ||
- tzdata>=2022.1 |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -59,4 +59,5 @@ dependencies: | |
|
||
- pip: | ||
- pyqt5==5.15.1 | ||
- python-calamine==0.1.0 | ||
- tzdata==2022.1 |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -55,4 +55,5 @@ dependencies: | |
- zstandard>=0.15.2 | ||
|
||
- pip: | ||
- python-calamine | ||
- tzdata>=2022.1 |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -55,4 +55,5 @@ dependencies: | |
- zstandard>=0.15.2 | ||
|
||
- pip: | ||
- python-calamine | ||
- tzdata>=2022.1 |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -54,3 +54,6 @@ dependencies: | |
- xlrd>=2.0.1 | ||
- xlsxwriter>=1.4.3 | ||
- zstandard>=0.15.2 | ||
|
||
- pip: | ||
- python-calamine |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -279,6 +279,7 @@ Other enhancements | |
- :meth:`Series.dropna` and :meth:`DataFrame.dropna` has gained ``ignore_index`` keyword to reset index (:issue:`31725`) | ||
- Improved error message in :func:`to_datetime` for non-ISO8601 formats, informing users about the position of the first error (:issue:`50361`) | ||
- Improved error message when trying to align :class:`DataFrame` objects (for example, in :func:`DataFrame.compare`) to clarify that "identically labelled" refers to both index and columns (:issue:`50083`) | ||
- Performance improvement in :func:`to_datetime` when format is given or can be inferred (:issue:`50465`) | ||
We are in the process of cutting 2.0.0 now; at this point this PR should target 2.1.0.
@kostyafarber, hi! Can you merge main into issue-50395?
Yep, will do.
@dimastbk merged!
Looks like this is still in the v2.0.0 whatsnew.
Yes, I asked for main to be merged into the PR branch. Now fixed.
Merged.
You should revert any changes to this file; the v2.0.0.rst file shouldn't be touched as part of this PR.
||
- Added support for :meth:`Index.min` and :meth:`Index.max` for pyarrow string dtypes (:issue:`51397`) | ||
- Added :meth:`DatetimeIndex.as_unit` and :meth:`TimedeltaIndex.as_unit` to convert to different resolutions; supported resolutions are "s", "ms", "us", and "ns" (:issue:`50616`) | ||
- Added :meth:`Series.dt.unit` and :meth:`Series.dt.as_unit` to convert to different resolutions; supported resolutions are "s", "ms", "us", and "ns" (:issue:`51223`) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,99 @@ | ||
from __future__ import annotations | ||
|
||
from datetime import ( | ||
date, | ||
datetime, | ||
time, | ||
) | ||
from typing import ( | ||
TYPE_CHECKING, | ||
Union, | ||
) | ||
|
||
from pandas.compat._optional import import_optional_dependency | ||
from pandas.util._decorators import doc | ||
|
||
import pandas as pd | ||
from pandas.core.shared_docs import _shared_docs | ||
|
||
from pandas.io.excel._base import BaseExcelReader | ||
|
||
if TYPE_CHECKING: | ||
from pandas._typing import ( | ||
FilePath, | ||
ReadBuffer, | ||
Scalar, | ||
StorageOptions, | ||
) | ||
|
||
_CellValueT = Union[int, float, str, bool, time, date, datetime] | ||
|
||
|
||
class CalamineExcelReader(BaseExcelReader): | ||
@doc(storage_options=_shared_docs["storage_options"]) | ||
def __init__( | ||
self, | ||
filepath_or_buffer: FilePath | ReadBuffer[bytes], | ||
storage_options: StorageOptions = None, | ||
) -> None: | ||
""" | ||
Reader using calamine engine (xlsx/xls/xlsb/ods). | ||
|
||
Parameters | ||
---------- | ||
filepath_or_buffer : str, path to be parsed or | ||
an open readable stream. | ||
{storage_options} | ||
""" | ||
import_optional_dependency("python_calamine") | ||
super().__init__(filepath_or_buffer, storage_options=storage_options) | ||
|
||
@property | ||
def _workbook_class(self): | ||
The protocol must not be explicitly stated in code, but whatever is returned here is supposed to represent the concept of a Workbook. Not very familiar with calamine but the name
Fixed.
||
from python_calamine import CalamineWorkbook | ||
|
||
return CalamineWorkbook | ||
|
||
def load_workbook(self, filepath_or_buffer: FilePath | ReadBuffer[bytes]): | ||
from python_calamine import load_workbook | ||
|
||
return load_workbook(filepath_or_buffer) # type: ignore[arg-type] | ||
Can you advise what the mypy errors are for this and the subsequent ones? Not necessarily a blocker but surprised to see these.
pyright:
mypy:
|
||
|
||
@property | ||
def sheet_names(self) -> list[str]: | ||
return self.book.sheet_names # pyright: ignore | ||
|
||
def get_sheet_by_name(self, name: str): | ||
self.raise_if_bad_sheet_by_name(name) | ||
return self.book.get_sheet_by_name(name) # pyright: ignore | ||
|
||
def get_sheet_by_index(self, index: int): | ||
self.raise_if_bad_sheet_by_index(index) | ||
return self.book.get_sheet_by_index(index) # pyright: ignore | ||
|
||
def get_sheet_data( | ||
self, sheet, file_rows_needed: int | None = None | ||
) -> list[list[Scalar]]: | ||
def _convert_cell(value: _CellValueT) -> Scalar: | ||
if isinstance(value, float): | ||
val = int(value) | ||
if val == value: | ||
return val | ||
else: | ||
return value | ||
elif isinstance(value, date): | ||
return pd.Timestamp(value) | ||
elif isinstance(value, time): | ||
return value.isoformat() | ||
|
||
return value | ||
|
||
rows: list[list[_CellValueT]] = sheet.to_python(skip_empty_area=False) | ||
data: list[list[Scalar]] = [] | ||
|
||
for row in rows: | ||
data.append([_convert_cell(cell) for cell in row]) | ||
if file_rows_needed is not None and len(data) >= file_rows_needed: | ||
break | ||
|
||
return data |
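The cell handling in `get_sheet_data` above can be tried without pandas or python-calamine installed. The following is a minimal standard-library sketch, not the PR's code: the names `convert_cell` and `take_rows` are hypothetical, `pd.Timestamp(value)` is swapped for a plain `datetime`, and the `math.isfinite` guard is an added assumption (the diff calls `int(value)` directly, which raises for NaN and infinity).

```python
from __future__ import annotations

import math
from datetime import date, datetime, time
from typing import Union

# Same union of raw cell types as _CellValueT in the diff.
CellValue = Union[int, float, str, bool, time, date, datetime]


def convert_cell(value: CellValue):
    """Normalize one raw cell value, following the branches in the diff."""
    if isinstance(value, float):
        # Assumption, not in the PR: skip the int round-trip for NaN/inf,
        # because int(float("nan")) raises ValueError.
        if math.isfinite(value) and int(value) == value:
            return int(value)
        return value
    if isinstance(value, date):
        # datetime subclasses date, so this branch covers both types.
        # The PR returns pd.Timestamp(value); this sketch stays in stdlib.
        if isinstance(value, datetime):
            return value
        return datetime(value.year, value.month, value.day)
    if isinstance(value, time):
        return value.isoformat()
    return value


def take_rows(rows, file_rows_needed: int | None = None):
    """Convert rows one by one, stopping once enough rows are collected."""
    data = []
    for row in rows:
        data.append([convert_cell(cell) for cell in row])
        if file_rows_needed is not None and len(data) >= file_rows_needed:
            break
    return data
```

Because `bool` is a subclass of `int` rather than `float`, boolean cells fall through every branch unchanged, which matches the order of checks in the diff.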
Is the datetime issue the only limitation? If so we can probably be more explicit and say something like
python-calamine can be used to read all formats, but specifically does not support reading datetimes from .xls and .xlsb formats
Datetime is the main limitation, but there are a few more bugs, #50581 (comment). I suppressed them all with pytest.xfail, but should I write about them in documentation?