ENH: Add if_sheet_exists parameter to ExcelWriter #40231

Merged
merged 21 commits into from
Apr 22, 2021
Changes from 7 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -125,6 +125,7 @@ Other enhancements
- :func:`to_numeric` now supports downcasting of nullable ``ExtensionDtype`` objects (:issue:`33013`)
- Add support for dict-like names in :class:`MultiIndex.set_names` and :class:`MultiIndex.rename` (:issue:`20421`)
- :func:`pandas.read_excel` can now auto detect .xlsb files (:issue:`35416`)
- :class:`pandas.ExcelWriter` now accepts an ``if_sheet_exists`` parameter to control the behaviour of append mode when writing to existing sheets (:issue:`40230`)
- :meth:`.Rolling.sum`, :meth:`.Expanding.sum`, :meth:`.Rolling.mean`, :meth:`.Expanding.mean`, :meth:`.Rolling.median`, :meth:`.Expanding.median`, :meth:`.Rolling.max`, :meth:`.Expanding.max`, :meth:`.Rolling.min`, and :meth:`.Expanding.min` now support ``Numba`` execution with the ``engine`` keyword (:issue:`38895`)
- :meth:`DataFrame.apply` can now accept NumPy unary operators as strings, e.g. ``df.apply("sqrt")``, which was already the case for :meth:`Series.apply` (:issue:`39116`)
- :meth:`DataFrame.apply` can now accept non-callable DataFrame properties as strings, e.g. ``df.apply("size")``, which was already the case for :meth:`Series.apply` (:issue:`39116`)
21 changes: 21 additions & 0 deletions pandas/io/excel/_base.py
@@ -666,6 +666,17 @@ class ExcelWriter(metaclass=abc.ABCMeta):
be parsed by ``fsspec``, e.g., starting "s3://", "gcs://".

.. versionadded:: 1.2.0
if_sheet_exists : {'new', 'replace', 'overwrite', 'fail'}, default 'new'
Contributor

why is the default not 'error'?

Contributor Author

Mainly to preserve the current behaviour as the default. Happy to change it.

How to behave when trying to write to a sheet that already
exists (append mode only).

* new: Create a new sheet with a different name.
* replace: Delete the contents of the sheet before writing to it.
* overwrite: Write directly to the named sheet
Member

What is the use case for overwrite?

Contributor Author

Published statistics, at least in the UK, often have sheets which combine headings, date, description, data quality notes, etc with data tables. To automate something like this you would probably have a pre-written template and then write your data from pandas into specific sheets at specific locations.

An example I happened to be looking at recently: England's daily, weekly and monthly vaccination figures.

Member

@mrob95 - does the current implementation in this PR overwrite the template formatting? E.g. if a column in the template is formatted to percent and I have a DataFrame with 0.5, will it be displayed in excel as 50%?

Contributor Author

@rhshadrach It doesn't overwrite cell formatting unless an alternative style is set. For example, `df.style.set_properties(**{"number-format": "0.00%"})` will overwrite the number formatting, but otherwise the written cells inherit the previous formatting, including conditional formatting.

The only exception I can see is headers and indexes, which have a hardcoded style that I think will always overwrite certain formats (see `header_style` in io/formats/excel.py). This may not be ideal for certain use cases, but it seems like a separate issue, about which there is already discussion (#25185).

Between this and pandas' own Excel formatting options, I think the options are pretty good for styling tables written using if_sheet_exists="overwrite".
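
The template use case and the formatting behaviour described in this thread can be illustrated with a toy model. This is only a sketch in plain Python, not pandas or openpyxl code: `Cell`, `Sheet` and `write_block` are hypothetical names, and the model simply assumes what the comments above report, namely that writing a value leaves a cell's existing number format alone unless a format is passed explicitly.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Cell:
    """Toy stand-in for a worksheet cell: a value plus a number format."""

    value: Any = None
    number_format: str = "General"


# (row, col) -> Cell, zero-based; a hypothetical in-memory "sheet"
Sheet = Dict[Tuple[int, int], Cell]


def write_block(
    sheet: Sheet,
    rows: List[List[Any]],
    startrow: int,
    startcol: int,
    number_format: Optional[str] = None,
) -> None:
    """Write rows at an offset without clearing any other cell."""
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            cell = sheet.setdefault((startrow + r, startcol + c), Cell())
            cell.value = value
            # Only an explicitly supplied format replaces the existing one.
            if number_format is not None:
                cell.number_format = number_format


# A pre-written "template": a heading plus two cells preformatted as percent.
template: Sheet = {
    (0, 0): Cell("Weekly figures"),
    (2, 0): Cell(None, "0%"),
    (3, 0): Cell(None, "0%"),
}

# "overwrite"-style write: the data lands below the heading, formats survive.
write_block(template, [[0.5], [0.25]], startrow=2, startcol=0)
```

In this model the heading in the first row is untouched and the written cells keep their `"0%"` format, which mirrors the "0.5 shown as 50%" question raised above. The hardcoded header and index styling mentioned in the reply is deliberately not modelled.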

without deleting the previous contents.
* fail: raise a ValueError.
Contributor

rename to 'error'

Contributor Author

Will do


.. versionadded:: 1.3.0

Attributes
----------
@@ -834,6 +845,7 @@ def __init__(
         datetime_format=None,
         mode: str = "w",
         storage_options: StorageOptions = None,
+        if_sheet_exists: Optional[str] = None,
         **engine_kwargs,
     ):
         # validate that this engine can handle the extension
@@ -868,6 +880,15 @@ def __init__(

         self.mode = mode

+        ise_valid = [None, "new", "replace", "overwrite", "fail"]
+        if if_sheet_exists not in ise_valid:
+            raise ValueError(f"'{if_sheet_exists}' is not valid for if_sheet_exists")
+        if if_sheet_exists and "r+" not in mode:
+            raise ValueError("if_sheet_exists is only valid in append mode (mode='a')")
+        if if_sheet_exists is None and "r+" in mode:
Contributor

Why is the r+ condition here? if_sheet_exists should always be a str at this point, right? (If the file is being written rather than appended to, it will be ignored anyhow.)

Contributor

I guess it's not a big deal.

Contributor Author

You're right, fixed

+            if_sheet_exists = "new"
+        self.if_sheet_exists = if_sheet_exists

def __fspath__(self):
return getattr(self.handles.handle, "name", "")
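
The validation added in this hunk can be read as a small standalone function. The following is a minimal sketch, not the code in the PR: `resolve_if_sheet_exists` and `VALID_IF_SHEET_EXISTS` are hypothetical names, the option values are the ones shown at this point in the review (before the requested rename of 'fail' to 'error'), and the default is filled in without the mode check, which is one plausible reading of the reviewer's suggestion above.

```python
from typing import Optional

# Values accepted at this stage of the review; None means "use the default".
VALID_IF_SHEET_EXISTS = (None, "new", "replace", "overwrite", "fail")


def resolve_if_sheet_exists(if_sheet_exists: Optional[str], mode: str) -> str:
    """Validate the option and fill in the default, mirroring the hunk above."""
    if if_sheet_exists not in VALID_IF_SHEET_EXISTS:
        raise ValueError(f"'{if_sheet_exists}' is not valid for if_sheet_exists")
    # Per the comment in _openpyxl.py, mode "a" has been replaced by "r+" here.
    if if_sheet_exists and "r+" not in mode:
        raise ValueError("if_sheet_exists is only valid in append mode (mode='a')")
    # Simplified default: in write mode the value is ignored anyway.
    if if_sheet_exists is None:
        if_sheet_exists = "new"
    return if_sheet_exists
```

Keeping `None` as the signature default and resolving it afterwards is what lets the second check distinguish "the user passed nothing" from "the user explicitly passed 'new' in write mode".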

29 changes: 27 additions & 2 deletions pandas/io/excel/_openpyxl.py
@@ -37,13 +37,18 @@ def __init__(
         engine=None,
         mode: str = "w",
         storage_options: StorageOptions = None,
+        if_sheet_exists: Optional[str] = None,
         **engine_kwargs,
     ):
         # Use the openpyxl module as the Excel writer.
         from openpyxl.workbook import Workbook

         super().__init__(
-            path, mode=mode, storage_options=storage_options, **engine_kwargs
+            path,
+            mode=mode,
+            storage_options=storage_options,
+            if_sheet_exists=if_sheet_exists,
+            **engine_kwargs,
         )

# ExcelWriter replaced "a" by "r+" to allow us to first read the excel file from
@@ -53,6 +58,8 @@ def __init__(

             self.book = load_workbook(self.handles.handle)
             self.handles.handle.seek(0)
+            self.sheets = {name: self.book[name] for name in self.book.sheetnames}
+
         else:
             # Create workbook object with default optimized_write=True.
             self.book = Workbook()
@@ -412,7 +419,25 @@ def write_cells(
         _style_cache: Dict[str, Dict[str, Serialisable]] = {}

         if sheet_name in self.sheets:
-            wks = self.sheets[sheet_name]
+            if "r+" in self.mode:
+                if self.if_sheet_exists == "new":
+                    wks = self.book.create_sheet()
+                    # openpyxl will create a name for the new sheet by appending digits
+                    wks.title = sheet_name
+                    self.sheets[wks.title] = wks
+                elif self.if_sheet_exists == "replace":
+                    wks = self.sheets[sheet_name]
+                    wks.delete_cols(1, wks.max_column)
+                elif self.if_sheet_exists == "overwrite":
+                    wks = self.sheets[sheet_name]
+                elif self.if_sheet_exists == "fail":
+                    raise ValueError(f"Sheet '{sheet_name}' already exists.")
+                else:
+                    raise ValueError(
+                        f"'{self.if_sheet_exists}' is not valid for if_sheet_exists"
+                    )
+            else:
+                wks = self.sheets[sheet_name]
         else:
             wks = self.book.create_sheet()
             wks.title = sheet_name
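
To make the four branches easier to compare, here is a toy model of the dispatch in plain Python. It is a sketch under stated assumptions, not pandas or openpyxl code: a workbook is modelled as a dict from sheet name to a single column of values, `pick_sheet` and `write_column` are hypothetical helpers, and the digit-suffix naming for "new" imitates the behaviour the diff attributes to openpyxl (the test comment expects "foo1").

```python
from typing import Dict, List

# Hypothetical model: sheet name -> one column of cell values.
Workbook = Dict[str, List[str]]


def pick_sheet(book: Workbook, sheet_name: str, if_sheet_exists: str) -> str:
    """Return the name of the sheet to write to, mutating book as needed."""
    if sheet_name not in book:
        book[sheet_name] = []
        return sheet_name
    if if_sheet_exists == "new":
        # Assumed naming: append the first unused digit suffix.
        n = 1
        while f"{sheet_name}{n}" in book:
            n += 1
        book[f"{sheet_name}{n}"] = []
        return f"{sheet_name}{n}"
    if if_sheet_exists == "replace":
        # Drop the old contents, keep the sheet itself.
        book[sheet_name].clear()
        return sheet_name
    if if_sheet_exists == "overwrite":
        # Keep everything; later writes land on top of existing cells.
        return sheet_name
    if if_sheet_exists == "fail":
        raise ValueError(f"Sheet '{sheet_name}' already exists.")
    raise ValueError(f"'{if_sheet_exists}' is not valid for if_sheet_exists")


def write_column(
    book: Workbook, sheet_name: str, values: List[str], if_sheet_exists: str
) -> str:
    """Write values from the top of the chosen sheet and return its name."""
    target = pick_sheet(book, sheet_name, if_sheet_exists)
    column = book[target]
    for i, value in enumerate(values):
        if i < len(column):
            column[i] = value
        else:
            column.append(value)
    return target
```

Writing `["pear"]` over an existing `["apple", "banana"]` column gives `["pear"]` under "replace" and `["pear", "banana"]` under "overwrite", which matches the expectations parametrized in the new openpyxl test.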
58 changes: 58 additions & 0 deletions pandas/tests/io/excel/test_openpyxl.py
@@ -1,4 +1,5 @@
from pathlib import Path
import re

import numpy as np
import pytest
@@ -109,6 +110,63 @@ def test_write_append_mode(ext, mode, expected):
assert wb2.worksheets[index]["A1"].value == cell_value


+@pytest.mark.parametrize(
+    "if_sheet_exists,num_sheets,expected",
+    [
+        ("new", 2, ["apple", "banana"]),
+        (None, 2, ["apple", "banana"]),
+        ("replace", 1, ["pear"]),
+        ("overwrite", 1, ["pear", "banana"]),
+    ],
+)
+def test_if_sheet_exists_append_modes(ext, if_sheet_exists, num_sheets, expected):
+    # GH 40230
+    df1 = DataFrame({"fruit": ["apple", "banana"]})
+    df2 = DataFrame({"fruit": ["pear"]})
+
+    with tm.ensure_clean(ext) as f:
+        df1.to_excel(f, engine="openpyxl", sheet_name="foo", index=False)
+        with pd.ExcelWriter(
+            f, engine="openpyxl", mode="a", if_sheet_exists=if_sheet_exists
+        ) as writer:
+            df2.to_excel(writer, sheet_name="foo", index=False)
+
+        wb = openpyxl.load_workbook(f)
+        assert len(wb.sheetnames) == num_sheets
+        assert wb.sheetnames[0] == "foo"
+        result = pd.read_excel(wb, "foo", engine="openpyxl")
+        assert list(result["fruit"]) == expected
+        if len(wb.sheetnames) == 2:
+            # atm the name given for the second sheet will be "foo1"
+            # but we don't want the test to fail if openpyxl changes this
+            result = pd.read_excel(wb, wb.sheetnames[1], engine="openpyxl")
+            tm.assert_frame_equal(result, df2)
+        wb.close()


+def test_if_sheet_exists_raises(ext):
+    mode_msg = "if_sheet_exists is only valid in append mode (mode='a')"
+    invalid_msg = "'invalid' is not valid for if_sheet_exists"
+    fail_msg = "Sheet 'foo' already exists."
+    df = DataFrame({"fruit": ["pear"]})
+
+    with tm.ensure_clean(ext) as f:
+        with pytest.raises(ValueError, match=re.escape(mode_msg)):
+            ExcelWriter(f, engine="openpyxl", mode="w", if_sheet_exists="new")
+
+    with tm.ensure_clean(ext) as f:
+        with pytest.raises(ValueError, match=invalid_msg):
+            ExcelWriter(f, engine="openpyxl", mode="a", if_sheet_exists="invalid")
+
+    with tm.ensure_clean(ext) as f:
+        with pytest.raises(ValueError, match=fail_msg):
+            df.to_excel(f, "foo", engine="openpyxl")
+            with pd.ExcelWriter(
+                f, engine="openpyxl", mode="a", if_sheet_exists="fail"
+            ) as writer:
+                df.to_excel(writer, sheet_name="foo")


def test_to_excel_with_openpyxl_engine(ext):
# GH 29854
with tm.ensure_clean(ext) as filename:
4 changes: 4 additions & 0 deletions pandas/tests/io/excel/test_xlsxwriter.py
@@ -1,3 +1,4 @@
import re
import warnings

import pytest
@@ -57,7 +58,10 @@ def test_column_format(ext):

 def test_write_append_mode_raises(ext):
     msg = "Append mode is not supported with xlsxwriter!"
+    ise_msg = "if_sheet_exists is only valid in append mode (mode='a')"

     with tm.ensure_clean(ext) as f:
         with pytest.raises(ValueError, match=msg):
             ExcelWriter(f, engine="xlsxwriter", mode="a")
+        with pytest.raises(ValueError, match=re.escape(ise_msg)):
+            ExcelWriter(f, engine="xlsxwriter", if_sheet_exists="replace")