ENH: read_html() handles tables with multiple header rows #13434 #15242

Closed
wants to merge 6 commits into from
7 changes: 4 additions & 3 deletions doc/source/io.rst
@@ -2232,9 +2232,10 @@ Read a URL and match a table that contains specific text
    match = 'Metcalf Bank'
    df_list = pd.read_html(url, match=match)

-Specify a header row (by default ``<th>`` elements are used to form the column
-index); if specified, the header row is taken from the data minus the parsed
-header elements (``<th>`` elements).
+Specify a header row (by default ``<th>`` or ``<td>`` elements located within a
+``<thead>`` are used to form the column index, if multiple rows are contained within
+``<thead>`` then a multiindex is created); if specified, the header row is taken
+from the data minus the parsed header elements (``<th>`` elements).

 .. code-block:: python

1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -309,6 +309,7 @@ Other Enhancements
 - ``pandas.tools.hashing`` has gained a ``hash_tuples`` routine, and ``hash_pandas_object`` has gained the ability to hash a ``MultiIndex`` (:issue:`15224`)
 - ``Series/DataFrame.squeeze()`` have gained the ``axis`` parameter. (:issue:`15339`)
 - ``DataFrame.to_excel()`` has a new ``freeze_panes`` parameter to turn on Freeze Panes when exporting to Excel (:issue:`15160`)
+- ``pd.read_html()`` parses multiple header rows, creating a multiindex header. (:issue:`13434`).
 - HTML table output skips ``colspan`` or ``rowspan`` attribute if equal to 1. (:issue:`15403`)

 - ``pd.TimedeltaIndex`` now has a custom datetick formatter specifically designed for nanosecond level precision (:issue:`8711`)
31 changes: 20 additions & 11 deletions pandas/io/html.py
@@ -355,9 +355,12 @@ def _parse_raw_thead(self, table):
         thead = self._parse_thead(table)
         res = []
         if thead:
-            res = lmap(self._text_getter, self._parse_th(thead[0]))
-        return np.atleast_1d(
-            np.array(res).squeeze()) if res and len(res) == 1 else res
+            trs = self._parse_tr(thead[0])
+            for tr in trs:
+                cols = lmap(self._text_getter, self._parse_td(tr))
+                if any([col != '' for col in cols]):
+                    res.append(cols)
+        return res

     def _parse_raw_tfoot(self, table):
         tfoot = self._parse_tfoot(table)
@@ -591,9 +594,17 @@ def _parse_tfoot(self, table):
         return table.xpath('.//tfoot')

     def _parse_raw_thead(self, table):
-        expr = './/thead//th'
-        return [_remove_whitespace(x.text_content()) for x in
-                table.xpath(expr)]
+        expr = './/thead'
+        thead = table.xpath(expr)
+        res = []
+        if thead:
+            trs = self._parse_tr(thead[0])
+            for tr in trs:
+                cols = [_remove_whitespace(x.text_content()) for x in
+                        self._parse_td(tr)]
+                if any([col != '' for col in cols]):
+                    res.append(cols)
+        return res

     def _parse_raw_tfoot(self, table):
         expr = './/tfoot//th|//tfoot//td'
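
Both `_parse_raw_thead` changes above apply the same rule: walk each `<tr>` inside the first `<thead>`, collect the text of its cells (`<th>` or `<td>`), and keep the row only if at least one cell is non-empty. The following is a minimal standard-library sketch of that rule, not the PR's code. `TheadRows` and `parse_header_rows` are hypothetical names, and the whitespace handling is a simplification of what the pandas helpers do.

```python
from html.parser import HTMLParser


class TheadRows(HTMLParser):
    """Collect the text of every cell in each <tr> of the first <thead>."""

    def __init__(self):
        super().__init__()
        self.rows = []           # one list of cell texts per <tr>
        self._in_thead = False
        self._done = False       # set once the first <thead> has closed
        self._cell = None        # text chunks of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "thead" and not self._done:
            self._in_thead = True
        elif self._in_thead and tag == "tr":
            self.rows.append([])
        elif self._in_thead and tag in ("th", "td") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Simplified whitespace normalisation (an assumption of this sketch).
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "thead" and self._in_thead:
            self._in_thead = False
            self._done = True


def parse_header_rows(html):
    """Return the header rows, skipping rows whose cells are all empty."""
    parser = TheadRows()
    parser.feed(html)
    return [row for row in parser.rows if any(cell != "" for cell in row)]


html = """
<table>
  <thead>
    <tr><th></th><th>Age</th><th>Party</th></tr>
    <tr><th>Name</th><th></th><th></th></tr>
    <tr><th></th><th></th><th></th></tr>
  </thead>
  <tbody>
    <tr><td>Hillary</td><td>68</td><td>D</td></tr>
  </tbody>
</table>
"""
header_rows = parse_header_rows(html)
```

For the sample table, `header_rows` holds two rows; the third `<tr>`, whose cells are all blank, is dropped, and the `<tbody>` cells are ignored.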
@@ -615,19 +626,17 @@ def _data_to_frame(**kwargs):
     head, body, foot = kwargs.pop('data')
     header = kwargs.pop('header')
     kwargs['skiprows'] = _get_skiprows(kwargs['skiprows'])
-
     if head:
-        body = [head] + body
-
+        rows = lrange(len(head))
+        body = head + body
         if header is None: # special case when a table has <th> elements
-            header = 0
+            header = 0 if rows == [0] else rows

     if foot:
         body += [foot]

     # fill out elements of body that are "ragged"
     _expand_elements(body)
-
     tp = TextParser(body, header=header, **kwargs)
     df = tp.read()
     return df
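
The `_data_to_frame` change treats `head` as a list of rows rather than a single row: the rows are prepended to `body`, and when the caller passed no `header`, the default becomes `0` for a single header row or the list of row positions for several. A small sketch of just that selection logic follows. `merge_head` is a hypothetical name, and `list(range(...))` stands in for the `lrange` compatibility helper used in the diff.

```python
def merge_head(head, body, header=None):
    """Prepend parsed header rows to body and pick the header argument.

    head is a list of header rows (each a list of cell strings),
    body is a list of data rows, header is the caller's choice or None.
    """
    if head:
        rows = list(range(len(head)))
        body = head + body
        if header is None:
            # One header row -> plain index; several rows -> multi-level.
            header = 0 if rows == [0] else rows
    return body, header


single = merge_head([["Name", "Age"]], [["Hillary", "68"]])
multi = merge_head([["", "Age"], ["Name", ""]], [["Hillary", "68"]])
explicit = merge_head([["", "Age"], ["Name", ""]], [["Hillary", "68"]], header=1)
no_head = merge_head([], [["Hillary", "68"]])
```

An explicit `header` value is passed through unchanged, and a table without a `<thead>` leaves both `body` and `header` as they were.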
12 changes: 12 additions & 0 deletions pandas/tests/io/test_html.py
@@ -760,6 +760,18 @@ def test_keep_default_na(self):
         html_df = read_html(html_data, keep_default_na=True)[0]
         tm.assert_frame_equal(expected_df, html_df)

+    def test_multiple_header_rows(self):
+        # Issue #13434
+        expected_df = DataFrame(data=[("Hillary", 68, "D"),
+                                      ("Bernie", 74, "D"),
+                                      ("Donald", 69, "R")])
+        expected_df.columns = [["Unnamed: 0_level_0", "Age", "Party"],
+                               ["Name", "Unnamed: 1_level_1",
+                                "Unnamed: 2_level_1"]]
+        html = expected_df.to_html(index=False)
+        html_df = read_html(html, )[0]
+        tm.assert_frame_equal(expected_df, html_df)
+

 def _lang_enc(filename):
     return os.path.splitext(os.path.basename(filename))[0].split('_')
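
The expected columns in `test_multiple_header_rows` use labels of the form `Unnamed: <column>_level_<level>`. The sketch below assumes that this is the placeholder pattern given to blank cells in a multi-row header and only mimics the naming; it is not pandas code. `fill_unnamed` is a hypothetical helper introduced for illustration.

```python
def fill_unnamed(header_rows):
    """Replace blank header cells with 'Unnamed: <column>_level_<level>'."""
    filled = []
    for level, row in enumerate(header_rows):
        filled.append([
            cell if cell != "" else "Unnamed: {}_level_{}".format(col, level)
            for col, cell in enumerate(row)
        ])
    return filled


levels = fill_unnamed([["", "Age", "Party"], ["Name", "", ""]])
# Pair the levels column by column, as a two-level column index would.
column_tuples = list(zip(*levels))
```

Here `levels` reproduces the two lists assigned to `expected_df.columns` in the test, and `column_tuples` shows how the two header rows combine into one label per column.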