ENH:read_html() handles tables with multiple header rows #13434 #15242

brianhuey · 2017-01-27T18:08:49Z

closes read_html() doesn't handle tables with multiple header rows #13434
2 tests added / passed
Multiple tr rows within a thead are now parsed, creating a multiindex header.

codecov-io · 2017-01-27T22:04:08Z

Codecov Report

Merging #15242 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15242      +/-   ##
==========================================
- Coverage   90.97%   90.95%   -0.02%     
==========================================
  Files         143      143              
  Lines       49429    49442      +13     
==========================================
+ Hits        44970    44972       +2     
- Misses       4459     4470      +11

Flag	Coverage Δ
#multiple	`88.71% <100%> (-0.01%)`	⬇️
#single	`40.69% <0%> (-0.12%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/html.py	`84.81% <100%> (+0.32%)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/common.py	`90.96% <0%> (-0.34%)`	⬇️
pandas/core/frame.py	`97.56% <0%> (-0.1%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update abf1697...fc1c80e.

jreback

pls add a note in whatsnew (enhancements for 0.20)

pls update doc-string in pd.read_html and docs in io.rst (mini-example)

is there a real-world (shorter) url that this can be tested on (with the network decorator)?

jreback · 2017-02-01T20:48:43Z

pandas/io/tests/test_html.py

@@ -760,6 +760,17 @@ def test_keep_default_na(self):
        html_df = read_html(html_data, keep_default_na=True)[0]
        tm.assert_frame_equal(expected_df, html_df)

+    def test_multiple_header_rows(self):
+        expected_df = DataFrame(data=[("Hillary", 68, "D"),


can you add the issue number as a comment here

jreback · 2017-02-01T20:48:53Z

pandas/io/tests/test_html.py

@@ -869,6 +880,17 @@ def test_computer_sales_page(self):
        data = os.path.join(DATA_PATH, 'computer_sales_page.html')
        self.read_html(data, header=[0, 1])

+    def test_multiple_header_rows(self):
+        expected_df = DataFrame(data=[("Hillary", 68, "D"),


brianhuey · 2017-02-01T22:36:40Z

I looked for a good real-world url to write a test around, but a lot of the examples I could find seemed like they were from websites where the data may change over time, causing the test to fail. If you want I could add an html file to /io/tests/data and write something around that.

jreback · 2017-02-01T22:57:47Z

hmm, you are not specifying the header here, so this can infer automatically (that it's a multi-index)?

brianhuey · 2017-02-01T23:07:40Z

Yes, for instances where a <table> contains a <thead> and multiple <tr>, _data_to_frame() infers all header rows whereas currently it only infers the first row, omitting the rest from the dataframe.
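
As a rough illustration of what "multiple <tr> inside a <thead>" means for the parser, the sketch below collects header rows with the standard library's html.parser. It is not the pandas implementation (read_html delegates to lxml or BeautifulSoup); the class and function names are made up for this example, and the sample table is only an assumption about what such input looks like.

```python
from html.parser import HTMLParser


class TheadRows(HTMLParser):
    """Collect cell text, one list per <tr> found inside <thead>."""

    def __init__(self):
        super().__init__()
        self.rows = []           # one list of cell strings per header row
        self._in_thead = False
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif self._in_thead and tag == "tr":
            self.rows.append([])
        elif self._in_thead and self.rows and tag in ("th", "td"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a header cell is recorded.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def header_rows(html):
    """Hypothetical helper: return the header rows of the first-seen <thead>s."""
    parser = TheadRows()
    parser.feed(html)
    return parser.rows


html = """
<table>
  <thead>
    <tr><th>Zero</th><th>Age</th><th>Party</th></tr>
    <tr><th>Name</th><th>One</th><th>Two</th></tr>
  </thead>
  <tbody>
    <tr><td>Hillary</td><td>68</td><td>D</td></tr>
  </tbody>
</table>
"""

rows = header_rows(html)
# Pair the rows column by column, the way a two-level column index would.
columns = list(zip(*rows))
```

With both header rows kept, each column is identified by a pair such as ("Age", "One") rather than by the first row alone.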

jreback · 2017-02-01T23:14:56Z

@brianhuey ok, please make a note of this in doc-string and show an example in io.rst.

do we ever need to turn this off? (e.g. false positives)?

brianhuey · 2017-02-02T00:12:41Z

Currently there is no way to turn off the automatic parsing of <thead> elements into column headers. Since this enhancement only makes that parsing more complete, I would say that adding a parameter might not be necessary. It could be manually turned off by converting the multiindex to a dataframe and concatenating it with the read_html() dataframe.

jreback · 2017-02-16T17:53:49Z

can you rebase (tests have moved).

@jorisvandenbossche @TomAugspurger comments?

jreback · 2017-03-23T13:17:24Z

@brianhuey looking at the original issue. This does work, though still lots of unnamed levels.

Is this expected?

In [1]: df = pd.DataFrame(
   ...:     columns=["Name", "Age", "Party"], 
   ...:     data = [("Hillary", 68, "D"), ("Bernie", 74, "D"), ("Donald", 69, "R")])
   ...: df = df.set_index("Name")
   ...: html = df.to_html()
   ...: 

In [2]: pd.read_html(html)[0]
Out[2]: 
  Unnamed: 0_level_0                Age              Party
                Name Unnamed: 1_level_1 Unnamed: 2_level_1
0            Hillary                 68                  D
1             Bernie                 74                  D
2             Donald                 69                  R

brianhuey · 2017-03-23T15:58:31Z

Yes, I could rename the unnamed levels to a blank string, but I was trying to maintain consistency with the original HTML parser which names blank column headers to 'Unnamed'. I assume it was done this way originally to remove any ambiguity amongst column names. I'm open to any suggestions on how to make this better.

jreback · 2017-03-23T19:37:33Z

@brianhuey I don't think that's right actually.

In [2]: df
Out[2]: 
         Age Party
Name              
Hillary   68     D
Bernie    74     D
Donald    69     R

This is not a multi-line header. Rather Name is the index name. I think the to_html() is correct as well.

This is a case that you can actually detect (and we do in csv reading and such).
IOW, if you have ALL empty string levels, then you don't actually have a multi-line header.
In fact I would make the user specify it anyhow (IOW, header=None means only a single header).

header=[0,1] is very explicit and that is the ONLY way to create a multi-line header.

So I would say that you are actually reading a single line header, but with an index label. Which I can't believe is actually that common (except if we are generating it).
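
One way to read the detection rule described above is sketched below in plain Python. The helper name and the exact condition are assumptions made for illustration (the last header row is empty except for its first cell, and the cell above that one is empty); this is not how pandas or read_csv implements it.

```python
def split_index_label(head):
    """Return (header_rows, index_name) for a list of header rows.

    Hypothetical helper: if the last row is empty everywhere except its
    first cell, and the first cell of the row above it is empty, treat
    that last row as an index label instead of a second header level.
    """
    # Need at least two non-empty rows to compare.
    if len(head) < 2 or not all(head):
        return head, None
    *upper, last = head
    label, rest = last[0], last[1:]
    looks_like_label = bool(label) and not any(rest) and upper[-1][0] == ""
    if looks_like_label:
        return upper, label
    return head, None


# Header rows as produced by to_html() for a frame indexed by "Name".
index_style = [["", "Age", "Party"], ["Name", "", ""]]
# Header rows of a table that really has two header levels.
two_level = [["Zero", "Age", "Party"], ["Name", "One", "Two"]]

index_result = split_index_label(index_style)
two_level_result = split_index_label(two_level)
```

Under these assumptions the first table yields a single header row plus the index name "Name", while the second is left as a two-row header.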

brianhuey · 2017-03-23T19:57:52Z

Correct, the test DataFrame has index 'Name' and cols 'Age' and 'Party'. When this is converted to HTML it becomes a table with headers that look something like this:
<thead>
<tr>
<th></th><th>Age</th><th>Party</th>
</tr>
<tr>
<th>Name</th><th></th><th></th>
</tr>
</thead>
Which is essentially the multi-line HTML table header described in https://github.com/pandas-dev/pandas/issues/13434. You could imagine a similar table that looks like this:
<thead>
<tr>
<th></th><th>Age</th><th>Party</th><th>Party</th>
</tr>
<tr>
<th>Name</th><th></th><th></th><th></th>
</tr>
</thead>
My understanding is that selecting a specific 'Party' column becomes ambiguous unless you select by position. This was my rationale for using 'Unnamed' rather than making it look better by inserting blank strings.
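
The ambiguity argument can be checked with a small plain-Python sketch. It combines the two header rows into one tuple per column, once leaving blanks as empty strings and once filling them with a position-based placeholder. The helper name and the placeholder format are assumptions modelled on the 'Unnamed: 1_level_1' labels shown earlier in this thread, not a pandas API.

```python
def label_columns(head, blank_placeholder=None):
    """Combine header rows into one tuple per column.

    If blank_placeholder is given, each empty cell is replaced by the
    placeholder formatted with the cell's column and row (level) position.
    """
    labelled = []
    for level, row in enumerate(head):
        new_row = []
        for col, cell in enumerate(row):
            if cell == "" and blank_placeholder is not None:
                cell = blank_placeholder.format(col=col, level=level)
            new_row.append(cell)
        labelled.append(new_row)
    return list(zip(*labelled))


head = [["", "Age", "Party", "Party"],
        ["Name", "", "", ""]]

blank = label_columns(head)
named = label_columns(head, "Unnamed: {col}_level_{level}")

# A set drops repeated tuples, so a shorter set means duplicate labels.
has_duplicates_blank = len(set(blank)) < len(blank)
has_duplicates_named = len(set(named)) < len(named)
```

With blank strings both 'Party' columns end up as ("Party", ""), whereas the position-based placeholder keeps every column label distinct.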

jreback · 2017-03-23T20:22:50Z

@brianhuey can you post rendered versions of those as well (in the issue)

brianhuey · 2017-03-23T20:33:13Z

	Age	Party
Name
Hillary	68	D
Bernie	74	D
Donald	69	R

	Age	Party	Party
Name
Hillary	68	D	B
Bernie	74	D	C
Donald	69	R	F

jreback · 2017-03-23T20:50:46Z

@brianhuey ok I think you can then treat this like we do in read_csv. Assuming that header is an int (or None), and NOT a list. If you find the pattern like you do above then you will have a named Index.

brianhuey · 2017-03-23T21:53:27Z

In the case of a multi-row HTML header, wouldn't we want to pass a list of header rows? For instance, in the function below, head contains our two HTML header rows: [['','Age','Party'], ['Name','','']]. These are appended to the dataframe, and then specified (header=[0,1]) in TextParser.

def _data_to_frame(**kwargs):
    head, body, foot = kwargs.pop('data')
    header = kwargs.pop('header')
    kwargs['skiprows'] = _get_skiprows(kwargs['skiprows'])
    if head:
        rows = lrange(len(head))
        body = head + body
        if header is None:  # special case when a table has <th> elements
            header = 0 if rows == [0] else rows

    if foot:
        body += [foot]

    # fill out elements of body that are "ragged"
    _expand_elements(body)
    tp = TextParser(body, header=header, **kwargs)
    df = tp.read()
    return df

jreback · 2017-03-23T21:56:31Z

@brianhuey

I wouldn't expect to need to pass anything for header in this case (header=0 is equivalent). I think that it should/can infer that the Name row is actually an index label. The logic is not too complex and we do this in csv parsing. (in fact we accept both cases, IOW a named Index and no row at all between the header and the data).

This IMHO is NOT a multi-header at all.

brianhuey · 2017-03-23T22:05:15Z

But what about HTML tables where there are actually two rows of headers (i.e. two <tr> within a <thead> tag)? The original issue raised was that pandas does not parse these multi-header row HTML tables correctly. If we are dealing with an HTML table like the one below, I would expect the parser would not automatically set the "Name" column to be the index, but I would expect the parser to identify the first two rows of the table as column headers and set those accordingly.

Zero	Age	Party
Name	One	Two
Hillary	68	D
Bernie	74	D
Donald	69	R

jreback · 2017-03-23T22:18:40Z

@brianhuey so for that one, I would say that you would have to specify header=[0,1]

jorisvandenbossche · 2017-03-23T23:34:34Z

I am not sure if it is worth adding this complexity to read_html of inferring index names due to strange formatting of the headers. That is something pandas-specific, and read_html is not really regarded as something you store data to later read it again (as opposed to csv).
I personally think that the last example of @brianhuey (two header levels) is indeed something that read_html could do automatically to set the two column rows as the MultiIndexed columns, as this is actually specified by the html source.

jreback · 2017-03-23T23:57:24Z

@brianhuey ok with @jorisvandenbossche suggestion. What we want is a simple / understandable approach that is as little magical as possible, but at the same time can get the job done.

brianhuey · 2017-03-24T00:05:01Z

I totally agree. The last table example is the one that I had in mind when I started out. Perhaps the test example confuses things, because indexing the 'Name' column is only done in order to generate a two-row HTML header when df.to_html() is called. I could easily add the last table example as the test case instead.

AashitaK · 2017-03-24T01:32:41Z

I am having a hard time reading the table with multi-indexed columns from census India using read_html. I just found the issue reported here and the discussion about how the original example reported was not multi-indexed, so I thought it might be useful to mention this example.

brianhuey · 2017-03-24T15:49:53Z

@AashitaK the table you linked is a little more complex than the ones I'm trying to handle here as it has multiple column headers but they span a variable number of columns. I am interested in tackling this problem next.

AashitaK · 2017-03-25T04:42:46Z

@brianhuey Thanks.

jreback · 2017-03-29T19:38:54Z

@brianhuey ?

brianhuey · 2017-03-29T19:46:16Z

I believe the code achieves what @jorisvandenbossche suggests. It reads multiple header rows from HTML syntax, and parses them as a multiindex column, producing a dataframe with columns that match the structure of an HTML table. I'm open to any suggestions on where to improve how this is handled (i.e. the 'Unnamed: level' thing, specifying multiple headers).

jreback · 2017-03-29T20:33:02Z

pandas/tests/io/test_html.py

@@ -869,6 +881,18 @@ def test_computer_sales_page(self):
        data = os.path.join(DATA_PATH, 'computer_sales_page.html')
        self.read_html(data, header=[0, 1])

+    def test_multiple_header_rows(self):
+        # Issue #13434
+        expected_df = DataFrame(data=[("Hillary", 68, "D"),


is this test in here twice?

Yes, it tests it twice: once using the BeautifulSoup parser and again with lxml.

whoosh we should never actually duplicate the test code (not sure if this is existing)

rather use a base class (or parametrization)

if this is the current style can u pls create an issue to fix this (if not pls fix this!)

I see, I misunderstood what was going on there. I'll fix.
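
The base-class approach suggested above could look roughly like the following. This is a generic unittest sketch, not the layout of the pandas test suite: parse_header_cells is a hypothetical stand-in for the function under test, and only the flavor names "bs4" and "lxml" correspond to real read_html options.

```python
import unittest


def parse_header_cells(row, flavor):
    """Stand-in for the function under test (hypothetical, not a pandas API)."""
    return [cell.strip() for cell in row.split("|")]


class HeaderTestsMixin:
    """Tests written once; subclasses only choose the parser flavor."""

    flavor = None

    def test_cells_are_stripped(self):
        result = parse_header_cells(" Age | Party ", self.flavor)
        self.assertEqual(result, ["Age", "Party"])


class TestWithBS4(HeaderTestsMixin, unittest.TestCase):
    flavor = "bs4"


class TestWithLxml(HeaderTestsMixin, unittest.TestCase):
    flavor = "lxml"


def run_suite():
    """Run both concrete test classes and return the unittest result."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for case in (TestWithBS4, TestWithLxml):
        suite.addTests(loader.loadTestsFromTestCase(case))
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the mixin does not inherit from unittest.TestCase, the shared test body runs once per concrete subclass and is never collected on its own.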

jreback · 2017-03-29T23:28:35Z

thanks @brianhuey

followups always welcome!

brianhuey · 2017-03-29T23:31:07Z

I appreciate you working with me through this. First time contributing to an open source project!

jreback · 2017-03-29T23:32:26Z

@brianhuey you did great!

keep em coming!

…13434

closes pandas-dev#13434

Author: Brian <[email protected]>
Author: S. Brian Huey <[email protected]>

Closes pandas-dev#15242 from brianhuey/thead-improvement and squashes the following commits:

fc1c80e [S. Brian Huey] Merge branch 'master' into thead-improvement
b54aa0c [Brian] removed duplicate test case
6ae2860 [Brian] updated docstring and io.rst
41fe8cd [Brian] review changes
873ea58 [Brian] switched from range to lrange
cd70225 [Brian] ENH:read_html() handles tables with multiple header rows pandas-dev#13434

jreback added the IO HTML (read_html, to_html, Styler.apply, Styler.applymap) and Enhancement labels Jan 27, 2017

jreback requested changes Feb 1, 2017

brianhuey force-pushed the thead-improvement branch from 6cd6aaf to ff59777 Compare February 1, 2017 22:34

brianhuey force-pushed the thead-improvement branch from ff59777 to 93022fd Compare February 2, 2017 00:36

jreback added this to the 0.20.0 milestone Feb 16, 2017

brianhuey added 4 commits February 16, 2017 14:59

ENH:read_html() handles tables with multiple header rows pandas-dev#1…

cd70225

…3434

switched from range to lrange

873ea58

review changes

41fe8cd

updated docstring and io.rst

6ae2860

brianhuey force-pushed the thead-improvement branch from 93022fd to 6ae2860 Compare February 16, 2017 23:10

jreback reviewed Mar 29, 2017

removed duplicate test case

b54aa0c

brianhuey force-pushed the thead-improvement branch from 8e7b03e to b54aa0c Compare March 29, 2017 21:09

Merge branch 'master' into thead-improvement

fc1c80e

jreback approved these changes Mar 29, 2017

jreback closed this in 0ab0813 Mar 29, 2017

adamhooper mentioned this pull request Jun 21, 2018

read_html: Handle colspan and rowspan #21487

Merged

5 tasks
