ENH:read_html() handles tables with multiple header rows #13434 #15242

Closed
wants to merge 6 commits into from

Conversation

brianhuey
Contributor

@brianhuey brianhuey commented Jan 27, 2017

@jreback jreback added IO HTML read_html, to_html, Styler.apply, Styler.applymap Enhancement labels Jan 27, 2017
@codecov-io

codecov-io commented Jan 27, 2017

Codecov Report

Merging #15242 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15242      +/-   ##
==========================================
- Coverage   90.97%   90.95%   -0.02%     
==========================================
  Files         143      143              
  Lines       49429    49442      +13     
==========================================
+ Hits        44970    44972       +2     
- Misses       4459     4470      +11
Flag        Coverage Δ
#multiple   88.71% <100%> (-0.01%) ⬇️
#single     40.69% <0%> (-0.12%) ⬇️

Impacted Files          Coverage Δ
pandas/io/html.py       84.81% <100%> (+0.32%) ⬆️
pandas/io/gbq.py        25% <0%> (-58.34%) ⬇️
pandas/core/common.py   90.96% <0%> (-0.34%) ⬇️
pandas/core/frame.py    97.56% <0%> (-0.1%) ⬇️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update abf1697...fc1c80e. Read the comment docs.

Contributor

@jreback jreback left a comment

pls add a note in whatsnew (enhancements for 0.20)

pls update doc-string in pd.read_html and docs in io.rst (mini-example)

is there a real-world (shorter) url that this can be tested on (with the network decorator)?

@@ -760,6 +760,17 @@ def test_keep_default_na(self):
        html_df = read_html(html_data, keep_default_na=True)[0]
        tm.assert_frame_equal(expected_df, html_df)

    def test_multiple_header_rows(self):
        expected_df = DataFrame(data=[("Hillary", 68, "D"),
Contributor

can you add the issue number as a comment here

@@ -869,6 +880,17 @@ def test_computer_sales_page(self):
        data = os.path.join(DATA_PATH, 'computer_sales_page.html')
        self.read_html(data, header=[0, 1])

    def test_multiple_header_rows(self):
        expected_df = DataFrame(data=[("Hillary", 68, "D"),
Contributor

same here

@brianhuey
Contributor Author

I looked for a good real-world url to write a test around, but a lot of the examples I could find seemed like they were from websites where the data may change over time, causing the test to fail. If you want I could add an html file to /io/tests/data and write something around that.

@jreback
Contributor

jreback commented Feb 1, 2017

hmm, you are not specifying the header here, so this can be inferred automatically (that it's a multi-index)?

@brianhuey
Contributor Author

Yes. When a <table> contains a <thead> with multiple <tr> rows, _data_to_frame() now infers all of them as header rows, whereas currently it infers only the first row and omits the rest from the dataframe.
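
To make the "multiple <tr> inside <thead>" case concrete, here is a minimal sketch that collects the header rows of such a table using only the standard library. The class name TheadRows is hypothetical and this is not how pandas parses HTML (it relies on lxml or BeautifulSoup); it only illustrates what "all header rows" means for the table discussed in this thread.

```python
from html.parser import HTMLParser


class TheadRows(HTMLParser):
    """Collect the text of every <th>/<td> cell inside <thead>, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr> in <thead>
        self._in_thead = False
        self._cell = None       # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif self._in_thead and tag == "tr":
            self.rows.append([])
        elif self._in_thead and tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        # Only keep text that belongs to a header cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag in ("th", "td") and self._cell is not None:
            # Empty cells such as <th></th> become empty strings.
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


html = """
<table>
  <thead>
    <tr><th></th><th>Age</th><th>Party</th></tr>
    <tr><th>Name</th><th></th><th></th></tr>
  </thead>
  <tbody>
    <tr><td>Hillary</td><td>68</td><td>D</td></tr>
  </tbody>
</table>
"""

parser = TheadRows()
parser.feed(html)
header_rows = parser.rows
print(header_rows)
```

With two rows collected, a reader that only uses the first one would drop the "Name" cell entirely, which is the behaviour this PR changes.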

@jreback
Contributor

jreback commented Feb 1, 2017

@brianhuey ok, please make a note of this in doc-string and show an example in io.rst.

do we ever need to turn this off? (e.g. false positives)?

@brianhuey
Contributor Author

Currently there is no way to turn off the automatic parsing of <thead> elements into column headers. Since this enhancement only makes that parsing more complete, I would say that adding a parameter might not be necessary. It could be manually turned off by converting the multiindex to a dataframe and concatenating it with the read_html() dataframe.
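
A rough sketch of that manual workaround, assuming the parsed frame already has MultiIndex columns. The frame is built by hand here rather than with read_html() so that no HTML parser is needed, and the values are illustrative only.

```python
import pandas as pd

# Stand-in for a frame returned by read_html() with a two-row header.
cols = pd.MultiIndex.from_arrays(
    [["Zero", "Age", "Party"], ["Name", "One", "Two"]]
)
df = pd.DataFrame(
    [["Hillary", 68, "D"], ["Bernie", 74, "D"]], columns=cols
)

# Turn each header level into an ordinary row: to_frame() gives one column
# per level, so the transpose has one row per level.
header_rows = df.columns.to_frame(index=False).T
header_rows.columns = range(header_rows.shape[1])

# Give the body plain positional column labels so the pieces line up.
body = df.copy()
body.columns = range(body.shape[1])

flat = pd.concat([header_rows, body], ignore_index=True)
print(flat)
```

The result has no inferred header at all: the former header levels are the first two data rows, and the columns are labelled by position.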

@jreback jreback added this to the 0.20.0 milestone Feb 16, 2017
@jreback
Contributor

jreback commented Feb 16, 2017

can you rebase (tests have moved).

@jorisvandenbossche @TomAugspurger comments?

@jreback
Contributor

jreback commented Mar 23, 2017

@brianhuey looking at the original issue. This does work, though still lots of unnamed levels.

Is this expected?

In [1]: df = pd.DataFrame(
   ...:     columns=["Name", "Age", "Party"], 
   ...:     data = [("Hillary", 68, "D"), ("Bernie", 74, "D"), ("Donald", 69, "R")])
   ...: df = df.set_index("Name")
   ...: html = df.to_html()
   ...: 

In [2]: pd.read_html(html)[0]
Out[2]: 
  Unnamed: 0_level_0                Age              Party
                Name Unnamed: 1_level_1 Unnamed: 2_level_1
0            Hillary                 68                  D
1             Bernie                 74                  D
2             Donald                 69                  R
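
To see where the second header level comes from, it may help to look at the <thead> that to_html() writes for a frame with a named index. This sketch only inspects the generated markup and does not call read_html(), so it needs no HTML parser.

```python
import pandas as pd

df = pd.DataFrame(
    columns=["Name", "Age", "Party"],
    data=[("Hillary", 68, "D"), ("Bernie", 74, "D"), ("Donald", 69, "R")],
).set_index("Name")

html = df.to_html()

# Cut out the header section and count its rows.
thead = html.split("<thead>")[1].split("</thead>")[0]
n_header_rows = thead.count("<tr")
print(thead)
print(n_header_rows)
```

The index name is written as a header row of its own, which is why a reader that treats every <thead> row as a column level reports two levels here.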

@brianhuey
Contributor Author

Yes, I could rename the unnamed levels to a blank string, but I was trying to maintain consistency with the original HTML parser, which names blank column headers 'Unnamed'. I assume it was done this way originally to remove any ambiguity among column names. I'm open to any suggestions on how to make this better.

@jreback
Contributor

jreback commented Mar 23, 2017

@brianhuey I don't think that's right actually.

In [2]: df
Out[2]: 
         Age Party
Name              
Hillary   68     D
Bernie    74     D
Donald    69     R

This is not a multi-line header. Rather Name is the index name. I think the to_html() is correct as well.

This is a case that you can actually detect (and we do in csv reading and such).
IOW if you have ALL empty string levels then you don't actually have a multi-line header.
In fact I would make the user specify it anyhow (IOW, header=None means only a single header).

header=[0,1] is very explicit and that is the ONLY way to create a multi-line header.

So I would say that you are actually reading a single line header, but with an index label. Which I can't believe is actually that common (except if we are generating it).
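
The detection described here could look roughly like the following. This is a hypothetical sketch of the rule as stated in the comment (a second header row that is empty except for its first cell is an index label), not code from pandas; the function name split_index_label_row is made up for illustration.

```python
def split_index_label_row(head):
    """Return (column_names, index_name) when `head` is a single header row
    followed by an index-label row; return None otherwise."""
    if len(head) != 2:
        return None
    first, second = head
    if not first or len(first) != len(second):
        return None
    # The label row names the index in its first cell and is blank elsewhere.
    label_only = second[0] != "" and all(cell == "" for cell in second[1:])
    if first[0] == "" and label_only:
        return first[1:], second[0]
    return None


named_index = split_index_label_row(
    [["", "Age", "Party"], ["Name", "", ""]]
)
true_multi = split_index_label_row(
    [["Zero", "Age", "Party"], ["Name", "One", "Two"]]
)
print(named_index)
print(true_multi)
```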

@brianhuey
Contributor Author

Correct, the test DataFrame has index 'Name' and cols 'Age' and 'Party'. When this is converted to HTML it becomes a table with headers that look something like this:
<thead>
<tr>
<th></th><th>Age</th><th>Party</th>
</tr>
<tr>
<th>Name</th><th></th><th></th>
</tr>
</thead>
Which is essentially the multi-line HTML table header described in https://github.com/pandas-dev/pandas/issues/13434. You could imagine a similar table that looks like this:
<thead>
<tr>
<th></th><th>Age</th><th>Party</th><th>Party</th>
</tr>
<tr>
<th>Name</th><th></th><th></th><th></th>
</tr>
</thead>
My understanding is that selecting a specific 'Party' column becomes ambiguous unless you select by position. This was my rationale for using 'Unnamed' rather than making it look better by inserting blank strings.
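
The ambiguity can be sketched with a hand-built frame. The second-level names follow the 'Unnamed: N_level_1' pattern seen in the read_html() output earlier in this thread; the exact label for the fourth column is an assumption extrapolated from that pattern.

```python
import pandas as pd

rows = [
    ["Hillary", 68, "D", "B"],
    ["Bernie", 74, "D", "C"],
    ["Donald", 69, "R", "F"],
]

# Flat, duplicated names: selecting "Party" returns both columns.
flat = pd.DataFrame(rows, columns=["Name", "Age", "Party", "Party"])
both = flat["Party"]
by_position = flat.iloc[:, 3]

# Two levels with distinct placeholders: each column has a unique key.
two_level = flat.copy()
two_level.columns = pd.MultiIndex.from_arrays(
    [
        ["Name", "Age", "Party", "Party"],
        [
            "Unnamed: 0_level_1",
            "Unnamed: 1_level_1",
            "Unnamed: 2_level_1",
            "Unnamed: 3_level_1",
        ],
    ]
)
second_party = two_level[("Party", "Unnamed: 3_level_1")]
print(both.shape)
print(second_party.tolist())
```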

@jreback
Contributor

jreback commented Mar 23, 2017

@brianhuey can you post rendered versions of those as well (in the issue)

@brianhuey
Contributor Author

         Age  Party
Name
Hillary   68  D
Bernie    74  D
Donald    69  R

         Age  Party  Party
Name
Hillary   68  D      B
Bernie    74  D      C
Donald    69  R      F

@jreback
Contributor

jreback commented Mar 23, 2017

@brianhuey ok I think you can then treat this like we do in read_csv. Assuming that header is an int (or None), and NOT a list. If you find the pattern like you do above then you will have a named Index.

@brianhuey
Contributor Author

In the case of a multi-row HTML header, wouldn't we want to pass a list of header rows? For instance, in the function below, head contains our two HTML header rows: [['','Age','Party'], ['Name','','']]. These are prepended to the body rows, and then specified (header=[0,1]) in TextParser.

def _data_to_frame(**kwargs):
    head, body, foot = kwargs.pop('data')
    header = kwargs.pop('header')
    kwargs['skiprows'] = _get_skiprows(kwargs['skiprows'])
    if head:
        rows = lrange(len(head))
        body = head + body
        if header is None:  # special case when a table has <th> elements
            header = 0 if rows == [0] else rows

    if foot:
        body += [foot]

    # fill out elements of body that are "ragged"
    _expand_elements(body)
    tp = TextParser(body, header=header, **kwargs)
    df = tp.read()
    return df
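
The branch that picks the header argument can be isolated into a small function for illustration. infer_header is a hypothetical name; the logic mirrors the `if head:` block quoted above, with the built-in range standing in for lrange, and leaves out everything else (skiprows, footers, TextParser).

```python
def infer_header(head, header=None):
    """Mirror the header inference in the quoted _data_to_frame():
    one <thead> row -> 0, several rows -> list of row positions.
    An explicit `header` is always respected."""
    if head:
        rows = list(range(len(head)))
        if header is None:
            header = 0 if rows == [0] else rows
    return header


single = infer_header([["Name", "Age", "Party"]])
double = infer_header([["", "Age", "Party"], ["Name", "", ""]])
explicit = infer_header([["", "Age", "Party"], ["Name", "", ""]], header=0)
no_thead = infer_header([])
print(single, double, explicit, no_thead)
```

So a two-row <thead> ends up as header=[0, 1] unless the caller passes something else.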

@jreback
Contributor

jreback commented Mar 23, 2017

@brianhuey

I wouldn't expect to need to pass anything for header in this case (header=0 is equivalent). I think that it should/can infer that the Name row is actually an index label. The logic is not too complex and we do this in csv parsing. (In fact we accept both cases, IOW a named Index and no row at all between the header and the data.)

This IMHO is NOT a multi-header at all.

@brianhuey
Contributor Author

But what about HTML tables where there are actually two rows of headers (i.e. two <tr> within a <thead> tag)? The original issue raised was that pandas does not parse these multi-header-row HTML tables correctly. If we are dealing with an HTML table like the one below, I would not expect the parser to automatically set the "Name" column as the index; I would expect it to identify the first two rows of the table as column headers and set those accordingly.

Zero     Age  Party
Name     One  Two
Hillary  68   D
Bernie   74   D
Donald   69   R

@jreback
Contributor

jreback commented Mar 23, 2017

@brianhuey so for that one, I would say that you would have to specify header=[0,1]
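
For reference, this is what an explicit header=[0, 1] produces. The sketch uses read_csv on the same table written as CSV, because read_csv accepts the same list form and needs no HTML parser; using it as a stand-in for read_html() is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

csv_text = """Zero,Age,Party
Name,One,Two
Hillary,68,D
Bernie,74,D
Donald,69,R
"""

# Rows 0 and 1 both become column levels.
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])
print(df.columns.tolist())
print(df[("Age", "One")].tolist())
```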

@jorisvandenbossche
Member

I am not sure if it is worth adding this complexity to read_html of inferring index names due to strange formatting of the headers. That is something pandas-specific, and read_html is not really regarded as something you store data to later read it again (as opposed to csv).
I personally think that the last example of @brianhuey (two header levels) is indeed something that read_html could do automatically to set the two column rows as the MultiIndexed columns, as this is actually specified by the html source.

@jreback
Contributor

jreback commented Mar 23, 2017

@brianhuey ok with @jorisvandenbossche's suggestion. What we want is a simple / understandable approach that is as non-magical as possible, but at the same time gets the job done.

@brianhuey
Contributor Author

I totally agree. The last table example is the one that I had in mind when I started out. Perhaps the test example confuses things, because indexing the 'Name' column is only done in order to generate a two-row HTML header when df.to_html() is called. I could easily add the last table example as the test case instead.

@AashitaK

I am having a hard time reading the table with multi-indexed columns from census India using read_html. I just found the issue reported here and the discussion about how the original example reported was not multi-indexed, so I thought it might be useful to mention this example.

@brianhuey
Contributor Author

@AashitaK the table you linked is a little more complex than the ones I'm trying to handle here as it has multiple column headers but they span a variable number of columns. I am interested in tackling this problem next.

@AashitaK

@brianhuey Thanks.

@jreback
Contributor

jreback commented Mar 29, 2017

@brianhuey ?

@brianhuey
Contributor Author

I believe the code achieves what @jorisvandenbossche suggests. It reads multiple header rows from HTML syntax, and parses them as a multiindex column, producing a dataframe with columns that match the structure of an HTML table. I'm open to any suggestions on where to improve how this is handled (i.e. the 'Unnamed: level' thing, specifying multiple headers).

@@ -869,6 +881,18 @@ def test_computer_sales_page(self):
        data = os.path.join(DATA_PATH, 'computer_sales_page.html')
        self.read_html(data, header=[0, 1])

    def test_multiple_header_rows(self):
        # Issue #13434
        expected_df = DataFrame(data=[("Hillary", 68, "D"),
Contributor

is this test in here twice?

Contributor Author

Yes, it tests it twice: once using the BeautifulSoup parser and again with lxml.

Contributor

whoosh we should never actually duplicate the test code (not sure if this is existing)

rather use a base class (or parametrization)

if this is the current style can u pls create an issue to fix this (if not pls fix this!)
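
One way to avoid the duplication is a shared mixin whose subclasses only pick the parser flavor. The sketch below uses unittest from the standard library; the class names and the read_header_rows stand-in are hypothetical (a real test would call read_html with the chosen flavor instead), so this shows the structure only.

```python
import unittest


class MultiHeaderCases:
    """Shared test body; subclasses only choose the parser flavor."""

    flavor = None  # set by each subclass

    def read_header_rows(self, rows):
        # Hypothetical stand-in for self.read_html(..., flavor=self.flavor):
        # it pairs up the header levels column by column.
        return list(zip(*rows))

    def test_multiple_header_rows(self):
        rows = [["Zero", "Age", "Party"], ["Name", "One", "Two"]]
        expected = [("Zero", "Name"), ("Age", "One"), ("Party", "Two")]
        self.assertIsNotNone(self.flavor)
        self.assertEqual(self.read_header_rows(rows), expected)


class TestBeautifulSoup(MultiHeaderCases, unittest.TestCase):
    flavor = "bs4"


class TestLxml(MultiHeaderCases, unittest.TestCase):
    flavor = "lxml"
```

The test body is written once; because the mixin does not inherit from unittest.TestCase, only the two concrete classes are collected.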

Contributor Author

I see, I misunderstood what was going on there. I'll fix.

@jreback jreback closed this in 0ab0813 Mar 29, 2017
@jreback
Contributor

jreback commented Mar 29, 2017

thanks @brianhuey

followups always welcome!

@brianhuey
Contributor Author

I appreciate you working with me through this. First time contributing to an open source project!

@jreback
Contributor

jreback commented Mar 29, 2017

@brianhuey you did great!

keep em coming!

mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
…13434

closes pandas-dev#13434

Author: Brian <[email protected]>
Author: S. Brian Huey <[email protected]>

Closes pandas-dev#15242 from brianhuey/thead-improvement and squashes the following commits:

fc1c80e [S. Brian Huey] Merge branch 'master' into thead-improvement
b54aa0c [Brian] removed duplicate test case
6ae2860 [Brian] updated docstring and io.rst
41fe8cd [Brian] review changes
873ea58 [Brian] switched from range to lrange
cd70225 [Brian] ENH:read_html() handles tables with multiple header rows pandas-dev#13434

Successfully merging this pull request may close these issues.

read_html() doesn't handle tables with multiple header rows
5 participants