ENH:read_html() handles tables with multiple header rows #13434 #15242
Codecov Report
@@ Coverage Diff @@
## master #15242 +/- ##
==========================================
- Coverage 90.97% 90.95% -0.02%
==========================================
Files 143 143
Lines 49429 49442 +13
==========================================
+ Hits 44970 44972 +2
- Misses 4459 4470 +11
Please add a note in whatsnew (enhancements for 0.20).
Please update the doc-string in pd.read_html and the docs in io.rst (mini-example).
Is there a real-world (shorter) URL that this can be tested on (with the network decorator)?
pandas/io/tests/test_html.py
Outdated
@@ -760,6 +760,17 @@ def test_keep_default_na(self):
        html_df = read_html(html_data, keep_default_na=True)[0]
        tm.assert_frame_equal(expected_df, html_df)

    def test_multiple_header_rows(self):
        expected_df = DataFrame(data=[("Hillary", 68, "D"),
Can you add the issue number as a comment here?
pandas/io/tests/test_html.py
Outdated
@@ -869,6 +880,17 @@ def test_computer_sales_page(self):
        data = os.path.join(DATA_PATH, 'computer_sales_page.html')
        self.read_html(data, header=[0, 1])

    def test_multiple_header_rows(self):
        expected_df = DataFrame(data=[("Hillary", 68, "D"),
same here
6cd6aaf
to
ff59777
Compare
I looked for a good real-world URL to write a test around, but most of the examples I could find were from websites where the data may change over time, which would cause the test to fail. If you want, I could add an HTML file to /io/tests/data and write a test around that.
Hmm, you are not specifying the header here, so this can be inferred automatically (that it's a multi-index)?
Yes, for instances where a |
@brianhuey OK, please make a note of this in the doc-string and show an example in io.rst. Do we ever need to turn this off (e.g. for false positives)?
Currently there is no way to turn off the automatic parsing of |
ff59777
to
93022fd
Compare
Can you rebase (tests have moved)? @jorisvandenbossche @TomAugspurger comments?
93022fd
to
6ae2860
Compare
@brianhuey Looking at the original issue: this does work, though there are still lots of unnamed levels. Is this expected?
Yes, I could rename the unnamed levels to a blank string, but I was trying to maintain consistency with the original HTML parser, which names blank column headers 'Unnamed'. I assume it was done this way originally to remove any ambiguity among column names. I'm open to any suggestions on how to make this better.
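To make the trade-off concrete, here is a minimal sketch of the placeholder naming under discussion. It is not the parser's code: `fill_blank_labels` is a hypothetical helper, the two header rows are made up, and the `Unnamed: {i}_level_{level}` format is an assumption modeled on the labels pandas shows for blank multi-level headers.

```python
def fill_blank_labels(header_rows):
    """Replace empty header cells with positional placeholder labels.

    Hypothetical helper; the placeholder format is an assumption
    used for illustration only.
    """
    filled = []
    for level, row in enumerate(header_rows):
        filled.append([
            label if label.strip() else f"Unnamed: {i}_level_{level}"
            for i, label in enumerate(row)
        ])
    return filled


# Made-up header rows: the first row is blank above the 'Name' label,
# the second row is blank below 'Age' and 'Party'.
rows = [["", "Age", "Party"], ["Name", "", ""]]
filled = fill_blank_labels(rows)

# Pair the rows up column by column to get one tuple per column.
columns = list(zip(*filled))
print(columns)
```

Every blank cell gets a label that is unique per position and level, which removes ambiguity among column names at the cost of noisy labels.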
@brianhuey I don't think that's right, actually.
This is not a multi-line header. Rather, this is a case that you can actually detect (and we do in csv reading and such).
So I would say that you are actually reading a single-line header, but with an index label, which I can't believe is actually that common (except if we are generating it).
Correct, the test DataFrame has index 'Name' and columns 'Age' and 'Party'. When this is converted to HTML it becomes a table with headers that look something like this:
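The round trip described in this comment can be reproduced with a short sketch. It assumes a frame shaped like the test's `expected_df`: only the `("Hillary", 68, "D")` row comes from the test in this PR, and the column names follow the description above. It uses `DataFrame.to_html` and plain string slicing, so no HTML parser is needed.

```python
import pandas as pd

# One row taken from the test in this PR; index 'Name' and columns
# 'Age' and 'Party' follow the description in the thread.
df = pd.DataFrame(
    data=[("Hillary", 68, "D")],
    columns=["Name", "Age", "Party"],
).set_index("Name")

html = df.to_html()

# Isolate the <thead> section and count the rows it contains.
thead = html[html.index("<thead>"):html.index("</thead>")]
header_rows = thead.count("<tr")

print(thead)
print("header rows:", header_rows)
```

Because the index has a name, the header section contains two rows: one with the column labels and one with the index label. That is the single-header-plus-index-label case raised in the review, not a table with two levels of column headers.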
@brianhuey Can you post rendered versions of those as well (in the issue)?
@brianhuey ok I think you can then treat this like we do in read_csv. Assuming that |
In the case of a multi-row HTML header, wouldn't we want to pass a list of header rows? For instance, in the function below,
|
I wouldn't expect to need to pass anything in this case for This IMHO is NOT a multi-header at all. |
But what about HTML tables where there are actually two rows of headers (i.e. two <tr> within a <thead> tag)? The original issue raised was that pandas does not parse these multi-header-row HTML tables correctly. If we are dealing with an HTML table like the one below, I would not expect the parser to automatically set the "Name" column as the index, but I would expect it to identify the first two rows of the table as column headers and set those accordingly.
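A minimal, standard-library sketch of that expectation follows. It is not the pandas implementation: the `TheadRows` class is hypothetical, and so is the table (the second header row with `First`, `Years`, and `Affiliation` is invented for illustration). It collects every row inside the header section and pairs the rows up column by column.

```python
from html.parser import HTMLParser

# Hypothetical table: two header rows inside <thead>, one body row.
HTML = """
<table>
  <thead>
    <tr><th>Name</th><th>Age</th><th>Party</th></tr>
    <tr><th>First</th><th>Years</th><th>Affiliation</th></tr>
  </thead>
  <tbody>
    <tr><td>Hillary</td><td>68</td><td>D</td></tr>
  </tbody>
</table>
"""


class TheadRows(HTMLParser):
    """Collect the text of every cell in every <tr> inside <thead>."""

    def __init__(self):
        super().__init__()
        self.in_thead = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self.in_thead = True
        elif tag == "tr" and self.in_thead:
            self.rows.append([])
        elif tag in ("th", "td") and self.in_thead:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "thead":
            self.in_thead = False
        elif tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        # Only text inside a header cell is kept; body cells are ignored.
        if self.in_thead and self.in_cell:
            self.rows[-1][-1] += data.strip()


parser = TheadRows()
parser.feed(HTML)
header_rows = parser.rows

# One tuple per column, one entry per header level.
column_tuples = list(zip(*header_rows))
print(column_tuples)
```

Tuples of this shape are what `pd.MultiIndex.from_tuples` accepts, so each column ends up with one label per header row and no column is promoted to the index.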
@brianhuey so for that one, I would say that you would have to specify |
I am not sure if it is worth adding this complexity to read_html of inferring index names due to strange formatting of the headers. That is something pandas-specific, and |
@brianhuey OK with @jorisvandenbossche's suggestion. What we want is a simple, understandable approach that is as unmagical as possible but can still get the job done.
I totally agree. The last table example is the one that I had in mind when I started out. Perhaps the test example confuses things, because indexing the 'Name' column is only done in order to generate a two-row HTML header when |
I am having a hard time reading the table with multi-indexed columns from census India using read_html. I just found the issue reported here and the discussion about how the original example reported was not multi-indexed, so I thought it might be useful to mention this example.
@AashitaK the table you linked is a little more complex than the ones I'm trying to handle here as it has multiple column headers but they span a variable number of columns. I am interested in tackling this problem next. |
@brianhuey Thanks. |
I believe the code achieves what @jorisvandenbossche suggests. It reads multiple header rows from HTML syntax, and parses them as a multiindex column, producing a dataframe with columns that match the structure of an HTML table. I'm open to any suggestions on where to improve how this is handled (i.e. the 'Unnamed: level' thing, specifying multiple headers). |
pandas/tests/io/test_html.py
Outdated
@@ -869,6 +881,18 @@ def test_computer_sales_page(self):
        data = os.path.join(DATA_PATH, 'computer_sales_page.html')
        self.read_html(data, header=[0, 1])

    def test_multiple_header_rows(self):
        # Issue #13434
        expected_df = DataFrame(data=[("Hillary", 68, "D"),
is this test in here twice?
Yes, it tests it twice: once using the BeautifulSoup parser and again with lxml.
whoosh, we should never actually duplicate the test code (not sure if this is existing).
Rather, use a base class (or parametrization).
If this is the current style, can you please create an issue to fix this (if not, please fix this!)
I see, I misunderstood what was going on there. I'll fix.
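The base-class approach suggested in this review thread can be sketched with the standard library's `unittest`. Everything here is illustrative rather than taken from the pandas test suite: `parse_header` is a hypothetical stand-in for a parser call, and the class names are made up. The test body is written once in a mixin and each subclass only sets the `flavor` it should run with.

```python
import unittest


def parse_header(cells, flavor):
    """Hypothetical stand-in for a flavor-specific parser call."""
    if flavor not in ("bs4", "lxml"):
        raise ValueError(f"unknown flavor: {flavor}")
    return [cell.strip() for cell in cells]


class HeaderTestsMixin:
    """Shared test bodies; subclasses choose the parser flavor."""

    flavor = None  # overridden by each concrete test class

    def test_multiple_header_rows(self):
        # Issue #13434
        result = parse_header([" Name ", "Age"], flavor=self.flavor)
        self.assertEqual(result, ["Name", "Age"])


class TestBS4(HeaderTestsMixin, unittest.TestCase):
    flavor = "bs4"


class TestLxml(HeaderTestsMixin, unittest.TestCase):
    flavor = "lxml"


def run_suite():
    """Run both concrete classes and return the unittest result object."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for case in (TestBS4, TestLxml):
        suite.addTests(loader.loadTestsFromTestCase(case))
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

The mixin deliberately does not inherit from `unittest.TestCase`, so the shared test is only collected through the two concrete classes and runs once per flavor.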
8e7b03e
to
b54aa0c
Compare
thanks @brianhuey followups always welcome! |
I appreciate you working with me through this. First time contributing to an open source project! |
@brianhuey you did great! keep em coming! |
…13434
closes pandas-dev#13434
Author: Brian <[email protected]>
Author: S. Brian Huey <[email protected]>
Closes pandas-dev#15242 from brianhuey/thead-improvement and squashes the following commits:
fc1c80e [S. Brian Huey] Merge branch 'master' into thead-improvement
b54aa0c [Brian] removed duplicate test case
6ae2860 [Brian] updated docstring and io.rst
41fe8cd [Brian] review changes
873ea58 [Brian] switched from range to lrange
cd70225 [Brian] ENH:read_html() handles tables with multiple header rows pandas-dev#13434