BUG: read_html does not parse correctly the header of non-string columns #5048

Closed
@alefnula

Description

I presume the problem is that the data is parsed first and the header row is selected afterwards. When a column's dtype is numeric, the cell that should become the column name is not a valid number, so it becomes NaN.

Sample data:

import io

import pandas as pd

# data1: header row inside <thead>
data1 = io.StringIO(u'''<table>
    <thead>
        <tr>
            <th>Country</th>
            <th>Municipality</th>
            <th>Year</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Ukraine</td>
            <th>Odessa</th>
            <td>1944</td>
        </tr>
    </tbody>
</table>''')
# data2: no <thead>; the header is the first row of <tbody>
data2 = io.StringIO(u'''
<table>
    <tbody>
        <tr>
            <th>Country</th>
            <th>Municipality</th>
            <th>Year</th>
        </tr>
        <tr>
            <td>Ukraine</td>
            <th>Odessa</th>
            <td>1944</td>
        </tr>
    </tbody>
</table>''')

Output:

>>> pd.read_html(data1)[0]
   Country Municipality  Year
0  Ukraine       Odessa  1944
>>> pd.read_html(data2, header=0)[0]
0  Country Municipality   NaN
1  Ukraine       Odessa  1944
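
The presumed cause can be illustrated with a minimal sketch in plain Python. This is not pandas' actual parsing code; `to_number` and the variable names are hypothetical and exist only for this illustration. It shows that if a column is converted to numbers before the header cell is split off, the header text becomes NaN, while splitting the header off first keeps it intact.

```python
import math


def to_number(cell):
    """Hypothetical helper: convert a cell to float, or NaN if it is not numeric."""
    try:
        return float(cell)
    except ValueError:
        return math.nan


# The "Year" column of data2 as raw cell text, header cell included.
year_column = ["Year", "1944"]

# Order 1 (the suspected behaviour): convert the whole column, then pick the header.
converted = [to_number(cell) for cell in year_column]
header_after_conversion = converted[0]  # the text "Year" has already been lost

# Order 2: take the header from the raw text, then convert only the data cells.
header_before_conversion = year_column[0]
data = [to_number(cell) for cell in year_column[1:]]

print(header_after_conversion)
print(header_before_conversion, data)
```

If the presumption above is right, a fix would amount to selecting the header row from the raw cell text before any type inference runs on the remaining rows.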

Labels: Bug, IO HTML (read_html, to_html, Styler.apply, Styler.applymap)