BUG: read_html does not parse correctly the header of non-string columns #5048

I presume the problem is that the data is parsed first and the header row is selected out afterwards. When a column's dtype is numeric, the cell that should become the column name is not a valid number, so it becomes `NaN`.
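To illustrate the suspected mechanism outside of `read_html`, here is a minimal sketch. It assumes the header label is still part of the column when the numeric conversion happens; `raw_year` is a made-up stand-in, and this is not a claim about what the parser does internally.

``` python
import pandas as pd

# Hypothetical column as it might look before the header row is split off:
# the label "Year" sits on top of the numeric data.
raw_year = pd.Series(["Year", "1944"])

# Coercing the whole column to numbers turns the non-numeric label into NaN.
coerced = pd.to_numeric(raw_year, errors="coerce")
print(coerced.tolist())  # [nan, 1944.0]
```

If the header is picked from the first row only after a conversion like this, the name of the `Year` column would already be lost.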

Sample data:

``` python
import io
import pandas as pd

data1 = io.StringIO(u'''<table>
    <thead>
        <tr>
            <th>Country</th>
            <th>Municipality</th>
            <th>Year</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Ukraine</td>
            <th>Odessa</th>
            <td>1944</td>
        </tr>
    </tbody>
</table>''')
data2 = io.StringIO(u'''
<table>
    <tbody>
        <tr>
            <th>Country</th>
            <th>Municipality</th>
            <th>Year</th>
        </tr>
        <tr>
            <td>Ukraine</td>
            <th>Odessa</th>
            <td>1944</td>
        </tr>
    </tbody>
</table>''')
```

Output:

``` python
>>> pd.read_html(data1)[0]
   Country Municipality  Year
0  Ukraine       Odessa  1944
>>> pd.read_html(data2, header=0)[0]
0  Country Municipality   NaN
1  Ukraine       Odessa  1944
```
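For comparison, a sketch of the ordering I would expect to work: take the labels from the first row while every cell is still a string, and convert the remaining rows to numbers only afterwards. The `rows` list is a hand-written stand-in for the cells of `data2`, not output from `read_html`, and this is only an assumption about a possible fix.

``` python
import pandas as pd

# Hypothetical stand-in for the table cells of data2, all kept as strings.
rows = [
    ["Country", "Municipality", "Year"],
    ["Ukraine", "Odessa", "1944"],
]
raw = pd.DataFrame(rows, dtype=object)

# Take the labels from the first row while they are still strings ...
labels = raw.iloc[0].tolist()
body = raw.iloc[1:].reset_index(drop=True)
body.columns = labels

# ... and only then convert the columns that are fully numeric.
for name in body.columns:
    try:
        body[name] = pd.to_numeric(body[name])
    except (ValueError, TypeError):
        pass  # leave non-numeric columns as strings

print(body)
```

With this ordering the `Year` label survives, and the `Year` column still ends up numeric.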

