read_html infers wrong datatype #7032

Closed
@ghost

Description

As the code below shows, columns 3, 8, 9, and 10 were misinterpreted as datetime objects, and columns 1, 6, and 7 should be integers. Only columns 2, 4, 5, and 11 appear to have been read properly. How do I force the columns to be interpreted as the proper types? I suppose I could pass 'infer_types=False' and convert manually afterwards, but since infer_types is going away, that won't keep working.

In [63]: import pandas as pd
In [64]: path = r"http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
In [65]: tables = pd.read_html(path)
In [66]: df = tables[1]

In [67]: df.head()
Out[67]:
        1           2          3         4         5        6        7   8   \
1  !000001  California        NaT  37253956  33871648  !000053  !000055 NaT
2  !000002       Texas        NaT  25145561  20851820  !000036  !000038 NaT
3  !000003    New York 1965-11-27  19378102  18976457  !000027  !000029 NaT
4  !000004     Florida        NaT  18801310  15982378  !000027  !000029 NaT
5  !000005    Illinois        NaT  12830632  12419293  !000018  !000020 NaT

   9   10      11
1 NaT NaT  11.91%
2 NaT NaT   8.04%
3 NaT NaT   6.19%
4 NaT NaT   6.01%
5 NaT NaT   4.10%

[5 rows x 11 columns]

dtype: object

In [68]: df.dtypes
Out[68]:
1             object
2             object
3     datetime64[ns]
4             object
5             object
6             object
7             object
8     datetime64[ns]
9     datetime64[ns]
10    datetime64[ns]
11            object
dtype: object
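
The manual conversion mentioned in the description could look roughly like the sketch below. It does not fetch the Wikipedia page; instead it rebuilds a few cells from the `Out[67]` display as plain strings, which is an assumption about what the columns would contain if no type inference were applied. The helper names `to_int` and `to_percent` are hypothetical and not part of pandas.

```python
import pandas as pd

# A few cells copied from the Out[67] display above, kept as raw strings.
# Assumption: this is what the columns hold when nothing is inferred.
raw = pd.DataFrame(
    {
        1: ["!000001", "!000002", "!000003"],
        2: ["California", "Texas", "New York"],
        4: ["37253956", "25145561", "19378102"],
        11: ["11.91%", "8.04%", "6.19%"],
    }
)


def to_int(col):
    """Hypothetical helper: drop non-digit characters, then parse as integers."""
    digits = col.astype(str).str.replace(r"\D", "", regex=True)
    # errors="coerce" turns unparseable cells into missing values;
    # the nullable "Int64" dtype can hold those alongside integers.
    return pd.to_numeric(digits, errors="coerce").astype("Int64")


def to_percent(col):
    """Hypothetical helper: strip a trailing '%' and parse as float."""
    return pd.to_numeric(col.astype(str).str.rstrip("%"), errors="coerce")


clean = raw.copy()
clean[1] = to_int(raw[1])         # "!000001" becomes 1
clean[4] = to_int(raw[4])         # population counts
clean[11] = to_percent(raw[11])   # "11.91%" becomes 11.91

print(clean.dtypes)
```

Converting column by column keeps the choice of type explicit, so a column is never reinterpreted as a datetime unless that is asked for.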

Labels

Dtype Conversions: Unexpected or buggy dtype conversions
IO HTML: read_html, to_html, Styler.apply, Styler.applymap
