Strong coupling between read_html and automatic type conversion results in lost data for incorrectly inferred types

I am using read_html to read an HTML table that contains timedelta information in various formats, e.g. "20s" or "20 seconds" for 20 seconds, "5 minutes", "12 hours", "4 days". The default parser mishandles these in two ways: 1) it converts the values to datetimes, with the duration added as a timedelta to the start of the current day, and 2) it sets the values expressed in days to NaT.

Previously, I could set infer_types to None and no types would be inferred. Now the original data is simply lost, and read_html gives me no option to preserve it.

Either of the following approaches would solve the issue for me:

1) Re-enable infer_types=None. I fail to see why this was ever disabled. Coupling of behaviors is rarely desirable in a library. The only behavior I actually want is to read an HTML table into a dataframe of strings. From there, I can convert the data as needed, or not at all.

2) Implement date_parser as in read_csv. While this would solve the issue for me, it would still leave unavoidable, undesirable behavior for other data types, e.g. the parser auto-converting monetary amounts to float when Decimal is required.

I'd love for pandas to be my go-to library for scraping web page tables, but with the current behavior, it's pretty much useless to me.
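To illustrate what I mean by converting the data myself, here is a rough sketch of the post-processing I would apply if read_html handed back plain strings. It uses only the standard library. The helper names `parse_duration` and `parse_money` are hypothetical, and the accepted unit spellings are assumptions based on the formats listed above, not anything pandas provides.

```python
import re
from datetime import timedelta
from decimal import Decimal

# Assumed unit spellings; extend as needed for the table being scraped.
_UNIT_SECONDS = {
    "s": 1, "sec": 1, "second": 1, "seconds": 1,
    "m": 60, "min": 60, "minute": 60, "minutes": 60,
    "h": 3600, "hour": 3600, "hours": 3600,
    "d": 86400, "day": 86400, "days": 86400,
}
_DURATION_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*$")


def parse_duration(text):
    """Return a timedelta for strings such as '20s' or '4 days', else None."""
    match = _DURATION_RE.match(text)
    if match is None:
        return None
    amount, unit = match.groups()
    seconds_per_unit = _UNIT_SECONDS.get(unit.lower())
    if seconds_per_unit is None:
        return None
    return timedelta(seconds=float(amount) * seconds_per_unit)


def parse_money(text):
    """Return an exact Decimal for strings such as '$1,234.50'.

    Raises decimal.InvalidOperation if no digits remain after stripping.
    """
    return Decimal(re.sub(r"[^\d.\-]", "", text))


# Hypothetical raw cell values, as they would appear with no type inference.
raw_cells = ["20s", "20 seconds", "5 minutes", "12 hours", "4 days"]
durations = [parse_duration(cell) for cell in raw_cells]
```

None of this is possible once the parser has already replaced the original strings with datetimes or NaT, which is why I need the raw text preserved.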

see also:

https://github.com/pydata/pandas/issues/7037

https://github.com/pydata/pandas/pull/4770

Code example (pandas 0.16.2, Python 3.4.3):

```
import re

import pandas as pd

url = 'http://clashofclans.wikia.com/wiki/Barbarian'

# read_html returns a list of DataFrames, one per table whose text matches the regex.
overview_regex = re.compile('Preferred|Radius')
overview_table = pd.read_html(url, match=overview_regex, header=0)
print(overview_table)  # expected attack speed: 20s or 20 seconds; actual: 2015-07-27 00:00:01

print('----')
levels_regex = re.compile('Hitpoints|Total')
levels_table = pd.read_html(url, match=levels_regex, header=0)
print(levels_table)  # expected research time in hours or days; actual: NaT or 2015-07-27 06:00:00
```

