Description
As this comes up regularly:
- BUG: Unexpected pd.to_datetime result #12583
- pd.to_datetime/pd.DatetimeIndex unexpected behavior #12501
- Bug: inconsistent date parsing by to_datetime #7348
- Determine datetime_format from more than one element #15075
- pandas.io.tests.test_parsers.Test*.test_yy_format fails in 0.18.0 #12611 deal with
yearfirst/datefirst
inread_csv
- COMPAT: dateutil 2.5.2 broke sometests #12730, compat with dateutil 2.5.2
dayfirst=True, yearfirst=True
fixes - Greedy date inference in read_csv leads to inconsistent data #14301 raise if inconsistency in
read_csv
withdayfirst/yearfirst
- performance read_csv : using day first 23x to 35x slower than setting the format explicitly #25848
The problem
Depending on the string formats, you can get inconsistent parsing (meaning: different interpretation of parts of the strings as days vs months between different values in the passed list/array of strings).
For example:
In [22]: pd.to_datetime(["31/12/2014", "10/03/2011"])
Out[22]: DatetimeIndex(['2014-12-31', '2011-10-03'])
where the first is parsed as 'dayfirst' and the second not. Above particular case can be solved by passing dayfirst=True
, but in any case, the default behaviour of to_datetime
results in inconsistent parsing here.
Another typical case if where you do provide dayfirst=True
but have a malformed date, then it will fall back to 'not dayfirst' (#3341).
The reason
The reason is the flexibility of dateutil.parser.parse
. This is in the first place also a feature, as it lets you parse messy datetime colums. But when it comes to flexibility regarding dayfirst
/yearfirst
, this gives unexpected results.
So dateutil will 'try' to use dayfirst=True
, but when this does not give a valid date, it will ignore dayfirst=True
and parse the string anyway:
In [3]: dateutil.parser.parse('1/12/2012', dayfirst=True)
Out[3]: datetime.datetime(2012, 12, 1, 0, 0)
In [4]: dateutil.parser.parse('1/13/2012', dayfirst=True)
Out[4]: datetime.datetime(2012, 1, 13, 0, 0)
The same is true for dayfirst=False
(the default) as this will also try with 'day first' if the other way around failed.
The issue for the lack of strictness of dayfirst is actually #3341.
Possible solutions
- In some cases passing
dayfirst=True
is sufficient (if you don't have messy or malformed dates). - The typical answer is saying: "if you want to be sure to have strict and consistent parsing, provide a
format
argument.
We should probably stress this more in the docstring ofto_datetime
But, regardless of the two points above, fact is that the default behaviour of to_datetime
(without passing any arguments) can do inconsistent parsing and so give unexpected results.
Possible ways to have better default behaviour:
- Have the possibility to specify that
dateutil
should be strict aboutdayfirst
. I raised an issue for that here: Possibility to have a strict dayfirst dateutil/dateutil#214 - Changing the default of
infer_datetime_format
toTrue
would solve it in some cases (where the first datetime has the correct format you want), but certainly not all. - The idea of raising a warning when inconsistent parsing happened, is AFAIK not possible as pandas does not know how dateutil parsed the dates. And always warning when no format is passed or could be inferred (so in practice, when dateutil is used), is maybe to drastically, as the flexible parsing is also a feature.