Description
to_datetime has an argument infer_datetime_format which, if set to True, will guess the format from the first non-NaN row.
People (users, but also core devs, e.g. here and here) expect that the format inferred from the first row will be applied to the rest of the Series, i.e. that the following two calls should behave the same:
pd.to_datetime(['01-31-2000', '20-01-2000'], infer_datetime_format=True)
pd.to_datetime(['01-31-2000', '20-01-2000'], format='%m-%d-%Y')
However, they don't: the latter raises, whilst the former silently switches format midway, parsing the first element as month-first and the second as day-first.
Although this is documented in the user guide, it's not what people expect.
Making this argument strict would not only align better with people's expectations, but also simplify the codebase, as it would get rid of special-casing such as
pandas/pandas/core/tools/datetimes.py
Lines 488 to 499 in ac648ee
TL;DR I'm suggesting that when using infer_datetime_format=True, the format detected from the first non-NaN value should be used to parse the rest of the Series, exactly as if the user had passed it to format=
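To make the proposal concrete, here is a rough standard-library sketch of the strict behaviour. It is not how pandas implements inference: CANDIDATE_FORMATS, guess_format and strict_parse are hypothetical names made up for illustration, and the real inference logic handles far more formats than this short candidate list.

```python
from datetime import datetime

# Hypothetical candidate list; pandas' real inference is far more general.
CANDIDATE_FORMATS = ("%m-%d-%Y", "%d-%m-%Y", "%Y-%m-%d")


def guess_format(value, candidates=CANDIDATE_FORMATS):
    """Return the first candidate format that parses `value`, else None."""
    for fmt in candidates:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        return fmt
    return None


def strict_parse(values, candidates=CANDIDATE_FORMATS):
    """Infer the format from the first non-missing value, then apply it to all."""
    first = next((v for v in values if v is not None), None)
    if first is None:
        # Nothing to infer from: every entry is missing.
        return [None for _ in values]
    fmt = guess_format(first, candidates)
    if fmt is None:
        raise ValueError(f"could not infer a format from {first!r}")
    # strptime raises ValueError for any value that does not match fmt,
    # so there is no per-element fallback to a different format.
    return [None if v is None else datetime.strptime(v, fmt) for v in values]
```

With this sketch, strict_parse(['01-31-2000', '20-01-2000']) infers '%m-%d-%Y' from the first element and then raises ValueError on the second, which mirrors what passing format='%m-%d-%Y' explicitly does.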
This would be one step towards addressing #12585
@pandas-dev/pandas-core any thoughts here?
EDIT: I'm hoping that #48621 can supersede this