Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
The behaviour of strftime is currently inconsistent for datetime arrays, for example with small dates (year < 1000):
> dta = pd.DatetimeIndex(np.array(['0020-01-01', '2020-01-02'], 'datetime64[s]'))
> dta.strftime(None) # Default (fast, not os-dependent)
Index(['20-01-01', '2020-01-02'], dtype='object') # no zero padding
> dta.strftime("%Y-%m-%d") # Falls back on default, so fast and not os-dependent
Index(['20-01-01', '2020-01-02'], dtype='object') # no zero padding
> dta.strftime("%Y_%m_%d") # Custom format, uses Timestamp.strftime
Index(['0020_01_01', '2020_01_02'], dtype='object') # Note the zero-padding on the year on Windows, which does not happen on Linux
Also, Timestamp.strftime has the following behaviour:
> dta[0].strftime("%Y-%m-%d") # Relies on datetime.strftime > OS-dependent.
'0020-01-01' # Note the zero-padding on the year on Windows, which does not happen on Linux
A similar inconsistency can be observed with negative dates:
> dta = pd.DatetimeIndex(np.array(['-0020-01-01', '2020-01-02'], 'datetime64[s]'))
> dta.strftime(None) # Default (fast, not os-dependent)
Index(['-20-01-01', '2020-01-02'], dtype='object') # note: year is not zero-padded
> dta.strftime("%Y-%m-%d") # Falls back on default
Index(['-20-01-01', '2020-01-02'], dtype='object') # note: year is not zero-padded
> dta.strftime("%Y_%m_%d") # Custom so falls back on Timestamp.strftime
NotImplementedError: strftime not yet supported on Timestamps which are outside the range of Python's standard library. For now, please call the components you need (such as `.year` and `.month`) and construct your string from there.
> dta[0].strftime("%Y-%m-%d") # Relies on datetime.strftime > OS-dependent.
NotImplementedError: strftime not yet supported on Timestamps which are outside the range of Python's standard library. For now, please call the components you need (such as `.year` and `.month`) and construct your string from there.
Why is this so?
strftime was initially implemented through C strftime, which is platform-dependent. This is how CPython implements it too, in datetime.strftime. (Note that the documentation makes some choices: for example, it shows the Windows-based strftime output for %Y, because this is the format required for strptime to parse it.)
However, strftime is a routine that becomes slow when executed cell-by-cell on arrays, because the format string is parsed again for every cell. This problem was detected in #44764 and fixed with appreciable performance gains in #46116 (comment).
The fix bypasses the OS-dependent strftime and instead uses Python string formatting, which is much faster. One drawback is the lack of transparency for users about which engine is used when.
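To make the idea concrete, here is a minimal sketch of converting a strftime format into a Python string template once, then reusing it for every element. This is not the actual pandas implementation: the helper name `convert_strftime_format`, the directive table, and the choice to zero-pad `%Y` to four digits are all assumptions made for illustration, and only a handful of directives are covered.

```python
import re
from datetime import datetime


class UnsupportedStrFmtDirective(ValueError):
    """Raised when a directive has no template equivalent (illustrative)."""


# Hypothetical mapping; zero-padding %Y to 4 digits is an assumption here.
_DIRECTIVE_TO_FIELD = {
    "%Y": "{year:04d}",
    "%m": "{month:02d}",
    "%d": "{day:02d}",
    "%H": "{hour:02d}",
    "%M": "{minute:02d}",
    "%S": "{second:02d}",
}


def convert_strftime_format(fmt: str) -> str:
    """Translate a strftime format into a str.format template."""
    # Escape literal braces so str.format treats them as plain text.
    escaped = fmt.replace("{", "{{").replace("}", "}}")

    def _replace(match: re.Match) -> str:
        directive = match.group(0)
        if directive == "%%":
            return "%"  # "%%" is a literal percent sign
        try:
            return _DIRECTIVE_TO_FIELD[directive]
        except KeyError:
            raise UnsupportedStrFmtDirective(directive) from None

    return re.sub(r"%(.)", _replace, escaped)


# The format string is parsed once ...
template = convert_strftime_format("%Y_%m_%d")

# ... and the resulting template is applied to each element.
dates = [datetime(20, 1, 1), datetime(2020, 1, 2)]
formatted = [
    template.format(
        year=d.year, month=d.month, day=d.day,
        hour=d.hour, minute=d.minute, second=d.second,
    )
    for d in dates
]
print(formatted)  # ['0020_01_01', '2020_01_02']
```

Because the output is built from integer components rather than the C library, it does not depend on the platform.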
Feature Description
I propose that we introduce an engine parameter to the datetime/period array-like strftime operation, in the same spirit as the pyarrow/fastparquet engine switcher in to_parquet: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html
This parameter could have three values:
- 'pystr': converts the strftime format to a Python string template, and uses the Python string formatting engine. Raises an UnsupportedStrFmtDirective error when a strftime directive is not supported by the engine. This engine is significantly faster than 'os' on large arrays.
- 'os': uses the platform (C lib) strftime through datetime.strftime. This engine is OS-dependent.
- 'auto' (default): if the strftime format can be successfully converted to a Python string template, uses 'pystr'. Otherwise, silently falls back on 'os'.
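The dispatch between the three values could look roughly like the sketch below. It works on plain `datetime` objects rather than pandas arrays, and the function `strftime_array`, the helper `_to_template`, and the three-directive table are hypothetical names used only to illustrate the proposed semantics. Zero-padding the year to four digits is again an assumption.

```python
from datetime import datetime


class UnsupportedStrFmtDirective(ValueError):
    """Raised by the 'pystr' engine for directives it cannot convert."""


# Illustrative subset of directives; fields read attributes of argument 0.
_FIELDS = {"Y": "{0.year:04d}", "m": "{0.month:02d}", "d": "{0.day:02d}"}


def _to_template(fmt):
    """Convert a strftime format to a str.format template, or raise."""
    out, i = [], 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            code = fmt[i + 1]
            if code == "%":
                out.append("%")
            elif code in _FIELDS:
                out.append(_FIELDS[code])
            else:
                raise UnsupportedStrFmtDirective(f"%{code}")
            i += 2
        else:
            # Escape braces so they stay literal in the template.
            out.append(ch.replace("{", "{{").replace("}", "}}"))
            i += 1
    return "".join(out)


def strftime_array(values, fmt, engine="auto"):
    """Format each datetime in `values` using the requested engine."""
    if engine not in ("auto", "pystr", "os"):
        raise ValueError(f"unknown engine: {engine!r}")
    if engine != "os":
        try:
            template = _to_template(fmt)
        except UnsupportedStrFmtDirective:
            if engine == "pystr":
                raise  # explicit 'pystr' never falls back
            # 'auto' silently falls through to the 'os' engine below
        else:
            return [template.format(v) for v in values]
    # 'os' engine: platform strftime through datetime.strftime
    return [v.strftime(fmt) for v in values]


dates = [datetime(20, 1, 1), datetime(2020, 1, 2)]
print(strftime_array(dates, "%Y-%m-%d", engine="pystr"))
# ['0020-01-01', '2020-01-02']
print(strftime_array(dates[1:], "%j", engine="auto"))  # falls back on 'os'
```

With an explicit engine, users who need reproducible output across platforms can request 'pystr' and get an error instead of a silent fallback.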
In addition, I propose to add an equivalent parameter to the instance-level Timestamp.strftime and Period.strftime.
Alternative Solutions
Alternatively, if we think that changing the instance-level Timestamp.strftime and Period.strftime could negatively impact performance for some usages, we could sacrifice consistency between the array strftime and the instance strftime, and keep a default value of 'os' for the instance-level one.
Additional Context
This was identified during discussions in #51298 (comment)