ENH: consistent strftime behaviour #58179

Open
@smarie

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

strftime behaviour is currently not consistent for datetime arrays, for example with small dates (year < 1000):

> dta = pd.DatetimeIndex(np.array(['0020-01-01', '2020-01-02'], 'datetime64[s]'))
> dta.strftime(None)  # Default (fast, not os-dependent)
Index(['20-01-01', '2020-01-02'], dtype='object')  # no zero padding

> dta.strftime("%Y-%m-%d")  # Falls back on default, so fast and not os-dependent
Index(['20-01-01', '2020-01-02'], dtype='object')  # no zero padding

> dta.strftime("%Y_%m_%d")  # Custom format, uses Timestamp.strftime
Index(['0020_01_01', '2020_01_02'], dtype='object')  # Note the zero-padding on the year on Windows; this does not happen on Linux

Also, Timestamp.strftime has the following behaviour:

> dta[0].strftime("%Y-%m-%d")  # Relies on datetime.strftime > OS-dependent.
'0020-01-01'  # Note the zero-padding on the year on Windows; this does not happen on Linux

A similar inconsistency can be observed with negative dates:

> dta = pd.DatetimeIndex(np.array(['-0020-01-01', '2020-01-02'], 'datetime64[s]'))
> dta.strftime(None)  # Default (fast, not os-dependent)
Index(['-20-01-01', '2020-01-02'], dtype='object')  # note: year is not zero-padded

> dta.strftime("%Y-%m-%d")  # Falls back on default
Index(['-20-01-01', '2020-01-02'], dtype='object')  # note: year is not zero-padded

> dta.strftime("%Y_%m_%d")  # Custom so falls back on Timestamp.strftime
NotImplementedError: strftime not yet supported on Timestamps which are outside the range of Python's standard library. For now, please call the components you need (such as `.year` and `.month`) and construct your string from there.

> dta[0].strftime("%Y-%m-%d")  # Relies on datetime.strftime > OS-dependent.
NotImplementedError: strftime not yet supported on Timestamps which are outside the range of Python's standard library. For now, please call the components you need (such as `.year` and `.month`) and construct your string from there.
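
The error message above suggests building the string from components. A minimal sketch of that workaround follows, using only the standard library. The helper name `format_ymd` and the choice to zero-pad the year to four digits (with a leading minus sign for negative years) are assumptions made for illustration, not pandas behaviour.

```python
from datetime import date


def format_ymd(year: int, month: int, day: int, sep: str = "-") -> str:
    """Build a date string from integer components (hypothetical helper).

    The year is zero-padded to at least four digits and negative years get
    a leading '-', so the result does not depend on the platform's C strftime.
    """
    sign = "-" if year < 0 else ""
    return f"{sign}{abs(year):04d}{sep}{month:02d}{sep}{day:02d}"


d = date(20, 1, 1)
print(format_ymd(d.year, d.month, d.day))  # 0020-01-01
print(format_ymd(-20, 1, 1, sep="_"))      # -0020_01_01
```

The same components are available on a Timestamp through `.year`, `.month` and `.day`, as the error message indicates.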

Why is this so?

strftime was initially implemented through the C strftime, which is platform-dependent. This is also how CPython implements datetime.strftime. (Note that the documentation makes some choices: for example, it shows the Windows-style strftime output for %Y, because this is the format that strptime requires in order to parse it.)

However, strftime becomes slow when executed cell by cell on arrays, because the format string is parsed again for every cell. This problem was detected in #44764 and fixed with appreciable performance gains (see #46116 (comment)).
The fix bypasses the OS-dependent strftime and instead uses Python string formatting, which is much faster. One drawback is that users cannot easily tell which engine is used, and when.
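
To illustrate the idea only (this is not the pandas implementation), the sketch below compares per-element `datetime.strftime` with a Python string template that is written once and then filled from integer components. The template and its four-digit year padding are assumptions chosen for this example.

```python
from datetime import datetime

dates = [datetime(2020, 1, 2), datetime(2021, 12, 31)]

# Per-element strftime: the "%Y-%m-%d" format is interpreted for every element.
via_strftime = [d.strftime("%Y-%m-%d") for d in dates]

# Template defined once, then filled from the integer components of each element.
template = "{year:04d}-{month:02d}-{day:02d}"
via_template = [
    template.format(year=d.year, month=d.month, day=d.day) for d in dates
]

# For four-digit years both approaches agree.
assert via_strftime == via_template
```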

Feature Description

I propose that we introduce an engine parameter to the datetime/period array-like strftime operation, in the same spirit as the pyarrow/fastparquet engine switcher in to_parquet: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html

This parameter could have three values:

  • 'pystr': converts the strftime format to a Python string template, and uses the Python string formatting engine. Raises an UnsupportedStrFmtDirective error when a strftime directive is not supported by the engine. This engine is significantly faster than 'os' on large arrays.
  • 'os': uses the platform (C lib) strftime through datetime.strftime. This engine is OS-dependent.
  • 'auto' (default): if the strftime format can be successfully converted to a Python string template, uses 'pystr'. Otherwise, silently falls back on 'os'.
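
The three values above could behave roughly as in the following sketch, which operates on plain `datetime` objects rather than pandas arrays. Everything here is hypothetical: the function names `to_template` and `format_datetimes`, the small directive mapping, and the four-digit year padding are assumptions for illustration. Only the engine names and the `UnsupportedStrFmtDirective` error name come from the proposal.

```python
import re
from datetime import datetime


class UnsupportedStrFmtDirective(ValueError):
    """Raised when a directive has no Python-template equivalent."""


# Hypothetical, deliberately small mapping from directives to template fields.
_DIRECTIVES = {
    "%Y": "{year:04d}",
    "%m": "{month:02d}",
    "%d": "{day:02d}",
    "%H": "{hour:02d}",
    "%M": "{minute:02d}",
    "%S": "{second:02d}",
    "%%": "%",
}


def to_template(fmt: str) -> str:
    """Convert a strftime format to a str.format template, or raise."""

    def replace(match: re.Match) -> str:
        directive = match.group(0)
        if directive not in _DIRECTIVES:
            raise UnsupportedStrFmtDirective(directive)
        return _DIRECTIVES[directive]

    # Escape literal braces so str.format leaves them untouched.
    escaped = fmt.replace("{", "{{").replace("}", "}}")
    return re.sub(r"%.?", replace, escaped)


def format_datetimes(values, fmt: str, engine: str = "auto") -> list:
    """Format datetimes with the 'pystr', 'os' or 'auto' engine."""
    if engine not in ("auto", "pystr", "os"):
        raise ValueError(f"unknown engine: {engine!r}")
    if engine != "os":
        try:
            template = to_template(fmt)  # converted once for the whole array
        except UnsupportedStrFmtDirective:
            if engine == "pystr":
                raise  # 'pystr' is strict; 'auto' silently falls back to 'os'
        else:
            return [
                template.format(
                    year=v.year, month=v.month, day=v.day,
                    hour=v.hour, minute=v.minute, second=v.second,
                )
                for v in values
            ]
    # 'os' engine: delegates to the platform strftime via datetime.strftime.
    return [v.strftime(fmt) for v in values]


values = [datetime(2020, 1, 2, 3, 4, 5)]
print(format_datetimes(values, "%Y_%m_%d", engine="pystr"))  # ['2020_01_02']
print(format_datetimes(values, "%j"))  # '%j' is unmapped, so 'auto' uses 'os'
```

With `engine="pystr"`, the second call would raise `UnsupportedStrFmtDirective` instead of falling back, which makes the choice of engine explicit to the caller.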

In addition, I propose to add an equivalent parameter to the instance-level Timestamp.strftime and Period.strftime.

Alternative Solutions

Alternatively, if we think that changing the instance-level Timestamp.strftime and Period.strftime could negatively impact performance for some usages, we could sacrifice consistency between the array strftime and the instance strftime, and keep a default value of 'os' for the instance-level methods.

Additional Context

This was identified during discussions in #51298 (comment)
