Description
Code Sample, a copy-pastable example if possible
If we make a Timestamp in an ambiguous DST period while specifying via the offset (or by supplying Timestamp.value directly) that the time is before the DST switch, the representation then shows that it is after the DST switch. This is backed up by calling Timestamp.tz.utcoffset(Timestamp).
IN:
t1 = pd.Timestamp(1382837400000000000, tz='dateutil/Europe/London')
t1
OUT:
Timestamp('2013-10-27 01:30:00+0100', tz='dateutil/GB-Eire')
IN:
t2 = pd.Timestamp(1382837400000000000, tz='Europe/London')
t2
OUT:
Timestamp('2013-10-27 01:30:00+0000', tz='Europe/London')
Problem description
The reason for this bug looks to be buried deep in the interaction of pandas and dateutil.
So this is what I've been able to dig up. When we try to determine whether we are in DST or not, we rely on timezone.utcoffset of the underlying timezone package. What gets executed in dateutil is this:
def utcoffset(self, dt):
    ...
    return self._find_ttinfo(dt).delta

def _find_ttinfo(self, dt):
    idx = self._resolve_ambiguous_time(dt)
    ...

def _resolve_ambiguous_time(self, dt):
    idx = self._find_last_transition(dt)

    # If we have no transitions, return the index
    _fold = self._fold(dt)
    if idx is None or idx == 0:
        return idx

    # If it's ambiguous and we're in a fold, shift to a different index.
    idx_offset = int(not _fold and self.is_ambiguous(dt, idx))

    return idx - idx_offset
dateutil is expecting an ordinary datetime.timedelta object here, so this is what it does:
- Use _find_last_transition to get the index of the last DST transition before dt. This is done by computing timedelta.total_seconds since epoch time. Our pandas.Timedelta.total_seconds is smart, and returns different total_seconds for before and after DST, since we basically return Timedelta.value, which is the same as Timestamp.value when counting since epoch time (because of how _Timestamp.__sub__ in c_timestamp.pyx is implemented).
This is what we do (doesn't care about dt.replace(tzinfo=None)):
def total_seconds(self):
    """
    Total duration of timedelta in seconds (to microsecond precision).
    """
    # GH 31043
    # Microseconds precision to avoid confusing tzinfo.utcoffset
    return (self.value - self.value % 1000) / 1e9
This is what datetime.timedelta does (loses DST awareness after dt.replace(tzinfo=None)):
def total_seconds(self):
    """Total seconds in the duration."""
    return ((self.days * 86400 + self.seconds) * 10**6 +
            self.microseconds) / 10**6
- The remainder of _resolve_ambiguous_time corrects for ambiguous times, since datetime.timedelta.total_seconds after dt.replace(tzinfo=None) isn't DST-aware. It checks whether we are in an ambiguous period and whether this is the first time this wall-clock time has occurred: this is what self._fold is for. fold is 0 for the first time, and 1 for the second time. If it's the first time, dateutil shifts the relevant transition index back by 1, since it thinks that total_seconds always returns the number of seconds calculated using the second time.
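The two steps above can be sketched with a toy model. This is only an illustration under stated assumptions, not dateutil's implementation: TRANSITIONS, OFFSETS, and every helper below are hypothetical names, the table holds just the two 2013 Europe/London transitions, and only the final index arithmetic mirrors the quoted _resolve_ambiguous_time.

```python
import bisect
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


def wall_seconds(naive_dt):
    """Seconds since epoch computed from naive wall-clock fields only."""
    return (naive_dt - EPOCH).total_seconds()


# Hypothetical two-entry table for Europe/London in 2013, expressed in
# wall-clock seconds relative to standard time (GMT).
TRANSITIONS = [
    wall_seconds(datetime(2013, 3, 31, 1, 0)),   # switch to BST
    wall_seconds(datetime(2013, 10, 27, 1, 0)),  # switch back to GMT
]
OFFSETS = [3600, 0]  # UTC offset (seconds) in effect after each transition


def find_last_transition(ts):
    """Index of the last transition at or before ts, or None if there is none."""
    idx = bisect.bisect_right(TRANSITIONS, ts) - 1
    return idx if idx >= 0 else None


def is_ambiguous(ts, idx):
    """True if ts falls in the repeated hour right after transition idx."""
    if idx is None or idx == 0:
        return False
    overlap = OFFSETS[idx - 1] - OFFSETS[idx]
    return ts < TRANSITIONS[idx] + overlap


def resolve_ambiguous_time(ts, fold):
    """Same index arithmetic as the quoted dateutil method."""
    idx = find_last_transition(ts)
    if idx is None or idx == 0:
        return idx
    return idx - int(not fold and is_ambiguous(ts, idx))


def utcoffset_seconds(naive_dt, fold):
    idx = resolve_ambiguous_time(wall_seconds(naive_dt), fold)
    # Toy assumption: anything before the first table entry is GMT.
    return 0 if idx is None else OFFSETS[idx]


ambiguous_wall = datetime(2013, 10, 27, 1, 30)
print(utcoffset_seconds(ambiguous_wall, fold=0))  # first 01:30, index shifted back
print(utcoffset_seconds(ambiguous_wall, fold=1))  # second 01:30, index kept
```

In this model the wall-clock count always lands on the later transition, and fold=0 is the only thing that moves the lookup back to the earlier offset.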
I'd like to discuss how we are going to approach this. From what I see, there isn't much we can do on our end. Making total_seconds non-DST-aware by default is bad, because that would be making our implementation less precise unless the user passes a parameter. Scratch that. The problem isn't so much the total_seconds implementation as it is the Timestamp.__sub__ implementation, which preserves value when we subtract epoch time.
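To make the difference concrete, here is a small standard-library sketch that applies the two total_seconds formulas quoted earlier to the ambiguous 01:30. It assumes, as described above, that the value reaching the pandas formula is the UTC-based nanosecond count; wall_clock_seconds, value_based_seconds, and the constants are hypothetical names used only for illustration.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


def wall_clock_seconds(naive_dt):
    """What datetime.timedelta.total_seconds yields once tzinfo is dropped."""
    return (naive_dt - EPOCH).total_seconds()


def value_based_seconds(value_ns):
    """The quoted Timedelta.total_seconds formula applied to a raw ns value."""
    return (value_ns - value_ns % 1000) / 1e9


FIRST_0130_NS = 1382833800000000000   # 00:30 UTC, i.e. 01:30 +0100
SECOND_0130_NS = 1382837400000000000  # 01:30 UTC, i.e. 01:30 +0000
WALL_0130 = datetime(2013, 10, 27, 1, 30)

# The wall-clock count is identical for both occurrences of 01:30 ...
print(wall_clock_seconds(WALL_0130))
# ... while the value-based count differs by one hour between them.
print(value_based_seconds(FIRST_0130_NS))
print(value_based_seconds(SECOND_0130_NS))
```

Only the second occurrence matches the wall-clock count, which is the situation the fold correction in dateutil is written for.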
Another approach is to go to dateutil with this and implement a check there to avoid running the correction if they are dealing with a pandas.Timedelta. Might be tricky to do without introducing a dependency on pandas, though.
First came across this while solving #24329 in #30995
Expected Output
IN:
t1
OUT:
Timestamp('2013-10-27 01:30:00+0000', tz='dateutil/GB-Eire')
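A quick standard-library check of why +0000 is the expected offset. It assumes the value is nanoseconds since the Unix epoch in UTC, and it hard-codes the 2013 UK switch-back instant (01:00 UTC on 27 October) as the hypothetical constant SWITCH_UTC instead of reading it from a tz database.

```python
from datetime import datetime, timezone

VALUE_NS = 1382837400000000000  # the value passed to pd.Timestamp above

# Interpret the value as a UTC instant (whole seconds are enough here).
utc_instant = datetime.fromtimestamp(VALUE_NS // 10**9, tz=timezone.utc)

# Assumption: UK clocks went back from BST to GMT at this instant in 2013.
SWITCH_UTC = datetime(2013, 10, 27, 1, 0, tzinfo=timezone.utc)

after_switch = utc_instant >= SWITCH_UTC
print(utc_instant.isoformat(), after_switch)
```

Since the instant falls after the switch, the local time is the second 01:30, which carries offset +0000.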
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : ru_RU.UTF-8
LOCALE : None.None
pandas : 0.26.0.dev0+1947.gca3bfcc54.dirty
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200106
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.2.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.47.0