BUG: Timestamp UTC offset incorrect for dateutil tz in ambiguous DST time #31338

Closed
@AlexKirko

Code Sample, a copy-pastable example if possible

If we create a Timestamp in an ambiguous DST period and specify, via the offset or by supplying Timestamp.value directly, which side of the DST switch the time falls on, the representation shows the other side of the switch. Calling Timestamp.tz.utcoffset(Timestamp) returns the same incorrect offset.

IN:
t1 = pd.Timestamp(1382837400000000000, tz='dateutil/Europe/London')
t1

OUT:
Timestamp('2013-10-27 01:30:00+0100', tz='dateutil/GB-Eire')

IN:
t2 = pd.Timestamp(1382837400000000000, tz='Europe/London')
t2

OUT:
Timestamp('2013-10-27 01:30:00+0000', tz='Europe/London')
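
For reference, the integer value can be decoded with the standard library alone. This is a minimal sketch, not pandas code: it assumes the UK switch from BST to GMT on 2013-10-27 took effect at 01:00 UTC, and hard-codes that instant (the name transition_utc is made up for illustration) instead of reading a tz database.

```python
from datetime import datetime, timedelta, timezone

VALUE_NS = 1382837400000000000  # the Timestamp.value used in the report

# Interpret the value as nanoseconds since the epoch, in UTC.
utc = datetime.fromtimestamp(VALUE_NS // 10**9, tz=timezone.utc)

# Assumption: London switched from +01:00 to +00:00 at 01:00 UTC that day.
transition_utc = datetime(2013, 10, 27, 1, 0, tzinfo=timezone.utc)
offset = timedelta(0) if utc >= transition_utc else timedelta(hours=1)

local = utc.astimezone(timezone(offset))
print(utc.isoformat())    # 2013-10-27T01:30:00+00:00
print(local.isoformat())  # 2013-10-27T01:30:00+00:00
```

Under that assumption the value is the 01:30 wall-clock time with a +00:00 offset, which is what the pytz-backed Timestamp reports.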

Problem description

The reason for this bug looks to be buried deep in the interaction of pandas and dateutil.

So this is what I've been able to dig up. When we try to determine whether we are in DST or not, we rely on timezone.utcoffset of the underlying timezone package. What gets executed in dateutil is this:

def utcoffset(self, dt):
	...

	return self._find_ttinfo(dt).delta
	
def _find_ttinfo(self, dt):
	idx = self._resolve_ambiguous_time(dt)
	...

def _resolve_ambiguous_time(self, dt):
	idx = self._find_last_transition(dt)

	# If we have no transitions, return the index
	_fold = self._fold(dt)
	if idx is None or idx == 0:
		return idx

	# If it's ambiguous and we're in a fold, shift to a different index.
	idx_offset = int(not _fold and self.is_ambiguous(dt, idx))

	return idx - idx_offset
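
To make the index shift easier to follow, here is a toy model of the same lookup in plain Python. It is a simplified sketch and not dateutil's implementation: the transition table, the helper names (wall_seconds, find_last_transition, is_ambiguous, resolve_offset) and the two hard-coded London transitions for 2013 are all assumptions made for illustration.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def wall_seconds(dt):
    """Seconds since the epoch computed from the naive wall-clock time."""
    return (dt.replace(tzinfo=None) - EPOCH).total_seconds()

# Hypothetical table: (wall time at which the new offset starts, new offset).
TRANSITIONS = [
    (datetime(2013, 3, 31, 2, 0), timedelta(hours=1)),   # GMT -> BST
    (datetime(2013, 10, 27, 1, 0), timedelta(hours=0)),  # BST -> GMT
]
TRANS_WALL = [wall_seconds(t) for t, _ in TRANSITIONS]
OFFSETS = [off for _, off in TRANSITIONS]

def find_last_transition(dt):
    """Index of the last transition at or before dt, or None."""
    idx = bisect_right(TRANS_WALL, wall_seconds(dt))
    return idx - 1 if idx else None

def is_ambiguous(dt, idx):
    """True if dt lies in the repeated hour that follows transition idx."""
    overlap = (OFFSETS[idx - 1] - OFFSETS[idx]).total_seconds()
    return wall_seconds(dt) < TRANS_WALL[idx] + overlap

def resolve_offset(dt):
    idx = find_last_transition(dt)
    if idx is None:
        return timedelta(0)  # before the first modelled transition
    if idx == 0:
        return OFFSETS[0]
    # First occurrence (fold=0) of an ambiguous time: step back one entry.
    shift = int(not dt.fold and is_ambiguous(dt, idx))
    return OFFSETS[idx - shift]

print(resolve_offset(datetime(2013, 10, 27, 1, 30, fold=0)))  # 1:00:00
print(resolve_offset(datetime(2013, 10, 27, 1, 30, fold=1)))  # 0:00:00
```

In this model an object that reports fold=0 for an ambiguous wall time always gets the pre-transition offset, regardless of which instant the caller had in mind.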

dateutil is expecting an ordinary datetime.timedelta object here, so this is what it does:

  1. Use _find_last_transition to get the index of the last DST transition before dt. This is done by computing timedelta.total_seconds since epoch time. pandas.Timedelta.total_seconds returns a different number of seconds for the time before and after the DST switch, because it is derived from Timedelta.value, which equals Timestamp.value when counting from epoch time (a consequence of how _Timestamp.__sub__ in c_timestamp.pyx is implemented).

This is what pandas does (the result is unaffected by dt.replace(tzinfo=None)):

def total_seconds(self):
	"""
	Total duration of timedelta in seconds (to microsecond precision).
	"""
	# GH 31043
	# Microseconds precision to avoid confusing tzinfo.utcoffset
	return (self.value - self.value % 1000) / 1e9
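
The arithmetic in that return statement can be tried on plain integers. The function name total_seconds_from_ns is hypothetical and stands in for the method above; it only mirrors the expression, not the rest of the Timedelta class.

```python
def total_seconds_from_ns(value_ns: int) -> float:
    """Drop sub-microsecond digits, then convert nanoseconds to seconds."""
    return (value_ns - value_ns % 1000) / 1e9

print(total_seconds_from_ns(1382837400000000000))  # 1382837400.0
print(total_seconds_from_ns(1382837400000000999))  # 1382837400.0
print(total_seconds_from_ns(1_500_001_999))        # 1.500001
```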

This is what datetime.timedelta does (loses DST awareness after dt.replace(tzinfo=None)):

def total_seconds(self):
	"""Total seconds in the duration."""
	return ((self.days * 86400 + self.seconds) * 10**6 +
			self.microseconds) / 10**6

  2. The remainder of _resolve_ambiguous_time corrects for ambiguous times, since datetime.timedelta.total_seconds after dt.replace(tzinfo=None) isn't DST-aware. It checks whether we are in an ambiguous period and whether this is the first occurrence of this wall-clock time: this is what self._fold is for. fold is 0 for the first occurrence and 1 for the second. If it's the first occurrence, dateutil shifts the relevant transition index back by 1, since it assumes that total_seconds always returns the number of seconds calculated using the second occurrence.
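
The standard-library behaviour that this correction relies on can be checked directly: for naive datetimes, subtraction ignores fold, so both occurrences of an ambiguous wall-clock time give the same number of seconds since the epoch. The snippet uses only datetime and makes no assumptions about pandas or dateutil internals.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)

# The same wall-clock time, marked as first and second occurrence.
first = datetime(2013, 10, 27, 1, 30, fold=0)
second = datetime(2013, 10, 27, 1, 30, fold=1)

secs_first = (first - EPOCH).total_seconds()
secs_second = (second - EPOCH).total_seconds()

print(secs_first == secs_second)  # True
print(first.fold, second.fold)    # 0 1
```

The fold attribute is therefore the only thing that distinguishes the two occurrences once tzinfo has been dropped.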

I'd like to discuss how we are going to approach this. From what I see, there isn't much we can do on our end. Making total_seconds non-DST-aware by default is bad, because that would be making our implementation less precise unless the user passes a parameter. Scratch that. The problem isn't so much the total_seconds implementation as it is the Timestamp.__sub__ implementation which preserves value when we subtract epoch time.

Another approach is to go to dateutil with this and implement a check there to avoid running the correction if they are dealing with a pandas.Timedelta. Might be tricky to do without introducing a dependency on pandas, though.

First came across this while solving #24329 in #30995

Expected Output

IN:
t1
OUT:
Timestamp('2013-10-27 01:30:00+0000', tz='dateutil/GB-Eire')

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : ru_RU.UTF-8
LOCALE : None.None

pandas : 0.26.0.dev0+1947.gca3bfcc54.dirty
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200106
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.2.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.47.0
