Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import math
import pandas as pd
# Current EWM implementation in pandas
def ema(df, e_time):
    return df.ewm(
        halflife=pd.to_timedelta(e_time),
        times=df.index,
    ).mean()
# Correct EWM calculation for reference
def ema_manual(df, e_time):
    it = df.items()
    _list = []
    # Seed the state with the first observation
    lt, s = next(it)
    _list.append(s)
    for t, v in it:
        # Decay factor based on the elapsed time: 0.5 ** ((t - lt) / halflife)
        q = math.exp2((lt - t) / e_time)
        s = s * q + (1 - q) * v
        lt = t
        _list.append(s)
    return pd.DataFrame(_list, index=df.index)
zeroes = 1_000_000
ms = 1_000_000  # 1 ms in nanoseconds (pd.to_datetime interprets ints as ns)
# Create 1,000,000 zeros spaced 1 ms apart, followed by a single 1 exactly
# 1 second after the last zero.
val = zeroes * [0] + [1]
id = list(range(0, zeroes * ms, ms))
id.append(id[-1] + 1000 * ms)
id = pd.to_datetime(id)
df = pd.DataFrame(val, columns=['val'], index=id)
# We would expect the EWM of this series with halflife=1s to be exactly 0.5,
# which is confirmed by the manual calculation.
print(ema_manual(df['val'], '1s'))
# Whereas pandas' EWM returns a number close to 0 (about 0.001384)
print(ema(df['val'], '1s'))
Issue Description
Pandas seems to have an issue with time series with uneven intervals. Consider the following example: 1 million 0s spaced 1 millisecond apart, followed by a single 1 after a 1-second gap:
1970-01-01 00:00:00.000 0
1970-01-01 00:00:00.001 0
1970-01-01 00:00:00.002 0
1970-01-01 00:00:00.003 0
1970-01-01 00:00:00.004 0
... 999,991 more items ...
1970-01-01 00:16:39.996 0
1970-01-01 00:16:39.997 0
1970-01-01 00:16:39.998 0
1970-01-01 00:16:39.999 0
----- 1 second gap -----
1970-01-01 00:16:40.999 1
[1000001 rows x 1 columns]
One would naively expect the exponential moving average of this series with a halflife of 1 second to equal exactly 0.5, given the equation:
w = 0.5 ** (dt / halflife) = 0.5 ** (1s / 1s)
y(t) = y(t - 1) * w + x * (1 - w)
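With the example's numbers (previous state 0, new value 1, and a 1-second gap, so w = 0.5), this would give:
y(t) = 0 * 0.5 + 1 * (1 - 0.5) = 0.5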
However, due to the way the adjustment factor is calculated, this is not what happens. Unfortunately, adjust=True works correctly only for evenly spaced time series, and in this situation it leads to an extremely small result of 0.001384. Also, unfortunately, adjust=False is disabled for calculations where the times argument is set.
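For reference, the reported 0.001384 is consistent with adjust=True weighting every past observation by 0.5 ** (elapsed_to_newest / halflife) and returning the weighted average, so the million closely spaced zeros dominate the denominator. A minimal sketch of that weighting (my reading of the adjusted formula, not actual pandas code):

import numpy as np
halflife = 1.0  # seconds
# Distances of the 1,000,000 zeros from the final sample: 1.000 s .. 1000.999 s
dt = 1.0 + 0.001 * np.arange(1_000_000)
w_zeros = 0.5 ** (dt / halflife)  # weights of the zeros
w_one = 1.0                       # weight of the final 1 (zero elapsed time)
print(w_one * 1.0 / (w_one + w_zeros.sum()))  # ~0.001384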
Expected Behavior
A result that is independent of sampling-frequency irregularities: increasing the sampling frequency in the early part of the series (where the 0s are measured) should not increase their weight in the result. Concretely, a result of 0.5 instead of 0.001384.
Installed Versions
INSTALLED VERSIONS
commit : 0f43794
python : 3.11.4.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:20 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 2.0.3
numpy : 1.25.1
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None