EWMA weighted by time with adjust=True is flawed, and adjust=False is not supported #54328

Open
@hleumas

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import math
import pandas as pd

# Current EWM implementation in pandas
def ema(df, e_time):
    return df.ewm(
        halflife=pd.to_timedelta(e_time),
        times=df.index,
    ).mean()

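# Correct EWM calculation for reference: recursive update with time-based decay
def ema_manual(df, e_time):
    halflife = pd.to_timedelta(e_time)
    it = df.items()
    _list = []

    # Seed the running average with the first observation
    lt, s = next(it)
    _list.append(s)

    for t, v in it:
        # q = 0.5 ** (dt / halflife); (lt - t) is negative, so exp2 decays
        q = math.exp2((lt - t) / halflife)
        s = s * q + (1 - q) * v
        lt = t
        _list.append(s)

    return pd.DataFrame(_list, index=df.index)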

zeroes = 1_000_000
ms = 1_000_000  # 1 ms expressed in nanoseconds

# Create 1,000,000 0s spaced 1 ms apart, followed by a single 1 exactly
# 1 second after the last 0.
val = zeroes * [0] + [1]
idx = list(range(0, zeroes * ms, ms))
idx.append(idx[-1] + 1000 * ms)
idx = pd.to_datetime(idx)  # integers are read as nanoseconds since the epoch
df = pd.DataFrame(val, columns=['val'], index=idx)


# We would expect the final EWM value of this series with halflife=1s to be
# exactly 0.5, which is confirmed by the manual calculation
print(ema_manual(df['val'], '1s'))

# Whereas pandas EWM returns a number close to 0 (0.001384)
print(ema(df['val'], '1s'))

Issue Description

pandas seems to mishandle time series with uneven intervals. Consider the following example:

1 million 0s spaced 1 millisecond apart, followed by a single 1 after a 1-second gap:

1970-01-01 00:00:00.000    0
1970-01-01 00:00:00.001    0
1970-01-01 00:00:00.002    0
1970-01-01 00:00:00.003    0
1970-01-01 00:00:00.004    0
...  999,991 more items  ...
1970-01-01 00:16:39.996    0
1970-01-01 00:16:39.997    0
1970-01-01 00:16:39.998    0
1970-01-01 00:16:39.999    0
-----   1 second gap   -----
1970-01-01 00:16:40.999    1

[1000001 rows x 1 columns]

One would naively expect the exponential moving average with a halflife of 1 second to equal exactly 0.5 at the last point, assuming the following equations:

w    = 0.5 ** (dt / halflife) = 0.5 ** (1s / 1s) = 0.5
y(t) = y(t - 1) * w + x(t) * (1 - w)

However, due to the way the adjustment factor is calculated, this is not the case. With adjust=True the calculation is correct only for evenly spaced time series; in this situation it yields an extremely small result of 0.001384.

Unfortunately, adjust=False is not supported for calculations where the times argument is set.
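To make the gap between the two numbers concrete, here is a minimal sketch of the arithmetic in plain Python. It assumes that adjust=True with times amounts to a normalized weighted average in which each sample keeps the weight 0.5 ** (age / halflife). That form is an assumption for illustration, not taken from the pandas source, although it lands on the 0.001384 reported above. The variable names are made up for this sketch.

```python
# Rebuild the example in plain seconds (no pandas needed for the arithmetic).
halflife = 1.0   # seconds
n_zeros = 1_000_000
step = 1e-3      # 1 ms spacing between the zeros

times = [k * step for k in range(n_zeros)]
times.append(times[-1] + 1.0)  # the single 1 arrives after a 1 s gap
values = [0.0] * n_zeros + [1.0]

# Assumed adjust=True form: every sample keeps a weight that decays with age,
# and the result is the weighted mean over all samples seen so far.
t_last = times[-1]
weights = [0.5 ** ((t_last - t) / halflife) for t in times]
adjusted = sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Recursive form from the equations above: the zeros leave y at 0,
# so only the final 1 s step matters.
w = 0.5 ** (1.0 / halflife)
recursive = 0.0 * w + 1.0 * (1 - w)

print(f"adjusted  = {adjusted:.6f}")
print(f"recursive = {recursive:.6f}")
```

Under that assumption, the roughly 1,000 zeros that fall within the last halflife before the gap together carry a weight of several hundred, against a weight of 1 for the final sample. This is why the densely sampled zeros dominate the result.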

Expected Behavior

The result should be independent of irregularities in the sampling frequency. In particular, increasing the sampling frequency in the early period (where the 0s are measured) should not increase the weight of those samples in the result.

Result of 0.5 instead of 0.001384.
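The sampling-independence requirement can be checked with a small experiment. The sketch below compares the recursive update with a normalized weighted average on two series that cover the same 10 seconds of zeros at different sampling rates. The helper names (`recursive_ewm`, `weighted_ewm`, `make_series`) are hypothetical, and the weighted form is only assumed to mirror adjust=True.

```python
def recursive_ewm(times, values, halflife):
    """Final value of y(t) = y(t - 1) * w + x(t) * (1 - w), w = 0.5 ** (dt / halflife)."""
    y, prev = values[0], times[0]
    for t, x in zip(times[1:], values[1:]):
        w = 0.5 ** ((t - prev) / halflife)
        y = y * w + x * (1 - w)
        prev = t
    return y


def weighted_ewm(times, values, halflife):
    """Final value of a normalized weighted mean (assumed adjust=True behaviour)."""
    weights = [0.5 ** ((times[-1] - t) / halflife) for t in times]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def make_series(n_zeros, step):
    """n_zeros zeros spaced `step` seconds apart, then a single 1 after a 1 s gap."""
    times = [k * step for k in range(n_zeros)]
    times.append(times[-1] + 1.0)
    return times, [0.0] * n_zeros + [1.0]


dense = make_series(10_000, 0.001)  # zeros every 1 ms
sparse = make_series(1_000, 0.01)   # zeros every 10 ms

for name, (times, values) in {"dense": dense, "sparse": sparse}.items():
    rec = recursive_ewm(times, values, 1.0)
    wgt = weighted_ewm(times, values, 1.0)
    print(f"{name:6s} recursive={rec:.6f} weighted={wgt:.6f}")
```

The recursive value stays at about 0.5 for both series. The weighted value changes by roughly a factor of ten when the zeros are sampled ten times more densely.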

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.11.4.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:20 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.0.3
numpy : 1.25.1
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Labels

Bug, Window (rolling, ewma, expanding)
