Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
from pandas._libs.tslibs import to_offset
# 1
# Freq argument is ignored when using different multiple
hourly = to_offset("H")
p = pd.Period("2020-01-01", freq="24H")
assert pd.Period(p, hourly).freq == to_offset("24H")
# 2
# asfreq shifts value, even when using same frequency
p = pd.Period("2020-01-01", freq="24H")
assert p != p.asfreq(p.freq)
# also, consider this example
dr = pd.date_range("2020", freq="2d", periods=3)
s1 = dr.to_series().asfreq(dr.freq).to_period()
s2 = dr.to_series().to_period().asfreq(dr.freq)
# one would expect s1 and s2 to be the same, but of course not!
>>> s1
2020-01-01 2020-01-01
2020-01-03 2020-01-03
2020-01-05 2020-01-05
Freq: 2D, dtype: datetime64[ns]
>>> s2
2020-01-02 2020-01-01
2020-01-04 2020-01-03
2020-01-06 2020-01-05
Freq: 2D, dtype: datetime64[ns]
# 3
# When providing two periods to period_range, only the start of end is taken into consideration
pr = pd.period_range(pd.Period("2020-01-01 00:00", "6H"), pd.Period("2020-01-01 18:00", "6H"), freq="H")
pr[0] == "2020-01-01 0:00"
pr[-1] == "2020-01-01 18:00" # why not 23:00?
len(pr) == 19
# which of course is inconsistent with
pr = pd.period_range(pd.Period("2020Q1", "Q"), pd.Period("2020Q2", "Q"), freq="M")
pr[0] == "2020-03" # why not 2020-01?
pr[-1] == "2020-06"
# which then again behaves differently from
dr = pd.date_range(pd.Timestamp("2020Q1", "Q"), pd.Timestamp("2020Q2", "Q"), freq="M")
dr[0] == "2020-01-31"
dr[-1] == "2020-03-31"
Issue Description
The behaviour of Period and period_range is very surprising and inconsistent.
It is inconsistent in itself, but also when comparing period_range with date_range.
See also: #47465
Expected Behavior
I would naively expect that a Period represents a time range. There is a start where the period begins and an end where it ends:
p = pd.Period("2020-01-01", "2d")
Here p represents everything on the first two days of 2020.
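To make that expectation concrete, here is a minimal sketch using only the standard library. It is not pandas code: Span and its methods are hypothetical names, and modelling the period as a half-open interval [start, end) is an assumption about what "time range" means here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a period: the half-open interval [start, end)."""

    start: datetime
    length: timedelta

    @property
    def end(self) -> datetime:
        # Exclusive end: the first instant that no longer belongs to the span.
        return self.start + self.length

    def __contains__(self, moment: datetime) -> bool:
        return self.start <= moment < self.end

    def shift(self, n: int) -> "Span":
        # Analogue of `period + n`: move by n whole lengths.
        return Span(self.start + n * self.length, self.length)


# Analogue of pd.Period("2020-01-01", "2d") under this model.
p = Span(datetime(2020, 1, 1), timedelta(days=2))

print(datetime(2020, 1, 2, 23, 59) in p)  # inside the first two days
print(datetime(2020, 1, 3) in p)          # first instant after the span
print(p.shift(1).start == p.end)          # consecutive spans tile the time axis
```

Under this model, p + 1 starts exactly where p ends, which is the property the following examples rely on.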
If I use period_range, I would expect it to take the entire range of start and end into account:
start = p
end = p + 1
pr = pd.period_range(start, end, freq="2D")
assert pr[-1].end_time == end.end_time
So far so good. Let's try a different frequency:
pr2 = pd.period_range(start, end, freq="D")
assert pr2[-1].end_time == end.end_time # Fails
How naive of me! Of course the second argument is neither inclusive nor exclusive when generating the range, but a happy mix of both:
pr2[-1] == end.asfreq("D", "S") # note how neither using Period nor .asfreq("D") would work
The new range includes everything from start.start_time until pd.Period(end.start_time, "D").end_time, just as one would expect.
The rules are clear now.
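The difference between the expected and the observed daily range can be sketched without pandas. The helper expected_range below is hypothetical (it is not a pandas function) and treats every period as a half-open interval, which is an assumption of this sketch. The "observed" variant simply encodes the rule described above: the range stops at the day that contains end.start_time.

```python
from datetime import datetime, timedelta


def expected_range(first_start, last_end, step):
    """Split [first_start, last_end) into consecutive pieces of length `step`.

    Hypothetical helper, not pandas API. `first_start` plays the role of
    start.start_time and `last_end` is the exclusive end of the last period.
    """
    pieces = []
    cursor = first_start
    while cursor + step <= last_end:
        pieces.append((cursor, cursor + step))
        cursor += step
    return pieces


one_day = timedelta(days=1)
two_days = timedelta(days=2)

start = datetime(2020, 1, 1)      # start of the first 2-day period
end_start = start + two_days      # start of the second 2-day period (p + 1)
end_end = end_start + two_days    # exclusive end of the second period

# Expected: the daily range covers both 2-day periods completely.
expected = expected_range(start, end_end, one_day)

# Observed rule: the range stops after the day containing end.start_time.
observed_like = expected_range(start, end_start + one_day, one_day)

print(len(expected), len(observed_like))
```

With these inputs the expected range has four daily pieces and ends where the second 2-day period ends, while the observed rule yields only three.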
So let's just try a different example.
start = pd.Period("2020Q1", "Q")
end = pd.Period("2020Q2", "Q")
pr3 = pd.period_range(start, end, freq="M")
We know: the start of pr3 should be start.start_time and the end should be pd.Period(end.start_time, "D").end_time:
pr3[0] == "2020-03"
pr3[-1] == "2020-06"
🤯
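For reference, this is what covering the entire range of both quarters would look like at monthly resolution. It is a plain-Python sketch: months_in_quarters is a hypothetical helper, and it assumes calendar quarters (Q1 = January to March), which is how "2020Q1" is read in this issue.

```python
def months_in_quarters(year, first_quarter, last_quarter):
    """List the (year, month) pairs covered by the given quarters, inclusive.

    Hypothetical helper assuming calendar quarters within a single year.
    """
    first_month = 3 * (first_quarter - 1) + 1
    last_month = 3 * last_quarter
    return [(year, month) for month in range(first_month, last_month + 1)]


# Months one would expect from 2020Q1 through 2020Q2.
expected_months = months_in_quarters(2020, 1, 2)

print(expected_months[0], expected_months[-1], len(expected_months))
```

Under that reading the range would run from 2020-01 to 2020-06 (six months), whereas the result reported above starts at 2020-03.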