Description
Inspired by the timedelta plotting issue, I thought to look again at our timeseries plotting machinery. We know it is quite complex, and because of that several bugs, inconsistencies and unexpected behaviours exist (e.g. different results depending on the order in which several series are plotted, or wrong results when combining different types of time series; see among others #9053, #6608, #14322, ..).
There has been some discussion related to this on the tsplot refactor PR of @sinhrks #7670 (not merged).
One of the reasons for this complexity is the distinction between 'irregular' and 'regular' time series (see eg #7670 (comment)):
- 'regular' time series plotting is based on Periods, and is used for timeseries with a freq or inferred_freq (and also for periods)
- 'irregular' time series plotting is based on matplotlib's default handling of dates, i.e. converting to 'numerical values' (floats representing time in days since 0001-01-01, http://matplotlib.org/api/dates_api.html). You can always get this for regular timeseries as well by passing `x_compat=True`.
So part of the problems and confusion comes from the differences between the two (eg different label formatting) and from combining them. This leads to the question:
Do we need both types of timeseries plotting?
The question is why we convert a DatetimeIndex to periods for plotting. The reasons I can think of:
- Performance. Currently, the regular plotting is faster (so for a regular series `ts.plot()` is faster than `ts.plot(x_compat=True)`). However, I think this could be solved, as most of the time is spent in converting the datetimes to floats (which should be vectorizable).
- Nicer tick label locations and formatting. This is a clear plus: our (convoluted) tick locators and formatters give much nicer results than the default matplotlib ones (IMO).
Other reasons that I am missing?
But there are also clear drawbacks. Apart from the things mentioned above, you sometimes get clearly wrong behaviour: see eg the plot in #7670 (comment). In this case, dates that fall somewhere within a month are snapped to the month edges when a regular series with monthly frequency is plotted first.
Another example of 'wrong' plotting is a yearly series (but with freq 'A-dec', so end of year) being plotted at the beginning of the year. See http://nbviewer.jupyter.org/gist/jorisvandenbossche/c0c68dce2fa02f1dfc4a8c343ec88cb6. But of course, in many cases this behaviour can also be the desired behaviour.
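To make the month-edge snapping concrete, here is a minimal stdlib-only sketch. It is not the pandas plotting code: `month_ordinal` and `month_start` are hypothetical helpers that only mimic how a monthly period collapses every date within a month onto one integer (months counted from 1970-01).

```python
from datetime import date


def month_ordinal(d: date) -> int:
    """Hypothetical helper: number of months elapsed since 1970-01."""
    return (d.year - 1970) * 12 + (d.month - 1)


def month_start(ordinal: int) -> date:
    """Hypothetical helper: first day of the month identified by `ordinal`."""
    years, month0 = divmod(ordinal, 12)
    return date(1970 + years, month0 + 1, 1)


# An "irregular" series: observations somewhere inside each month.
irregular = [date(2016, 1, 4), date(2016, 1, 19), date(2016, 2, 11)]

# Forcing the dates onto a monthly axis discards the day within the month,
# so the two distinct January observations land on the same x position.
snapped = [month_start(month_ordinal(d)) for d in irregular]

for original, on_axis in zip(irregular, snapped):
    print(original, "->", on_axis)
```

Once the axis is defined in terms of a monthly frequency, any series plotted afterwards onto that axis can only be placed at month resolution, which is the kind of information loss described above.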
But do we need both? Would we want, if possible, to unify into one approach?
Can we unify both approaches?
Can we just use the matplotlib floats for timeseries plotting? Or always use the period-based machinery?
- Using matplotlib's float-based plotting
  - Do we want this? It will give slightly different behaviour for certain 'regular' cases.
  - This assumes we can implement a tick locator/formatter comparable to the period-based one. But this may be impossible, and might be the reason we have the current situation?
  - But we could keep the PeriodConverter purely for plotting actual Periods
  - Problem: float64 representing days can only give a precision of ~5µs, not up to 1ns (note: the period-based plotting cannot handle ns either, but can handle 1µs precision).
- Using period-based plotting for all timeseries
  - Do we want this? (deviates more from matplotlib -> larger difference in plotting dates with and without importing pandas)
  - What prevents us from converting an irregular timeseries to Periods? I would think we can find some common freq in almost all cases? (just a high-precision freq if needed)
- Or create a new converter based on datetime64[ns] (so int64)?
  - Instead of using matplotlib's floats, and instead of Periods with varying freq (at least for DatetimeIndex)
  - Again, assuming we can have nice tick label locators/formatters for this
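The precision trade-off between float days and int64 nanoseconds can be checked with a rough stdlib-only sketch. It assumes the "days since 0001-01-01" float representation described above and uses `math.ulp` (Python 3.9+); the variable names are illustrative only and nothing here is taken from the pandas or matplotlib converters.

```python
import math
from datetime import datetime

NS_PER_DAY = 86_400 * 10**9

dt = datetime(2016, 12, 1, 12, 0, 0)

# Float representation: days since 0001-01-01 (toordinal() counts that day as 1).
seconds_into_day = dt.hour * 3600 + dt.minute * 60 + dt.second
days = dt.toordinal() + seconds_into_day / 86_400

# Gap between adjacent float64 values at this magnitude, in microseconds.
ulp_us = math.ulp(days) * 86_400 * 1e6
# Worst-case rounding error is half that gap.
max_rounding_error_us = ulp_us / 2

# Adding one nanosecond (expressed in days) does not change the float at all.
one_ns_in_days = 1 / NS_PER_DAY
float_resolves_1ns = (days + one_ns_in_days) != days

# Integer representation: nanoseconds since 1970-01-01, as in datetime64[ns].
epoch_ordinal = datetime(1970, 1, 1).toordinal()
ns = (dt.toordinal() - epoch_ordinal) * NS_PER_DAY + seconds_into_day * 10**9
int_resolves_1ns = (ns + 1) != ns

print(f"float64 spacing: {ulp_us:.2f} µs, max rounding error: {max_rounding_error_us:.2f} µs")
print("float resolves 1 ns:", float_resolves_1ns)
print("int64 resolves 1 ns:", int_resolves_1ns, "| fits in int64:", ns < 2**63)
```

For present-day dates the float spacing comes out at roughly 10µs, i.e. a rounding error of about 5µs, while the integer representation keeps adjacent nanoseconds distinct and still fits in a signed 64-bit value.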
cc @pandas-dev/pandas-core (especially @TomAugspurger and @sinhrks, I think you have been most involved in the plotting code recently, or @wesm for a historical viewpoint)
I know it's a long issue, but if you could give it a read and share your thoughts, that would be very welcome!