|
| 1 | +======== |
| 2 | + Design |
| 3 | +======== |
| 4 | + |
| 5 | +When a Matplotlib :obj:`~matplotlib.artist.Artist` object is rendered via the `~matplotlib.artist.Artist.draw` method the following |
| 6 | +steps happen (in spirit but maybe not exactly in code): |
| 7 | + |
| 8 | +1. get the data |
| 9 | +2. convert from unit-full to unit-less data |
| 10 | +3. convert the unit-less data from user-space to rendering-space |
| 11 | +4. call the backend rendering functions |
| 12 | + |
| 13 | +.. |
| 14 | + If we were to call these steps :math:`f_1` through :math:`f_4` this can be expressed as (taking |
| 15 | + great liberties with the mathematical notation): |
| 16 | + |
| 17 | + .. math:: |
| 18 | + |
| 19 | + R = f_4(f_3(f_2(f_1()))) |
| 20 | + |
| 21 | + or if you prefer |
| 22 | + |
| 23 | + .. math:: |
| 24 | + |
| 25 | + R = (f_4 \circ f_3 \circ f_2 \circ f_1)() |
| 26 | + |
| 27 | + It is reasonable that if we can do this for one ``Artist``, we can build up |
| 28 | + more complex visualizations by rendering multiple ``Artist`` to the same |
| 29 | + target. |
| 30 | + |
| 31 | +However, this clear structure is frequently elided and obscured in the |
| 32 | +Matplotlib code base: Step 3 is only present for *x* and *y* like data (encoded |
| 33 | +in the `~matplotlib.transforms.TransformNode` objects) and color mapped data |
| 34 | +(implemented in the `~matplotlib.cm.ScalarMappable` family of classes); the |
| 35 | +application of Step 2 is inconsistent between artists (both in whether it is |
| 36 | +applied and in when it is applied); and each ``Artist`` stores its data in its |
| 37 | +own way (typically as numpy arrays). |
| 38 | + |
| 39 | +With this view, we can understand the `~matplotlib.artist.Artist.draw` methods |
| 40 | +to be very extensively `curried |
| 41 | +<https://en.wikipedia.org/wiki/Currying>`__ versions of |
| 42 | +these function chains where the objects allow us to modify the arguments to the |
| 43 | +functions. |
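As a rough illustration of this view (a sketch only, not Matplotlib code), the four steps can be written as plain functions whose arguments are bound ahead of time. Every function name and constant below is hypothetical, and ``functools.partial`` (partial application) stands in for currying:

```python
from functools import partial, reduce


def get_data():
    # Step 1: hypothetical data source; lengths in feet.
    return {"x": [0.0, 1.0, 2.0], "y": [0.0, 3.0, 6.0]}


def strip_units(data, *, factor):
    # Step 2: unit-full -> unit-less (here: feet -> meters via a scale factor).
    return {k: [v * factor for v in vals] for k, vals in data.items()}


def to_render_space(data, *, scale, offset):
    # Step 3: user space -> rendering space (a toy affine transform).
    return {k: [v * scale + offset for v in vals] for k, vals in data.items()}


def render(data):
    # Step 4: stand-in for the backend call; returns the "draw commands".
    return list(zip(data["x"], data["y"]))


def compose(*funcs):
    # compose(f, g, h)() == h(g(f()))
    first, *rest = funcs
    return lambda: reduce(lambda acc, f: f(acc), rest, first())


# A "draw" with all of its arguments bound ahead of time.
draw = compose(
    get_data,
    partial(strip_units, factor=0.3048),
    partial(to_render_space, scale=100.0, offset=10.0),
    render,
)
points = draw()
```

Changing what gets drawn then amounts to swapping one of the bound functions or its arguments, which is the role the ``Artist`` attributes play today.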
| 44 | + |
| 45 | +The goal of this work is to bring this structure more to the foreground in the internals of |
| 46 | +Matplotlib to make it easier to reason about, easier to extend, and easier to inject |
| 47 | +custom logic at each of the steps. |
| 48 | + |
| 49 | +A paper with the formal mathematical description of these ideas is in |
| 50 | +preparation. |
| 51 | + |
| 52 | +Data pipeline |
| 53 | +============= |
| 54 | + |
| 55 | +Get the data (Step 1) |
| 56 | +--------------------- |
| 57 | + |
| 58 | +Currently, almost all ``Artist`` classes store the data associated with them as |
| 59 | +attributes on the instances as `numpy.ndarray` objects. On one hand, this can |
| 60 | +be very useful as historically data was frequently already in `numpy.ndarray` |
| 61 | +objects and, if you know the right methods for *this* ``Artist``, you can access |
| 62 | +that state to update or query it. From a certain point of view, this is |
| 63 | +consistent with the scheme laid out above as ``self.x[:]`` is really |
| 64 | +``self.x.__getitem__(slice(None))`` which is (technically) a function call. |
| 65 | + |
| 66 | +However, this has several drawbacks. In most cases the data attributes on an |
| 67 | +``Artist`` are closely linked -- the *x* and *y* on a |
| 68 | +`~matplotlib.lines.Line2D` must be the same length -- and by storing them |
| 69 | +separately it is possible that they will get out of sync in problematic ways. |
| 70 | +Further, because the data is stored as materialized ``numpy`` arrays, we |
| 71 | +must decide before draw time what the correct sampling of the data is. While |
| 72 | +there are some projects like `grave <https://networkx.org/grave/>`__ that wrap |
| 73 | +richer objects or `mpl-modest-image |
| 74 | +<https://github.com/ChrisBeaumont/mpl-modest-image>`__, `datashader |
| 75 | +<https://datashader.org/getting_started/Interactivity.html#native-support-for-matplotlib>`__, |
| 76 | +and `mpl-scatter-density <https://github.com/astrofrog/mpl-scatter-density>`__ |
| 77 | +that dynamically re-sample the data, these are niche libraries. |
| 78 | + |
| 79 | +The first goal of this project is to bring support for draw-time resampling to |
| 80 | +every Matplotlib ``Artist`` out of the box. The current approach is to move |
| 81 | +all of the data storage off of the ``Artist`` directly and into a (so-called) |
| 82 | +`~data_prototype.containers.DataContainer` instance. The primary method on these objects |
| 83 | +is the `~data_prototype.containers.DataContainer.query` method which has the signature :: |
| 84 | + |
| 85 | + def query( |
| 86 | + self, |
| 87 | + transform: _Transform, |
| 88 | + size: Tuple[int, int], |
| 89 | + ) -> Tuple[Dict[str, Any], Union[str, int]]: |
| 90 | + |
| 91 | +The query is passed in: |
| 92 | + |
| 93 | +- A transform from "Axes" to "data" (using Matplotlib's names for the `various |
| 94 | + coordinate systems |
| 95 | + <https://matplotlib.org/stable/tutorials/advanced/transforms_tutorial.html>`__). |
| 96 | +- A notion of how big the axes is in "pixels" to provide guidance on what the correct number |
| 97 | + of samples to return is. |
| 98 | + |
| 99 | +It will return: |
| 100 | + |
| 101 | +- A mapping of strings to things that are coercible (with the help of the |
| 102 | + functions in Steps 2 and 3) to a numpy array or types understandable by the |
| 103 | + backends. |
| 104 | +- A key that can be used for caching. |
| 105 | + |
| 106 | +This function will be called at draw time by the ``Artist`` to get the data to |
| 107 | +be drawn. In the simplest cases |
| 108 | +(e.g. `~data_prototype.containers.ArrayContainer` and |
| 109 | +`~data_prototype.containers.DataFrameContainer`) the ``query`` method ignores |
| 110 | +the input and returns the data as-is. However, based on these inputs it is |
| 111 | +possible for the ``query`` method to get the data limits, even sampling in |
| 112 | +screen space, and an approximate estimate of the resolution of the |
| 113 | +visualization. This also opens up several interesting possibilities: |
| 114 | + |
| 115 | +1. "Pure function" containers (such as |
| 116 | + `~data_prototype.containers.FuncContainer`) which will dynamically sample a |
| 117 | + function at "a good resolution" for the current data limits and screen size. |
| 118 | +2. A "resampling" container that either down-samples or slices the data it holds based on |
| 119 | + the current view limits. |
| 120 | +3. A container that makes a network or database call and automatically refreshes the data |
| 121 | + as a function of time. |
| 122 | +4. Containers that do binning or aggregation of the user data (such as |
| 123 | + `~data_prototype.containers.HistContainer`). |
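A minimal sketch of the first possibility, assuming nothing about the real ``data_prototype`` classes: ``AxesToData`` is a hypothetical stand-in for the transform argument, and ``SampledFunction`` is a toy container that samples a function once per horizontal pixel and builds its cache key from the view limits and size.

```python
from typing import Any, Callable, Dict, List, Tuple, Union


class AxesToData:
    """Hypothetical stand-in for the transform argument: maps axes
    fractions in [0, 1] to data coordinates along x."""

    def __init__(self, xmin: float, xmax: float) -> None:
        self.xmin, self.xmax = xmin, xmax

    def transform_x(self, frac: float) -> float:
        return self.xmin + frac * (self.xmax - self.xmin)


class SampledFunction:
    """Toy container that samples ``func`` once per horizontal pixel."""

    def __init__(self, func: Callable[[float], float]) -> None:
        self._func = func

    def query(
        self, transform: AxesToData, size: Tuple[int, int]
    ) -> Tuple[Dict[str, Any], Union[str, int]]:
        n = max(int(size[0]), 2)  # at least two samples to span the view
        xs: List[float] = [transform.transform_x(i / (n - 1)) for i in range(n)]
        data = {"x": xs, "y": [self._func(x) for x in xs]}
        # The key changes whenever the view limits or the sample count change.
        key = f"{transform.xmin}:{transform.xmax}:{n}"
        return data, key


container = SampledFunction(lambda x: x * x)
coarse, key_a = container.query(AxesToData(0.0, 2.0), (5, 4))
fine, key_b = container.query(AxesToData(0.0, 2.0), (9, 4))
```

The same container returns more samples when asked about a wider canvas, and the two queries carry different cache keys, so a caller can tell the results apart without comparing the arrays.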
| 124 | + |
| 125 | +By accessing all of the data that is needed in draw in a single function call |
| 126 | +the ``DataContainer`` instances can ensure that the data is coherent and |
| 127 | +consistent. This is important for applications like streaming where different |
| 128 | +parts of the data may be arriving at different rates and it would thus be the |
| 129 | +``DataContainer``'s responsibility to settle any race conditions and always |
| 130 | +return aligned data to the ``Artist``. |
| 131 | + |
| 132 | + |
| 133 | +There is still some ambiguity as to what should be put in the data. For |
| 134 | +example with `~matplotlib.lines.Line2D` it is clear that the *x* and *y* data |
| 135 | +should be pulled from the ``DataContainer``, but things like *color* and |
| 136 | +*linewidth* are ambiguous. A later section will make the case that it should be |
| 137 | +possible, but maybe not required, that these values be accessible in the data |
| 138 | +context. |
| 139 | + |
| 140 | +An additional task that the ``DataContainer`` can do is to describe the type, |
| 141 | +shape, fields, and topology of the data it contains. For example a |
| 142 | +`~matplotlib.lines.Line2D` needs an *x* and *y* that are the same length, but |
| 143 | +`~matplotlib.patches.StepPatch` (which is also a 2D line) needs an *x* that is |
| 144 | +one longer than the *y*. The difference is that a ``Line2D`` is points with |
| 145 | +values which can be continuously interpolated between and ``StepPatch`` is bin |
| 146 | +edges with a constant value between the edges. This design lets us make |
| 147 | +explicit the implicit encoding of this sort of distinction in Matplotlib and be |
| 148 | +able to programmatically operate on it. The details of exactly how to encode |
| 149 | +all of this still need to be developed. There is a |
| 150 | +`~data_prototype.containers.DataContainer.describe` method, however it is the |
| 151 | +most provisional part of the current design. |
| 152 | + |
| 153 | + |
| 154 | +Unit conversion (Step 2) |
| 155 | +------------------------ |
| 156 | + |
| 157 | +Real data almost always has some units attached to it. Historically, this |
| 158 | +information has been carried "out of band" in the structure of the code or in |
| 159 | +custom containers or data types that are unit-aware. The recent work on ``numpy`` to |
| 160 | +make ``np.dtype`` more easily extensible is likely to make unit-full data much more |
| 161 | +common and easier to work with in the future. |
| 162 | + |
| 163 | +In principle the user should be able to plot two sets of data, one in *ft* and |
| 164 | +the other in *m*, then show the ticks in *in*, then switch to *cm*, and |
| 165 | +have everything "just work" for all plot types. Currently we are very far from |
| 166 | +this due to some parts of the code eagerly converting to the unit-less |
| 167 | +representation and not keeping the original, some parts of the code failing to |
| 168 | +do the conversion at all, some parts doing the conversion after coercing to |
| 169 | +``numpy`` and losing the unit information, etc. Further, because the data |
| 170 | +access and processing pipeline is done differently in every ``Artist`` it is a |
| 171 | +constant game of whack-a-bug to keep this working. If we adopt the consistent |
| 172 | +``DataContainer`` model for accessing the data and call |
| 173 | +`~data_prototype.containers.DataContainer.query` at draw time we will have a |
| 174 | +consistent place to also do the unit conversion. |
| 175 | + |
| 176 | +The ``DataContainer`` can also carry inspectable information about what the |
| 177 | +units of its data are, which would make it possible to do ahead-of-time |
| 178 | +verification that the data of all of the ``Artists`` in an ``Axes`` are |
| 179 | +consistent with unit converters on the ``Axis``. |
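To make the idea concrete, here is a small sketch of deferring unit conversion until draw time. It does not use Matplotlib's unit machinery; the ``TO_METERS`` registry, ``UnitfulContainer`` and ``draw_values`` are all hypothetical names invented for this example, and ``query`` is simplified to take no arguments.

```python
# Conversion factors to meters; a hypothetical registry for this sketch.
TO_METERS = {"m": 1.0, "cm": 0.01, "ft": 0.3048, "in": 0.0254}


def convert(values, from_unit, to_unit):
    """Convert a list of lengths between two units via meters."""
    factor = TO_METERS[from_unit] / TO_METERS[to_unit]
    return [v * factor for v in values]


class UnitfulContainer:
    """Toy container that keeps the original values and their unit."""

    def __init__(self, values, unit):
        self.values = list(values)
        self.unit = unit

    def query(self):
        # Return the data untouched; conversion happens later, at draw time.
        return {"y": self.values, "unit": self.unit}


def draw_values(container, axis_unit):
    """Steps 1 and 2: fetch, then convert to whatever the axis shows now."""
    data = container.query()
    return convert(data["y"], data["unit"], axis_unit)


heights_ft = UnitfulContainer([1.0, 2.0], "ft")
heights_m = UnitfulContainer([1.0], "m")

in_inches = draw_values(heights_ft, "in")  # axis currently showing inches
in_cm = draw_values(heights_m, "cm")       # same pipeline, axis switched to cm

# Ahead-of-time check: every container's unit is known to the converter.
units_ok = all(c.unit in TO_METERS for c in (heights_ft, heights_m))
```

Because the containers never discard the original values or their units, switching the axis unit only changes the argument passed at draw time.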
| 180 | + |
| 181 | + |
| 182 | +Convert for rendering (Step 3) |
| 183 | +------------------------------ |
| 184 | + |
| 185 | +The next step is to get the data from unit-less "user data" into something that |
| 186 | +the backend renderer understands. This can range from coordinate |
| 187 | +transformations (as with the ``Transform`` stack operations on *x* and *y* like |
| 188 | +values), representation conversions (like named colors to RGB values), mapping |
| 189 | +strings to a set of objects (like named marker shapes), to parameterized type |
| 190 | +conversion (like colormapping). Although Matplotlib is currently doing all of |
| 191 | +these conversions, the user really only has control of the position and |
| 192 | +colormapping (on `~matplotlib.cm.ScalarMappable` sub-classes). The next |
| 193 | +thing that this design allows is for user defined functions to be passed for |
| 194 | +any of the relevant data fields. |
| 195 | + |
| 196 | +This will open up paths to do a number of nice things such as multi-variate |
| 197 | +color maps, lines whose width and color vary along their length, constant but |
| 198 | +parameterized colors and linestyles, and a version of ``scatter`` where the |
| 199 | +marker shape depends on the data. All of these things are currently possible |
| 200 | +in Matplotlib, but require significant work before calling Matplotlib and can |
| 201 | +be very difficult to update after the fact. |
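One way to picture user-defined functions for arbitrary data fields is a table of converters applied just before rendering. This is only a sketch: ``convert_for_rendering``, the converter table layout, and the two example functions are hypothetical and not part of Matplotlib or the prototype.

```python
def width_from_speed(speed):
    # Hypothetical user function: faster segments are drawn thicker.
    return [0.5 + 0.1 * s for s in speed]


def color_from_sign(values):
    # Hypothetical user function: map the sign of a value to a named color.
    return ["tab:red" if v < 0 else "tab:blue" for v in values]


def convert_for_rendering(data, converters):
    """Apply one converter per output field.

    ``converters`` maps an output field name to ``(input_field, function)``.
    Fields without a converter pass through unchanged.
    """
    out = dict(data)
    for out_field, (in_field, func) in converters.items():
        out[out_field] = func(data[in_field])
    return out


data = {"x": [0, 1, 2], "y": [1.0, -2.0, 0.5], "speed": [0.0, 5.0, 10.0]}
converters = {
    "linewidth": ("speed", width_from_speed),
    "color": ("y", color_from_sign),
}
ready = convert_for_rendering(data, converters)
```

Updating the visualization after the fact then means replacing an entry in ``converters`` rather than recomputing arrays by hand before calling Matplotlib.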
| 202 | + |
| 203 | +Pass to backend (Step 4) |
| 204 | +------------------------ |
| 205 | + |
| 206 | +This part of the process is proposed to remain unchanged from current |
| 207 | +Matplotlib. The calls to the underlying ``Renderer`` objects in ``draw`` |
| 208 | +methods have stood the test of time and changing them is out of scope for the |
| 209 | +current work. In the future we may want to consider eliding Steps 3 and 4 in |
| 210 | +some cases for performance reasons to be able to push the computation down to a |
| 211 | +GPU. |
| 212 | + |
| 213 | + |
| 214 | +Caching |
| 215 | +======= |
| 216 | + |
| 217 | +A key to keeping this implementation efficient is to be able to cache values and |
| 218 | +know when we have to re-compute them. Internally current Matplotlib has a number of |
| 219 | +ad-hoc caches, such as in ``ScalarMappable`` and ``Line2D``. Going down the |
| 220 | +route of hashing all of the data is not a sustainable path (in the case of even |
| 221 | +modestly sized data the time to hash the data will quickly outstrip any |
| 222 | +possible time savings from doing the cache lookup!). The proposed ``query`` method |
| 223 | +returns a cache key that it generates to the caller. The exact details of how |
| 224 | +to generate that key are left to the ``DataContainer`` implementation, but if |
| 225 | +the returned data changed, then the cache key must change. The cache key |
| 226 | +should be computed from a combination of the ``DataContainer``'s internal state, |
| 227 | +the transform and size passed in. |
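The sketch below shows one possible way to produce such a key without hashing the data: a version counter that is bumped on every mutation. This is an assumption made for illustration, not the prototype's implementation; ``VersionedContainer``, ``CachingArtist`` and the ``transform_id`` argument (a hashable summary of the transform) are hypothetical.

```python
class VersionedContainer:
    """Toy container whose cache key avoids hashing the data itself."""

    def __init__(self, x, y):
        self._x, self._y = list(x), list(y)
        self._version = 0  # bumped on every mutation

    def set_data(self, x, y):
        if len(x) != len(y):
            raise ValueError("x and y must have the same length")
        self._x, self._y = list(x), list(y)
        self._version += 1

    def query(self, transform_id, size):
        # ``transform_id`` is a hashable summary of the transform (an
        # assumption of this sketch), e.g. the current view limits.
        data = {"x": self._x, "y": self._y}
        key = f"{id(self)}:{self._version}:{transform_id}:{size}"
        return data, key


class CachingArtist:
    """Toy artist that redoes its Step 3 work only when the key changes."""

    def __init__(self, container):
        self.container = container
        self._cache = {}
        self.recomputed = 0

    def draw(self, transform_id, size):
        data, key = self.container.query(transform_id, size)
        if key not in self._cache:
            self.recomputed += 1
            converted = [(x * 2, y * 2) for x, y in zip(data["x"], data["y"])]
            self._cache = {key: converted}  # keep only the latest entry
        return self._cache[key]


artist = CachingArtist(VersionedContainer([0, 1], [1, 2]))
artist.draw((0, 1), (640, 480))
artist.draw((0, 1), (640, 480))            # same key: cache hit
artist.container.set_data([0, 1], [3, 4])
latest = artist.draw((0, 1), (640, 480))   # version bumped: recomputed
```

Keeping a single cache entry sidesteps the size-management question for this toy case; a real implementation would need a deliberate eviction policy.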
| 228 | + |
| 229 | +The choice to return the data and cache key in one step, rather than be a two |
| 230 | +step process, is driven by simplicity and because the cache key is computed |
| 231 | +inside of the ``query`` call. If computing the cache key is fast and the data |
| 232 | +to be returned is "reasonable" for the machine Matplotlib is running on (it |
| 233 | +needs to be or we won't render!), then if it makes sense to cache the results |
| 234 | +it can be done by the ``DataContainer`` and returned straight away along with |
| 235 | +the computed key. |
| 236 | + |
| 237 | +There will need to be some thought put into cache invalidation and size |
| 238 | +management at the ``Artist`` layer. We also need to determine how many cache |
| 239 | +layers to keep. Currently only the results of Step 3 are cached, but we may |
| 240 | +want to additionally cache intermediate results after Step 2. The caching from |
| 241 | +Step 1 is likely best left to the ``DataContainer`` instances. |