Commit 308caca

DOC: write out a prose narrative of the proposed design

1 parent 3eccc70 commit 308caca

File tree

4 files changed: +267 -2 lines changed

data_prototype/wrappers.py (+4 -1)

@@ -126,6 +126,7 @@ def _query_and_transform(self, renderer, *, xunits: List[str], yunits: List[str]
         # actually query the underlying data. This returns both the (raw) data
         # and key to use for caching.
         bb_size = ax_bbox.size
+        # Step 1
         data, cache_key = self.data.query(
             # TODO do this needs to be (de) unitized
             # TODO figure out why caching this did not work
@@ -138,14 +139,16 @@ def _query_and_transform(self, renderer, *, xunits: List[str], yunits: List[str]
             return self._cache[cache_key]
         except KeyError:
             ...
-        # TODO decide if units go pre-nu or post-nu?
+
+        # Step 2
         for x_like in xunits:
             if x_like in data:
                 data[x_like] = ax.xaxis.convert_units(data[x_like])
         for y_like in yunits:
             if y_like in data:
                 data[y_like] = ax.xaxis.convert_units(data[y_like])

+        # Step 3
         # doing the nu work here is nice because we can write it once, but we
         # really want to push this computation down a layer
         # TODO sort out how this interoperates with the transform stack

docs/source/conf.py (+6 -1)

@@ -55,6 +55,7 @@
 plot_html_show_source_link = False
 plot_html_show_formats = False

+mathjax_path = "https://cdn.jsdelivr.net/npm/mathjax@2/MathJax.js?config=TeX-AMS-MML_HTMLorMML"

 # Generate the API documentation when building
 autosummary_generate = False
@@ -157,7 +158,7 @@ def matplotlib_reduced_latex_scraper(block, block_vars, gallery_conf, **kwargs):
 # The theme to use for HTML and HTML Help pages. See the documentation for
 # a list of builtin themes.
 #
-html_theme = "mpl_sphinx_theme"
+# html_theme = "mpl_sphinx_theme"


 # Theme options are theme-specific and customize the look and feel of a theme
@@ -250,4 +251,8 @@ def matplotlib_reduced_latex_scraper(block, block_vars, gallery_conf, **kwargs):
     "scipy": ("https://docs.scipy.org/doc/scipy/reference/", None),
     "pandas": ("https://pandas.pydata.org/pandas-docs/stable", None),
     "matplotlib": ("https://matplotlib.org/stable", None),
+    "networkx": ("https://networkx.org/documentation/stable", None),
 }
+
+
+default_role = 'obj'

docs/source/design.rst (+241, new file)
========
Design
========

When a Matplotlib :obj:`~matplotlib.artist.Artist` object is rendered via the
`~matplotlib.artist.Artist.draw` method, the following steps happen (in
spirit, but maybe not exactly in code):

1. get the data
2. convert the data from unit-full to unit-less
3. convert the unit-less data from user space to rendering space
4. call the backend rendering functions

If we were to call these steps :math:`f_1` through :math:`f_4`, this can be
expressed as (taking great liberties with the mathematical notation):

.. math::

   R = f_4(f_3(f_2(f_1())))

or, if you prefer,

.. math::

   R = (f_4 \circ f_3 \circ f_2 \circ f_1)()
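
The composition above can be sketched in plain Python. This is an illustrative toy, not the data_prototype API: the function names, the feet-to-meters unit handling, and the 100x100 "canvas" scale factor are all made-up stand-ins for the real machinery.

```python
def get_data():                      # Step 1: get the data
    return {"x": [0.0, 1.0, 2.0], "y": [0.0, 10.0, 20.0], "unit": "ft"}

def convert_units(data):             # Step 2: unit-full -> unit-less
    data = dict(data)
    if data.pop("unit") == "ft":
        data = {k: [v * 0.3048 for v in vs] for k, vs in data.items()}
    return data

def to_render_space(data):           # Step 3: user space -> rendering space
    # stand-in for the transform stack: scale onto a 100x100 "canvas"
    return {k: [v * 100 for v in vs] for k, vs in data.items()}

def render(data):                    # Step 4: hand off to the backend
    return f"drew {len(data['x'])} points"

# R = (f4 . f3 . f2 . f1)()
R = render(to_render_space(convert_units(get_data())))
```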

It is reasonable to expect that if we can do this for one ``Artist``, we can
build up more complex visualizations by rendering multiple ``Artist``\s to the
same target.

However, this clear structure is frequently elided and obscured in the
Matplotlib code base: Step 3 is only present for *x* and *y* like data (encoded
in the `~matplotlib.transforms.TransformNode` objects) and color-mapped data
(implemented in the `.matplotlib.colors.ScalarMappable` family of classes); the
application of Step 2 is inconsistent between artists (both in whether and in
when it is applied); and each ``Artist`` stores its data in its own way
(typically as numpy arrays).

With this view, we can understand the `~matplotlib.artist.Artist.draw` methods
to be very extensively `curried <https://en.wikipedia.org/wiki/Currying>`__
versions of these function chains, where the objects allow us to modify the
arguments to the functions.

The goal of this work is to bring this structure more to the foreground in the
internals of Matplotlib, making it easier to reason about, easier to extend,
and easier to inject custom logic into at each of the steps.

A paper with a formal mathematical description of these ideas is in
preparation.

Data pipeline
=============

Get the data (Step 1)
---------------------

Currently, almost all ``Artist`` classes store the data associated with them
as attributes on the instances, as `numpy.array` objects. On one hand this can
be very useful, as historically data was frequently already in `numpy.array`
objects and, if you know the right methods for *this* ``Artist``, you can
access that state to update or query it. From a certain point of view this is
consistent with the scheme laid out above, as ``self.x[:]`` is really
``self.x.__getitem__(slice(None))``, which is (technically) a function call.

However, this has several drawbacks. In most cases the data attributes on an
``Artist`` are closely linked -- the *x* and *y* on a
`~matplotlib.lines.Line2D` must be the same length -- and by storing them
separately it is possible that they will get out of sync in problematic ways.
Further, because the data is stored as materialized ``numpy`` arrays, we must
decide before draw time what the correct sampling of the data is. While there
are some projects like `grave <https://networkx.org/grave/>`__ that wrap
richer objects, or `mpl-modest-image
<https://github.com/ChrisBeaumont/mpl-modest-image>`__, `datashader
<https://datashader.org/getting_started/Interactivity.html#native-support-for-matplotlib>`__,
and `mpl-scatter-density <https://github.com/astrofrog/mpl-scatter-density>`__
that dynamically re-sample the data, these are niche libraries.

The first goal of this project is to bring support for draw-time resampling to
every Matplotlib ``Artist`` out of the box. The current approach is to move
all of the data storage off of the ``Artist`` and into a (so-called)
`~data_prototype.containers.DataContainer` instance. The primary method on
these objects is the `~data_prototype.containers.DataContainer.query` method,
which has the signature ::

   def query(
       self,
       transform: _Transform,
       size: Tuple[int, int],
   ) -> Tuple[Dict[str, Any], Union[str, int]]:

The query is passed:

- A transform from "Axes" to "data" (using Matplotlib's names for the `various
  coordinate systems
  <https://matplotlib.org/stable/tutorials/advanced/transforms_tutorial.html>`__)
- A notion of how big the axes is in "pixels", to provide guidance on the
  correct number of samples to return

It will return:

- A mapping of strings to things that are coercible (with the help of the
  functions in Steps 2 and 3) to numpy arrays or to types understandable by
  the backends
- A key that can be used for caching

This function will be called at draw time by the ``Artist`` to get the data to
be drawn. In the simplest cases
(e.g. `~data_prototype.containers.ArrayContainer` and
`~data_prototype.containers.DataFrameContainer`) the ``query`` method ignores
the input and returns the data as-is. However, based on these inputs it is
possible for the ``query`` method to compute the data limits, sample evenly in
screen space, and estimate the resolution of the visualization. This also
opens up several interesting possibilities:

1. "Pure function" containers (such as
   `~data_prototype.containers.FuncContainer`) which will dynamically sample a
   function at "a good resolution" for the current data limits and screen
   size.
2. A "resampling" container that either down-samples or slices the data it
   holds based on the current view limits.
3. A container that makes a network or database call and automatically
   refreshes the data as a function of time.
4. Containers that do binning or aggregation of the user data (such as
   `~data_prototype.containers.HistContainer`).
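
A "pure function" container of the kind described in item 1 might look roughly like the following sketch. The class name, the ``(lo, hi)`` view-limit argument, and the pixels-as-sample-count heuristic are illustrative assumptions, not the actual ``FuncContainer`` API.

```python
import numpy as np

class SketchFuncContainer:
    """Holds f(x) and samples it at draw time for the current view."""

    def __init__(self, func):
        self._func = func

    def query(self, view_limits, size):
        # view_limits stands in for the Axes->data transform; size is the
        # rough pixel extent used to pick "a good resolution".
        lo, hi = view_limits
        n_samples = size[0]
        x = np.linspace(lo, hi, n_samples)
        data = {"x": x, "y": self._func(x)}
        # the returned data is fully determined by the inputs, so the
        # inputs themselves make a valid cache key
        cache_key = (lo, hi, n_samples)
        return data, cache_key

container = SketchFuncContainer(np.sin)
data, key = container.query((0, 2 * np.pi), (512, 512))
```

Because the sampling happens inside ``query``, zooming the axes simply triggers a fresh query with new view limits and yields a re-sampled curve.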

By accessing all of the data needed for a draw in a single function call, the
``DataContainer`` instances can ensure that the data is coherent and
consistent. This is important for applications like streaming, where
different parts of the data may arrive at different rates and it is thus the
``DataContainer``'s responsibility to settle any race conditions and always
return aligned data to the ``Artist``.


There is still some ambiguity as to what should be put in the data. For
example, with `~matplotlib.lines.Line2D` it is clear that the *x* and *y* data
should be pulled from the ``DataContainer``, but things like *color* and
*linewidth* are ambiguous. A later section will make the case that it should
be possible, but maybe not required, for these values to be accessible in the
data context.

An additional task that the ``DataContainer`` can take on is describing the
type, shape, fields, and topology of the data it contains. For example, a
`~matplotlib.lines.Line2D` needs an *x* and *y* of the same length, but a
`~matplotlib.patches.StepPatch` (which is also a 2D line) needs an *x* that is
one longer than the *y*. The difference is that a ``Line2D`` is points with
values that can be continuously interpolated between, while a ``StepPatch`` is
bin edges with a constant value between the edges. This design lets us make
the implicit encoding of this sort of distinction in Matplotlib explicit and
lets us operate on it programmatically. The details of exactly how to encode
all of this still need to be developed. There is a
`~data_prototype.containers.DataContainer.describe` method, however it is the
most provisional part of the current design.
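
As a speculative sketch of the kind of shape metadata such a ``describe`` method could expose, consider descriptions with a symbolic length ``n`` that can relate fields to one another. Everything below is illustrative only; the real API is, as noted, still provisional.

```python
# Line2D-like: x and y must have the same length n
LINE_DESC = {"x": ("n",), "y": ("n",)}

# StepPatch-like: x is bin edges, one longer than y
STEP_DESC = {"x": ("n+1",), "y": ("n",)}

def check_shapes(desc, data):
    """Verify data against a description with a symbolic length 'n'."""
    n = len(data["y"])            # bind the symbol 'n' from the y field
    sizes = {"n": n, "n+1": n + 1}
    return all(
        len(data[field]) == sizes[shape[0]]
        for field, shape in desc.items()
    )

ok_line = check_shapes(LINE_DESC, {"x": [1, 2, 3], "y": [4, 5, 6]})
ok_step = check_shapes(STEP_DESC, {"x": [0, 1, 2, 3], "y": [4, 5, 6]})
```

A checker like this is what would let an ``Artist`` reject a container whose topology does not match what it knows how to draw, before any rendering happens.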


Unit conversion (Step 2)
------------------------

Real data almost always has some units attached to it. Historically, this
information has been carried "out of band" in the structure of the code or in
custom containers or data types that are unit-aware. The recent work on
``numpy`` to make ``np.dtype`` more easily extendable is likely to make
unit-full data much more common and easier to work with in the future.

In principle the user should be able to plot several sets of data, one in *ft*
and the other in *m*, then show the ticks in *in*, then switch to *cm*, and
have everything "just work" for all plot types. Currently we are very far
from this, due to some parts of the code eagerly converting to the unit-less
representation and not keeping the original, some parts failing to do the
conversion at all, some parts doing the conversion after coercing to ``numpy``
and losing the unit information, etc. Further, because the data access and
processing pipeline is implemented differently in every ``Artist``, keeping
this working is a constant game of whack-a-bug. If we adopt the consistent
``DataContainer`` model for accessing the data and call
`~data_prototype.containers.DataContainer.query` at draw time, we will have a
consistent place to also do the unit conversion.

The ``DataContainer`` can also carry inspectable information about what the
units of its data are, which would make it possible to verify ahead of time
that the data of all of the ``Artists`` in an ``Axes`` is consistent with the
unit converters on the ``Axis``.
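
Because all data now arrives through a single ``query`` call, the conversion of Step 2 has one natural home. The sketch below is a toy stand-in for Matplotlib's unit machinery: the converter registry and the per-field units mapping are assumptions made for illustration, not an existing API.

```python
converters = {            # unit -> function producing axis (meter) values
    "m": lambda v: v,
    "ft": lambda v: v * 0.3048,
}

def convert_query_result(data, units):
    """Convert each field of a query() result to unit-less axis values."""
    return {
        field: [converters[units[field]](v) for v in values]
        for field, values in data.items()
    }

# two datasets, one in meters and one in feet, end up on a common scale
a = convert_query_result({"x": [1.0, 2.0]}, {"x": "m"})
b = convert_query_result({"x": [3.2808, 6.5617]}, {"x": "ft"})
```

The point of the design is that this loop runs in exactly one place at draw time, instead of being re-implemented (or forgotten) per ``Artist``.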


Convert for rendering (Step 3)
------------------------------

The next step is to get the data from unit-less "user data" into something the
backend renderer understands. This can range from coordinate transformations
(as with the ``Transform`` stack operations on *x* and *y* like values), to
representation conversions (like named colors to RGB values), to mapping
strings to a set of objects (like named marker shapes), to parameterized type
conversion (like colormapping). Although Matplotlib currently does all of
these conversions, the user really only has control of the position and of the
colormapping (on `~matplotlib.colors.ScalarMappable` subclasses). The next
thing this design allows is for user-defined functions to be passed for any of
the relevant data fields.

This will open up paths to a number of nice things, such as multivariate
colormaps, lines whose width and color vary along their length, constant but
parameterized colors and linestyles, and a version of ``scatter`` where the
marker shape depends on the data. All of these things are currently possible
in Matplotlib, but require significant work before calling Matplotlib and can
be very difficult to update after the fact.
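
A user-supplied Step 3 hook of the kind described above might be as simple as a function that derives one field from another at draw time. The ``color_from_y`` name and the ``data`` dict shape here are hypothetical; the sketch only shows the idea of a per-field conversion function, such as one enabling a line whose color varies along its length.

```python
def color_from_y(data):
    """Map y into [0, 1] so it could be fed to a colormap."""
    ys = data["y"]
    lo, hi = min(ys), max(ys)
    span = (hi - lo) or 1.0     # avoid dividing by zero for constant y
    return [(y - lo) / span for y in ys]

data = {"x": [0.0, 1.0, 2.0], "y": [0.0, 5.0, 10.0]}
# at draw time, the "color" field is computed rather than stored
data["color"] = color_from_y(data)
```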


Pass to backend (Step 4)
------------------------

This part of the process is proposed to remain unchanged from current
Matplotlib. The calls to the underlying ``Renderer`` objects in the ``draw``
methods have stood the test of time, and changing them is out of scope for the
current work. In the future we may want to consider eliding Steps 3 and 4 in
some cases for performance reasons, to be able to push the computation down to
a GPU.


Caching
=======

A key to keeping this implementation efficient is being able to cache results
and to know when we have to re-compute values. Internally, current Matplotlib
has a number of ad-hoc caches, such as in ``ScalarMappable`` and ``Line2D``.
Going down the route of hashing all of the data is not a sustainable path: for
even modestly sized data, the time to hash the data will quickly outstrip any
possible time savings from the cache lookup! The proposed ``query`` method
instead returns to the caller a cache key that it generates. The exact
details of how to generate that key are left to the ``DataContainer``
implementation, but if the returned data changed, then the cache key must
change. The cache key should be computed from a combination of the
``DataContainer``'s internal state and the transform and size passed in.
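
The cache-key contract can be sketched as follows: the key is cheap to compute from the container's internal state plus the query inputs, and changes whenever the returned data would change, with no hashing of the data itself. The generation counter used here is an illustrative choice, not the data_prototype implementation.

```python
class SketchContainer:
    def __init__(self, data):
        self._data = dict(data)
        self._generation = 0      # bumped on every mutation

    def update(self, **fields):
        self._data.update(fields)
        self._generation += 1

    def query(self, transform, size):
        # key: internal-state version + the inputs that shaped the result
        cache_key = (self._generation, transform, size)
        return dict(self._data), cache_key

c = SketchContainer({"x": [1, 2, 3]})
_, key1 = c.query("identity", (640, 480))
_, key2 = c.query("identity", (640, 480))   # same state -> same key
c.update(x=[4, 5, 6])
_, key3 = c.query("identity", (640, 480))   # data changed -> new key
```

An ``Artist`` holding results keyed this way never needs to inspect the data to decide whether its cached Step 3 output is still valid.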

The choice to return the data and the cache key in one step, rather than as a
two-step process, is driven by simplicity and by the fact that the cache key
is computed inside of the ``query`` call. If computing the cache key is fast
and the data to be returned is "reasonable" for the machine Matplotlib is
running on (it needs to be, or we won't render!), then if it makes sense to
cache the results this can be done by the ``DataContainer`` and returned
straight away along with the computed key.

There will need to be some thought put into cache invalidation and size
management at the ``Artist`` layer. We also need to determine how many cache
layers to keep. Currently only the results of Step 3 are cached, but we may
want to additionally cache intermediate results after Step 2. The caching
from Step 1 is likely best left to the ``DataContainer`` instances.

docs/source/index.rst (+16)
@@ -12,13 +12,29 @@ repository should be considered experimental and used at you own risk.

 Source : https://github.com/matplotlib/data-prototype

+Design
+------
+.. toctree::
+   :maxdepth: 2
+
+   design.rst
+
+
 Examples
 --------

 .. toctree::
    :maxdepth: 2

    gallery/index.rst
+
+Reference
+---------
+
+
+.. toctree::
+   :maxdepth: 2
+
    api/index.rst

 Backmatter
