Commit a471301
Add a prototype of the dataframe interchange protocol
Related to requirements in gh-35. TBD (to be discussed) comments and design decisions at the top of the file indicate topics for closer review/discussion.
1 parent 6af8c2a commit a471301

File tree: 1 file changed

protocol/dataframe_protocol.py (+380, -0)
"""
Specification for objects to be accessed, for the purpose of dataframe
interchange between libraries, via the ``__dataframe__`` method on a
library's data frame object.

For guiding requirements, see https://github.com/data-apis/dataframe-api/pull/35

Design decisions
----------------

**1. Use a separate column abstraction in addition to a dataframe interface.**

Rationales:
- This is how it works in R, Julia and Apache Arrow.
- Semantically, most existing applications and users treat a column similarly
  to a 1-D array.
- We should be able to connect a column to the array data interchange
  mechanism(s).

Note that this does not imply that a library must have such a public
user-facing abstraction (e.g. ``pandas.Series``); the column object may be
accessible only via ``__dataframe__``.

**2. Use methods and properties on an opaque object rather than returning
hierarchical dictionaries describing memory.**

This is better for implementations that may rely on, for example, lazy
computation. Another small detail: plain attributes cannot be checked for
(e.g. with ``hasattr``) without side effects.

**3. No row names. If a library uses row names, use a regular column for them.**

See discussion at https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241
Optional row names are not a good idea, because people will assume they're
present (see the cuDF experience: it was forced to add them because pandas has
them). Requiring row names seems worse than leaving them out.
"""
from __future__ import annotations

from typing import Any, Dict, Iterable, Optional, Sequence, Tuple
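
# A minimal illustration (not part of the spec) of the ``hasattr`` side effect
# noted in design decision 2: probing a property executes its getter, so on a
# lazy dataframe object it could trigger computation. ``_LazyExample`` is a
# hypothetical name used only for this example.
class _LazyExample:
    @property
    def values(self):
        # Imagine an expensive, lazy materialization happening here.
        return [1, 2, 3]

# ``hasattr(_LazyExample(), "values")`` returns True, but only after running
# the getter above, which is why methods on an opaque object are preferred.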

class Buffer:
    """
    Data in the buffer is guaranteed to be contiguous in memory.
    """

    @property
    def bufsize(self) -> int:
        """
        Buffer size in bytes.
        """
        pass

    @property
    def ptr(self) -> int:
        """
        Pointer to the start of the buffer as an integer.
        """
        pass

    def __dlpack__(self):
        """
        Produce a DLPack capsule (see the array API standard).

        Raises:

        - TypeError : if the buffer contains unsupported dtypes.
        - NotImplementedError : if DLPack support is not implemented.

        Useful to have in order to connect to array libraries. Support is
        optional, because it's not completely trivial to implement for a
        Python-only library.
        """
        raise NotImplementedError("__dlpack__")

    def __array_interface__(self):
        """
        TBD: implement or not? Will work for all dtypes except bit masks.
        """
        raise NotImplementedError("__array_interface__")
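
# A producer-side sketch (illustrative only, not part of the spec) of backing
# the Buffer interface with a NumPy array. ``_NumPyBuffer`` is a hypothetical
# name used here for illustration.
import numpy as np


class _NumPyBuffer(Buffer):
    def __init__(self, x: np.ndarray) -> None:
        # The protocol guarantees contiguous memory, so reject anything else.
        if not x.flags.contiguous:
            raise ValueError("Data in a Buffer must be contiguous in memory")
        self._x = x

    @property
    def bufsize(self) -> int:
        # Size of the underlying memory block, in bytes.
        return self._x.nbytes

    @property
    def ptr(self) -> int:
        # Integer address of the start of the array's data.
        return self._x.ctypes.data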

class Column:
    """
    A column object, with only the methods and properties required by the
    interchange protocol defined.

    A column can contain one or more chunks. Each chunk can contain either
    one or two buffers - one data buffer and (depending on the null
    representation) it may have a mask buffer.

    TBD: Arrow has a separate "null" dtype, and has no separate mask concept.
         Instead, it seems to use "children" for both columns with a bit mask,
         and for nested dtypes. Unclear whether this is elegant or confusing.
         This design requires checking the null representation explicitly.

         The Arrow design requires checking:
         1. the ARROW_FLAG_NULLABLE (for sentinel values)
         2. if a column has two children, combined with one of those children
            having a null dtype.

         Making the mask concept explicit seems useful.

    TBD: there's also the "chunk" concept here, which is implicit in Arrow as
         multiple buffers per array (= column here). Semantically it may make
         sense to have both: chunks were meant, for example, for lazy
         evaluation of data which doesn't fit in memory, while multiple
         buffers per column could also come from doing a selection operation
         on a single contiguous buffer.

         Given these concepts, one would expect chunks to be all of the same
         size (say a 10,000 row dataframe could have 10 chunks of 1,000
         rows), while multiple buffers could have data-dependent lengths. Not
         an issue in pandas if one column is backed by a single NumPy array,
         but in Arrow it seems possible.
         Are multiple chunks *and* multiple buffers per column necessary for
         the purposes of this interchange protocol, or can producers either
         reuse the chunk concept for this or copy the data?

    Note: this Column object can only be produced by ``__dataframe__``, so it
          doesn't need its own version or ``__column__`` protocol.
    """

    @property
    def name(self) -> str:
        pass

    @property
    def size(self) -> Optional[int]:
        """
        Size of the column, in elements.

        Corresponds to ``DataFrame.num_rows()`` if the column is a single
        chunk; equal to the size of the current chunk otherwise.
        """
        pass

    @property
    def offset(self) -> int:
        """
        Offset of the first element.

        May be > 0 if using chunks; for example for a column with N chunks of
        equal size M (only the last chunk may be shorter),
        ``offset = n * M``, ``n = 0 .. N-1``.
        """
        pass
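
    # Illustrative example (not normative): a column stored as N = 3 chunks
    # of M = 1,000 elements each would report offsets 0, 1000 and 2000 for
    # its successive chunks.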

    @property
    def dtype(self) -> Tuple[int, int, str, str]:
        """
        Dtype description as a tuple ``(kind, bit-width, format string,
        endianness)``.

        Kind :

        - 0 : signed integer
        - 1 : unsigned integer
        - 2 : IEEE floating point
        - 20 : boolean
        - 21 : string (UTF-8)
        - 22 : datetime
        - 23 : categorical

        Bit-width : the number of bits as an integer
        Format string : data type description format string in Apache Arrow C
                        Data Interface format
        Endianness : currently only native endianness (``=``) is supported

        Notes:

        - Kind specifiers are aligned with DLPack where possible (hence the
          jump to 20, leaving enough room for future extension).
        - Masks must be specified as boolean with either bit width 1 (for bit
          masks) or 8 (for byte masks).
        - Dtype width in bits was preferred over bytes.
        - Endianness isn't too useful now, but it is included in case we need
          to support non-native endianness in the future.
        - Went with Apache Arrow format strings over NumPy format strings,
          because they're more complete from a dataframe perspective.
        - Format strings are mostly useful for datetime specification, and
          for categoricals.
        - For categoricals, the format string describes the type of the
          categorical in the data buffer. In case of a separate encoding of
          the categorical (e.g. an integer-to-string mapping), this can
          be derived from ``self.describe_categorical``.
        - Data types not included: complex, Arrow-style null, binary, decimal,
          and nested (list, struct, map, union) dtypes.
        """
        pass
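
    # Illustrative examples (not normative; format strings follow the Arrow C
    # Data Interface): an int64 column would be described as
    # ``(0, 64, "l", "=")``, a float32 column as ``(2, 32, "f", "=")``, and a
    # bit-packed boolean mask as ``(20, 1, "b", "=")``.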

    @property
    def describe_categorical(self) -> Dict[str, Any]:
        """
        If the dtype is categorical, there are two options:

        - There are only values in the data buffer.
        - There is a separate dictionary-style encoding for categorical
          values.

        Raises RuntimeError if the dtype is not categorical.

        Content of the returned dict:

        - "is_ordered" : bool, whether the ordering of dictionary indices is
                         semantically meaningful.
        - "is_dictionary" : bool, whether a dictionary-style mapping of
                            categorical values to other objects exists.
        - "mapping" : dict, Python-level only (e.g. ``{int: str}``).
                      None if not a dictionary-style categorical.

        TBD: are there any other in-memory representations that are needed?
        """
        pass
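
    # Illustrative example (not normative) for a dictionary-encoded
    # categorical with ordered categories "a" < "b":
    # ``{"is_ordered": True, "is_dictionary": True, "mapping": {0: "a", 1: "b"}}``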

    @property
    def describe_null(self) -> Tuple[int, Any]:
        """
        Return the missing value (or "null") representation the column dtype
        uses, as a tuple ``(kind, value)``.

        Kind:

        - 0 : NaN/NaT
        - 1 : sentinel value
        - 2 : bit mask
        - 3 : byte mask

        Value : if kind is "sentinel value", the actual value. None otherwise.
        """
        pass
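
    # Illustrative examples (not normative): a float column using NaN for
    # missing values would return ``(0, None)``; an integer column using -1
    # as a sentinel would return ``(1, -1)``; a column with an Arrow-style
    # validity bit mask would return ``(2, None)``.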

    @property
    def null_count(self) -> Optional[int]:
        """
        Number of null elements, if known.

        Note: Arrow uses -1 to indicate "unknown", but None seems cleaner.
        """
        pass

    def num_chunks(self) -> int:
        """
        Return the number of chunks the column consists of.
        """
        pass

    def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable[Column]:
        """
        Return an iterator yielding the chunks.

        See ``DataFrame.get_chunks`` for details on ``n_chunks``.
        """
        pass

    def get_buffer(self) -> Buffer:
        """
        Return the buffer containing the data.
        """
        pass

    def get_mask(self) -> Buffer:
        """
        Return the buffer containing the mask values indicating missing data.

        Raises RuntimeError if the null representation is not a bit or byte
        mask.
        """
        pass

    # # NOTE: not needed unless one considers nested dtypes
    # def get_children(self) -> Iterable[Column]:
    #     """
    #     Children columns underneath the column; each object in this
    #     iterator must adhere to the column specification.
    #     """
    #     pass
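
# A consumer-side sketch (illustrative only, not part of the spec) of
# materializing a fixed-width, native-endian buffer into a NumPy array.
# ``_buffer_to_ndarray`` and its parameters are assumptions for this example;
# the NumPy dtype is taken to be already decoded from ``Column.dtype``.
import ctypes


def _buffer_to_ndarray(buf: Buffer, np_dtype: np.dtype,
                       n_elements: int) -> np.ndarray:
    # View the producer's memory through its raw pointer, then copy so that
    # the result doesn't depend on the producer keeping the buffer alive.
    raw = (ctypes.c_byte * buf.bufsize).from_address(buf.ptr)
    return np.frombuffer(raw, dtype=np_dtype, count=n_elements).copy()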

class DataFrame:
    """
    A data frame class, with only the methods required by the interchange
    protocol defined.

    A "data frame" represents an ordered collection of named columns.
    A column's "name" must be a unique string.
    Columns may be accessed by name or by position.

    This could be a public data frame class, or an object with the methods
    and attributes defined on this DataFrame class could be returned from the
    ``__dataframe__`` method of a public data frame class in a library
    adhering to the dataframe interchange protocol specification.
    """

    def __dataframe__(self, nan_as_null: bool = False) -> dict:
        """
        Produce a dictionary object following the dataframe protocol spec.
        """
        self._nan_as_null = nan_as_null
        return {
            "dataframe": self,  # DataFrame object adhering to the protocol
            "version": 0        # Version number of the protocol
        }

    def num_columns(self) -> int:
        """
        Return the number of columns in the DataFrame.
        """
        pass

    def num_rows(self) -> Optional[int]:
        # TODO: not happy with Optional, but need to flag it may be expensive.
        #       Why include it if it may be None - what do we expect consumers
        #       to do here?
        """
        Return the number of rows in the DataFrame, if available.
        """
        pass

    def num_chunks(self) -> int:
        """
        Return the number of chunks the DataFrame consists of.
        """
        pass

    def column_names(self) -> Iterable[str]:
        """
        Return an iterator yielding the column names.
        """
        pass

    def get_column(self, i: int) -> Column:
        """
        Return the column at the indicated position.
        """
        pass

    def get_column_by_name(self, name: str) -> Column:
        """
        Return the column whose name is the indicated name.
        """
        pass

    def get_columns(self) -> Iterable[Column]:
        """
        Return an iterator yielding the columns.
        """
        pass

    def select_columns(self, indices: Sequence[int]) -> DataFrame:
        """
        Create a new DataFrame by selecting a subset of columns by index.
        """
        pass

    def select_columns_by_name(self, names: Sequence[str]) -> DataFrame:
        """
        Create a new DataFrame by selecting a subset of columns by name.
        """
        pass

    def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable[DataFrame]:
        """
        Return an iterator yielding the chunks.

        By default (None), yields the chunks that the data is stored as by
        the producer. If given, ``n_chunks`` must be a multiple of
        ``self.num_chunks()``, meaning the producer must subdivide each chunk
        before yielding it.
        """
        pass
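
    # Illustrative example (not normative): a producer that stores its data
    # as 4 chunks, when asked for ``n_chunks=8``, must split each native
    # chunk in two before yielding it.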

    @property
    def device(self) -> int:
        """
        Device type the dataframe resides on.

        Uses device type codes matching DLPack:

        - 1 : CPU
        - 2 : CUDA
        - 3 : CPU pinned
        - 4 : OpenCL
        - 7 : Vulkan
        - 8 : Metal
        - 9 : VPI (Verilog simulator buffer)
        - 10 : ROCm
        """
        pass
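
# An end-to-end consumer sketch (illustrative only): summarize any object
# whose ``__dataframe__`` returns the dictionary described above. The
# ``_summarize`` name is an assumption made for this example.
def _summarize(df_obj) -> None:
    protocol = df_obj.__dataframe__()
    df = protocol["dataframe"]  # object implementing the DataFrame interface
    print("protocol version:", protocol["version"])
    print("columns:", df.num_columns(), "rows:", df.num_rows())
    for col in df.get_columns():
        kind, bitwidth, fmt, endianness = col.dtype
        print(f"  {col.name}: kind={kind}, bits={bitwidth}, "
              f"nulls={col.null_count}")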
