Commit 183851d (parent: 2f51ba8)

Add requirements for chunking and memory layout description

Also address some smaller review comments.

1 file changed: protocol/dataframe_protocol_summary.md (34 additions, 9 deletions)
```diff
@@ -92,23 +92,41 @@ this is a consequence, and that that should be acceptable to them.
 
 1. Must be a standard Python-level API that is unambiguously specified, and
    not rely on implementation details of any particular dataframe library.
-2. Must treat dataframes as a collection of columns (which are 1-D arrays
-   with a dtype and missing data support).
-   _Note: this related to the API for `__dataframe__`, and does not imply
+2. Must treat dataframes as a collection of columns (which are conceptually
+   1-D arrays with a dtype and missing data support).
+   _Note: this relates to the API for `__dataframe__`, and does not imply
    that the underlying implementation must use columnar storage!_
 3. Must allow the consumer to select a specific set of columns for conversion.
 4. Must allow the consumer to access the following "metadata" of the dataframe:
    number of rows, number of columns, column names, column data types.
-   TBD: column data types wasn't clearly decided on, nor is it present in https://github.com/wesm/dataframe-protocol
-5. Must include device support
+   _Note: this implies that a data type specification needs to be created._
+5. Must include device support.
 6. Must avoid device transfers by default (e.g. copy data from GPU to CPU),
    and provide an explicit way to force such transfers (e.g. a `force=` or
    `copy=` keyword that the caller can set to `True`).
 7. Must be zero-copy if possible.
-8. Must be able to support "virtual columns" (e.g., a library like Vaex which
-   may not have data in memory because it uses lazy evaluation).
-9. Must support missing values (`NA`) for all supported dtypes.
-10. Must supports string and categorical dtypes
+8. Must support missing values (`NA`) for all supported dtypes.
+9. Must support string and categorical dtypes.
+10. Must allow the consumer to inspect the representation for missing values
+    that the producer uses for each column or data type.
+    _Rationale: this enables the consumer to control how conversion happens.
+    For example, if the producer uses `-128` as a sentinel value in an `int8`
+    column while the consumer uses a separate bit mask, that information
+    allows the consumer to make this mapping._
+11. Must allow the producer to describe its memory layout in sufficient
+    detail. In particular, for missing data and for data types that may have
+    multiple in-memory representations (e.g., categorical), all of those
+    representations must be describable in order to let the consumer map
+    them to the representation it uses.
+    _Rationale: prescribing a single in-memory representation in this
+    protocol would lead to unnecessary copies being made if that
+    representation isn't the native one a library uses._
+12. Must support chunking, i.e. accessing the data in "batches" of rows.
+    There must be metadata the consumer can access to learn how many chunks
+    the data is stored in. The consumer may also convert the data in more
+    chunks than it is stored in, i.e. it can ask the producer to slice its
+    columns to a shorter length. Such a request may not force the producer
+    to concatenate data that is already stored in separate chunks.
+    _Rationale: support for chunking is more efficient for libraries that
+    natively store chunks, and it is needed for dataframes that do not fit
+    in memory (e.g. dataframes stored on disk or lazily evaluated)._
 
 We'll also list some things that were discussed but are not requirements:
 
```
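The rationale for requirement 10 above can be made concrete with a small sketch. Assuming, purely for illustration, a producer that marks missing `int8` values with the sentinel `-128` while the consumer keeps a separate validity mask, NumPy is enough to show the mapping the requirement enables:

```python
import numpy as np

# Hypothetical producer column: int8 values where -128 marks missing data.
# The sentinel value itself would be described by the producer's metadata.
producer_column = np.array([5, -128, 42, -128, 7], dtype=np.int8)
sentinel = np.int8(-128)

# Because the protocol exposes the missing-value representation, the
# consumer can translate it into its own scheme: raw values plus a
# separate boolean validity mask.
valid_mask = producer_column != sentinel
values = producer_column.astype(np.int64)  # widen; invalid slots are junk

print(valid_mask.tolist())  # [True, False, True, False, True]
```

Without the metadata from requirement 10, the consumer could not distinguish a genuine `-128` from a missing value, and a correct conversion would be impossible.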
```diff
@@ -119,6 +137,13 @@ We'll also list some things that were discussed but are not requirements:
    May be added in the future (does have support in the Arrow C Data Interface)._
 3. Extension dtypes do not need to be supported.
    _Rationale: same as (2)_
+4. "Virtual columns", i.e. columns for which the data is not yet in memory
+   because it uses lazy evaluation, are not supported other than through
+   letting the producer materialize the data in memory when the consumer
+   calls `__dataframe__`.
+   _Rationale: the full dataframe API will support this use case by
+   "programming to an interface"; this data interchange protocol is
+   fundamentally built around describing data in memory._
 
 
 ## Frequently asked questions
```
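The chunking contract in requirement 12 can likewise be sketched. The class and method names below (`ChunkedColumn`, `num_chunks`, `get_chunks`, `n_chunks=`) are assumptions for illustration, not the protocol's actual API; the point is only the behavior: the consumer may request *more* chunks than are stored (the producer slices), but a request that would force concatenating stored chunks may be refused.

```python
# Toy sketch of chunked access per requirement 12; all names are hypothetical.
class ChunkedColumn:
    """Producer-side column whose data lives in several stored chunks."""

    def __init__(self, chunks):
        self._chunks = chunks  # e.g. [[1, 2, 3], [4, 5]]

    def num_chunks(self):
        # Metadata the consumer can query to learn the stored chunk count.
        return len(self._chunks)

    def get_chunks(self, n_chunks=None):
        """Yield the data in `n_chunks` batches (default: as stored)."""
        if n_chunks is None:
            yield from self._chunks
            return
        if n_chunks % self.num_chunks() != 0:
            # Honouring this would require merging stored chunks, which
            # the producer is allowed to refuse.
            raise ValueError("n_chunks must be a multiple of num_chunks()")
        subchunks_per_chunk = n_chunks // self.num_chunks()
        for chunk in self._chunks:
            step = -(-len(chunk) // subchunks_per_chunk)  # ceil division
            for start in range(0, len(chunk), step):
                yield chunk[start:start + step]


column = ChunkedColumn([[1, 2, 3], [4, 5]])
print(list(column.get_chunks()))            # [[1, 2, 3], [4, 5]]
print(list(column.get_chunks(n_chunks=4)))  # [[1, 2], [3], [4], [5]]
```

Note that `get_chunks(n_chunks=4)` only ever slices within a stored chunk; asking for, say, 3 chunks here would be refused because any split into 3 would straddle the stored chunk boundary.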
