protocol/dataframe_protocol_summary.md
1. Must be a standard Python-level API that is unambiguously specified, and
   not rely on implementation details of any particular dataframe library.
2. Must treat dataframes as a collection of columns (which are conceptually
   1-D arrays with a dtype and missing data support).
   _Note: this relates to the API for `__dataframe__`, and does not imply
   that the underlying implementation must use columnar storage!_
3. Must allow the consumer to select a specific set of columns for conversion.
4. Must allow the consumer to access the following "metadata" of the dataframe:
   number of rows, number of columns, column names, column data types.
   _Note: this implies that a data type specification needs to be created._
5. Must include device support.
6. Must avoid device transfers by default (e.g. copy data from GPU to CPU),
   and provide an explicit way to force such transfers (e.g. a `force=` or
   `copy=` keyword that the caller can set to `True`).
7. Must be zero-copy if possible.
8. Must support missing values (`NA`) for all supported dtypes.
9. Must support string and categorical dtypes.
10. Must allow the consumer to inspect the representation for missing values
    that the producer uses for each column or data type.
    _Rationale: this enables the consumer to control how conversion happens.
    For example, if the producer uses `-128` as a sentinel value in an
    `int8` column while the consumer uses a separate bit mask, that
    information allows the consumer to make the mapping (a worked example
    follows this list)._
11. Must allow the producer to describe its memory layout in sufficient
    detail. In particular, for missing data and data types that may have
    multiple in-memory representations (e.g., categorical), those
    representations must all be describable in order to let the consumer
    map them to the representation it uses.
    _Rationale: prescribing a single in-memory representation in this
    protocol would lead to unnecessary copies being made if that
    representation isn't the native one a library uses._
12. Must support chunking, i.e. accessing the data in "batches" of rows.
    There must be metadata the consumer can access to learn how many chunks
    the data is stored in. The consumer may also convert the data into more
    chunks than it is stored in, i.e. it can ask the producer to slice its
    columns to a shorter length; that request may not be one that would
    force the producer to concatenate data that is already stored in
    separate chunks (see the consumer-side sketch after this list).
    _Rationale: support for chunking is more efficient for libraries that
    natively store chunks, and it is needed for dataframes that do not fit
    in memory (e.g. dataframes stored on disk or lazily evaluated)._
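
To make requirement 10 concrete, here is a minimal sketch of the rationale's
sentinel example, using NumPy purely for illustration (nothing in it is
specified by this protocol): the producer advertises `-128` as the
missing-value sentinel for an `int8` column, and a consumer that uses a
separate validity mask derives one without rewriting the values.

```python
import numpy as np

# Producer side (illustrative): an int8 column in which the advertised
# sentinel value -128 encodes "missing".
producer_column = np.array([5, -128, 17, -128, 42], dtype=np.int8)
sentinel = np.int8(-128)  # representation reported by the producer

# Consumer side: this library tracks missingness with a separate mask, so
# it builds one from the sentinel instead of copying or rewriting values.
valid = producer_column != sentinel          # True where a value is present
masked = np.ma.masked_array(producer_column, mask=~valid)

print(masked)  # [5 -- 17 -- 42]
```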
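And here, referenced from requirement 12, is a consumer-side sketch showing
how several of the requirements above could compose. Every method and
attribute name used (`num_rows`, `select_columns_by_name`, `get_chunks`,
`describe_null`, and so on) is an assumption made for illustration; this
document does not define the concrete API.

```python
def consume(df, wanted):
    """Hypothetical consumer walking an object returned by `__dataframe__`.

    The method names are illustrative only; the protocol merely requires
    that operations with these semantics exist.
    """
    dfx = df.__dataframe__()  # requirement 1: a plain Python-level API

    # Requirement 4: metadata access.
    print(dfx.num_rows(), "rows,", dfx.num_columns(), "columns:",
          dfx.column_names())

    # Requirement 3: select a specific set of columns for conversion.
    subset = dfx.select_columns_by_name(wanted)

    # Requirement 12: learn how many chunks the data is stored in, and
    # optionally request *more* chunks (never fewer, which could force
    # the producer to concatenate separately stored chunks).
    for chunk in subset.get_chunks(n_chunks=2 * subset.num_chunks()):
        for column in chunk.get_columns():
            # Requirement 10: inspect the missing-value representation,
            # e.g. a ("sentinel", -128) pair, before converting buffers.
            null_kind, null_value = column.describe_null
            ...  # convert buffers, zero-copy where possible (requirement 7)
```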
We'll also list some things that were discussed but are not requirements:
   May be added in the future (does have support in the Arrow C Data Interface)._
3. Extension dtypes do not need to be supported.
   _Rationale: same as (2)._
4. "Virtual columns", i.e. columns for which the data is not yet in memory
   because it uses lazy evaluation, are not supported other than through
   letting the producer materialize the data in memory when the consumer
   calls `__dataframe__` (a producer-side sketch follows this list).
   _Rationale: the full dataframe API will support this use case by
   "programming to an interface"; this data interchange protocol is
   fundamentally built around describing data in memory._
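
As a minimal sketch of the one supported path for virtual columns: a lazily
evaluated producer can materialize the data at the moment `__dataframe__` is
called. The classes below, and the plain dict standing in for a real
interchange object, are assumptions for illustration only.

```python
class LazyColumn:
    """A "virtual" column: data is computed only when first requested."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable yielding the data
        self._data = None

    def materialize(self):
        if self._data is None:
            self._data = self._compute()  # lazy evaluation happens here
        return self._data


class LazyDataFrame:
    """Sketch of a lazily evaluated producer (in the spirit of Vaex)."""

    def __init__(self, columns):
        self._columns = columns  # dict: name -> LazyColumn

    def __dataframe__(self):
        # Virtual columns are supported only by bringing the data into
        # memory at this point; a real implementation would return an
        # interchange object describing these in-memory buffers.
        return {name: col.materialize() for name, col in self._columns.items()}


# The computation runs only once a consumer actually requests the data.
df = LazyDataFrame({"a": LazyColumn(lambda: [x * x for x in range(4)])})
print(df.__dataframe__())  # {'a': [0, 1, 4, 9]}
```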