Add lower-precision integer and floating point data types, and packbits codec #3
Merged
2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
# packbits codec | ||
|
||
Defines an `array -> bytes` codec that packs together values that are | ||
represented by a fixed number of bits that is not necessarily a multiple of 8. | ||
|
||
## Codec name | ||
|
||
The value of the `name` member in the codec object MUST be `packbits`. | ||
|
||
## Configuration parameters | ||
|
||
### `padding_encoding` (Optional) | ||
|
||
Specifies how the number of padding bits is encoded, such that the number of | ||
decoded elements may be determined from the encoded representation alone. | ||
|
||
Must be one of: | ||
- `"first_byte"`, indicating that the first byte specifies the number of padding | ||
bits that were added; | ||
- `"last_byte"`, indicating that the final byte specifies the number of padding | ||
bits that were added; | ||
- `"none"` (default), indicating that the number of padding bits is not encoded. | ||
In this case, the number of decoded elements cannot be determined from the | ||
encoded representation alone. | ||
|
||
While zarr itself does not need to be able to recover the number of decoded | ||
elements from the encoded representation alone, because this information can be | ||
propagated from the metadata through any prior codecs in the chain, it may still | ||
be useful as an additional sanity check or for non-zarr uses of the codec. | ||
|
||
A value of `"first_byte"` provides compatibility with the [numcodecs packbits | ||
codec](https://github.com/zarr-developers/numcodecs/blob/3c933cf19d4d84f2efc5f3a36926d8c569514a90/numcodecs/packbits.py#L7) | ||
defined for zarr v2 (which only supports `bool`). | ||
|
||
### `first_bit` (Optional) | ||
|
||
Specifies the index (starting from the least-significant bit) of the first bit | ||
to be encoded. If omitted, or specified as `null`, defaults to `0`. | ||
|
||
### `last_bit` (Optional) | ||
|
||
Specifies the index (starting from the least-significant bit) of the (inclusive) | ||
last bit to be encoded. If omitted, or specified as `null`, defaults to `N - 1`, | ||
where `N` is the total number of bits per component of the data type (specified | ||
below). | ||
|
||
It is invalid for `last_bit` to be less than `first_bit`. | ||
|
||
Note: for complex number data types, `first_bit` and `last_bit` apply to the | ||
real and imaginary coefficients separately. | ||
|
||
## Format and algorithm | ||
|
||
This is an `array -> bytes` codec. | ||
|
||
### Encoding/decoding of individual array elements | ||
|
||
Each element of the array is encoded as a fixed number of bits, `k`, where `k` | ||
is determined from the data type, `first_bit`, and `last_bit`. Specifically: | ||
|
||
``` | ||
b := last_bit - first_bit + 1, | ||
k := num_components * b, | ||
``` | ||
|
||
where `num_components` is determined by the data type (2 for complex number data | ||
types, 1 for all other data types). | ||
|
||
Note: If `first_bit` and `last_bit` are both unspecified, `b == N`. | ||
|
||
Logically, to encode an element of the array, each component is first encoded as | ||
an `N`-bit value (retaining all bits). From this `N`-bit value, the `b` bits | ||
from `first_bit` to `last_bit` are then extracted. | ||
|
||
To decode an element of the array, the `b` encoded bits are first shifted to | ||
`first_bit`. Depending on the data type, the value is then: | ||
|
||
- sign-extended (for signed integer data types), or | ||
- zero-extended (all other data types) | ||
|
||
up to `N` bits. | ||
|
||
### Encoding and decoding multiple elements | ||
|
||
- Array elements are encoded in lexicographical order, to produce a bit | ||
sequence; element `i` corresponds to bits `[i * k, (i+1) * k)` within the | ||
sequence. | ||
- The bit sequence is padded with 0 bits to ensure its length is a multiple of | ||
8 bits. | ||
- Encoded byte `i` corresponds to bits `[i * 8, (i+1) * 8)` within the sequence. | ||
- If `padding_encoding` is `"first_byte"`, a single byte specifying | ||
the number of padding bits that were added is prepended to the encoded byte | ||
sequence. | ||
- If `padding_encoding` is `"last_byte"`, a single byte specifying the number of | ||
padding bits that were added is appended to the encoded byte sequence. | ||
|
||
## Supported data types | ||
|
||
- bool (encoded as 1 bit) | ||
- int2, uint2 (encoded as 2 bits) | ||
- int4, uint4, float4_e2m1fn (encoded as 4 bits) | ||
- float6_e2m3fn, float6_e3m2fn (encoded as 6 bits) | ||
- complex_float4_e2m1fn (encoded as two 4-bit components) | ||
- complex_float6_e2m3fn, complex_float6_e3m2fn (encoded as two 6-bit components) | ||
- int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, float64, | ||
bfloat16, complex_float32, complex_float64, complex_bfloat16 (encoded using | ||
their little-endian representation as with the "bytes" codec) | ||
|
||
Note: For the data types in the last item above (8 or more bits per component), | ||
if `first_bit` and `last_bit` are not used to limit the number of bits that are | ||
encoded, this codec does not provide any benefit over the `"bytes"` codec but is | ||
supported nonetheless for uniformity. | ||
|
||
## Change log | ||
|
||
No changes yet. | ||
|
||
## Current maintainers | ||
|
||
* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google |
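
The "Format and algorithm" section of the codec description above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. It makes one assumption the text above does not state: bits are numbered least-significant first, both within an element and within each encoded byte. The function names (`extract_bits`, `restore_bits`, `packbits_encode`, `packbits_decode`) are hypothetical, and elements are assumed to be already flattened in lexicographical order and reduced to `k`-bit unsigned integers.

```python
def extract_bits(value: int, first_bit: int, last_bit: int) -> int:
    """Return bits first_bit..last_bit (inclusive) of value as a b-bit unsigned integer."""
    b = last_bit - first_bit + 1
    return (value >> first_bit) & ((1 << b) - 1)


def restore_bits(bits: int, first_bit: int, last_bit: int, signed: bool) -> int:
    """Shift the b encoded bits back to first_bit, then sign- or zero-extend."""
    b = last_bit - first_bit + 1
    value = bits << first_bit
    if signed and (bits >> (b - 1)) & 1:
        # The top encoded bit is set: sign-extend above last_bit.
        # With Python's unbounded ints this is a subtraction of 2**(last_bit + 1).
        value -= 1 << (last_bit + 1)
    return value


def packbits_encode(values, k: int, padding_encoding: str = "none") -> bytes:
    """Pack k-bit unsigned integers into bytes, padding with 0 bits."""
    bit_seq = 0
    for i, v in enumerate(values):
        # Element i occupies bits [i * k, (i + 1) * k) of the sequence.
        bit_seq |= (v & ((1 << k) - 1)) << (i * k)
    total_bits = len(values) * k
    padding = (-total_bits) % 8
    # ASSUMPTION: bit j of the sequence is bit (j % 8) of byte (j // 8).
    body = bit_seq.to_bytes((total_bits + padding) // 8, "little")
    if padding_encoding == "first_byte":
        return bytes([padding]) + body
    if padding_encoding == "last_byte":
        return body + bytes([padding])
    return body


def packbits_decode(data: bytes, k: int, num_elements=None,
                    padding_encoding: str = "none") -> list:
    """Unpack k-bit unsigned integers; num_elements is required for "none"."""
    padding, body = None, data
    if padding_encoding == "first_byte":
        padding, body = data[0], data[1:]
    elif padding_encoding == "last_byte":
        padding, body = data[-1], data[:-1]
    if num_elements is None:
        if padding is None:
            raise ValueError("num_elements is required when padding is not encoded")
        num_elements = (len(body) * 8 - padding) // k
    bit_seq = int.from_bytes(body, "little")
    mask = (1 << k) - 1
    return [(bit_seq >> (i * k)) & mask for i in range(num_elements)]
```

Under these assumptions, the three `uint4` values 1, 2, 3 with `padding_encoding` set to `"first_byte"` encode to the bytes `04 21 03`: 12 data bits, 4 padding bits, and a leading byte recording the padding count.
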
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
{ | ||
"$schema": "https://json-schema.org/draft/2020-12/schema", | ||
"type": "object", | ||
"properties": { | ||
"name": { | ||
"const": "packbits" | ||
}, | ||
"configuration": { | ||
"type": "object", | ||
"properties": { | ||
"padding_encoding": { | ||
"enum": ["first_byte", "last_byte", "none"], | ||
"default": "none" | ||
}, | ||
"first_bit": { | ||
"oneOf": [ | ||
{ | ||
"type": "integer", | ||
"minimum": 0 | ||
}, | ||
{ "const": null } | ||
] | ||
}, | ||
"last_bit": { | ||
"oneOf": [ | ||
{ | ||
"type": "integer", | ||
"minimum": 0 | ||
}, | ||
{ "const": null } | ||
] | ||
} | ||
}, | ||
"additionalProperties": false | ||
} | ||
}, | ||
"required": ["name"], | ||
"additionalProperties": false | ||
} |
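
The schema above constrains each configuration member on its own. The defaults and the cross-field rule from the codec description (`last_bit` must not be less than `first_bit`) still have to be applied in code. A minimal sketch, using the member names from the codec description; the function name `resolve_bit_range` is hypothetical, and the upper-bound check against `N` is an assumption that the description implies but does not state:

```python
def resolve_bit_range(configuration: dict, n: int) -> tuple:
    """Apply defaults for first_bit/last_bit and check the range for an n-bit component."""
    first_bit = configuration.get("first_bit")
    last_bit = configuration.get("last_bit")
    # Omitted or null values fall back to the documented defaults.
    first_bit = 0 if first_bit is None else first_bit
    last_bit = n - 1 if last_bit is None else last_bit
    if last_bit < first_bit:
        raise ValueError("last_bit must not be less than first_bit")
    # ASSUMPTION: bit indices beyond the component width are rejected.
    if last_bit >= n:
        raise ValueError("last_bit must be less than the number of bits per component")
    return first_bit, last_bit
```
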
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
# bfloat16 data type | ||
|
||
Defines the `bfloat16` floating-point data type | ||
(https://en.wikipedia.org/wiki/Bfloat16_floating-point_format). | ||
|
||
A `bfloat16` number is an IEEE 754 binary32 floating-point number truncated to | ||
16 bits. | ||
|
||
- 1 sign bit | ||
- 8 exponent bits, with bias of 127 | ||
- 7 mantissa bits | ||
- IEEE 754-compliant, with NaN and +/-inf. | ||
- Subnormal numbers when biased exponent is 0. | ||
|
||
## Data type name | ||
|
||
The data type is specified as `"bfloat16"`. | ||
|
||
## Configuration | ||
|
||
None. | ||
|
||
## Fill value representation | ||
|
||
The fill value is specified in the same way as the core IEEE 754 floating point | ||
numbers: | ||
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value | ||
|
||
The constant `"NaN"` corresponds to a representation of `"0x7fc0"`. | ||
|
||
## Codec compatibility | ||
|
||
### bytes | ||
|
||
Encoded as a 2-byte little-endian or big-endian value `0bSEEEEEEEEMMMMMMM`. | ||
|
||
## See also | ||
|
||
A Python implementation is available at https://pypi.org/project/ml-dtypes/ | ||
|
||
## Current maintainers | ||
|
||
* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google |
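
The relationship between `bfloat16` and binary32 described above can be illustrated with the Python standard library alone. This is a sketch under stated assumptions: the helper names are hypothetical, and the conversion truncates the low 16 bits (rounding toward zero), which is the simplest reading of "truncated" and may differ from the rounding a library such as ml-dtypes applies.

```python
import struct


def float32_to_bfloat16_bits(x: float) -> int:
    """Keep the upper 16 bits (sign, 8 exponent bits, 7 mantissa bits) of the binary32 encoding."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def bfloat16_bits_to_float(bits16: int) -> float:
    """Widen a 16-bit bfloat16 pattern to binary32 by appending 16 zero bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x


def encode_bfloat16(bits16: int, endian: str = "little") -> bytes:
    """Serialize the 16-bit pattern as the "bytes" codec would, in the given byte order."""
    return bits16.to_bytes(2, endian)
```

For example, `1.0` has the pattern `0x3f80`, and the `"NaN"` pattern `0x7fc0` is stored as the bytes `c0 7f` in little-endian order and `7f c0` in big-endian order.
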
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
{ | ||
"$schema": "https://json-schema.org/draft/2020-12/schema", | ||
"oneOf": [ | ||
{ | ||
"type": "object", | ||
"properties": { | ||
"name": { | ||
"const": "bfloat16" | ||
}, | ||
"configuration": { | ||
"type": "object", | ||
"additionalProperties": false | ||
} | ||
}, | ||
"required": ["name"], | ||
"additionalProperties": false | ||
}, | ||
{ | ||
"const": "bfloat16" | ||
} | ||
] | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# complex_bfloat16 data type | ||
|
||
Defines a complex number data type where the real and imaginary components are | ||
represented by the `bfloat16` data type. | ||
|
||
## Data type name | ||
|
||
The data type is specified as `"complex_bfloat16"`. | ||
|
||
## Configuration | ||
|
||
None. | ||
|
||
## Fill value representation | ||
|
||
The fill value is specified in the same way as the core complex number data types: | ||
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value | ||
|
||
## Codec compatibility | ||
|
||
### bytes | ||
|
||
Encoded as 2 consecutive (real component followed by imaginary component) 2-byte | ||
little-endian or big-endian values. | ||
|
||
## Current maintainers | ||
|
||
* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google |
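
The `bytes` layout described above for `complex_bfloat16` (real component first, then imaginary, each as a 2-byte value) might look like this in Python. The function names are hypothetical, and the components are passed as raw 16-bit `bfloat16` patterns because the standard library has no `bfloat16` type.

```python
import struct


def encode_complex_bfloat16(real_bits: int, imag_bits: int, endian: str = "little") -> bytes:
    """Pack two 16-bit bfloat16 patterns, real component first."""
    fmt = "<HH" if endian == "little" else ">HH"
    return struct.pack(fmt, real_bits, imag_bits)


def decode_complex_bfloat16(data: bytes, endian: str = "little") -> tuple:
    """Return the (real_bits, imag_bits) patterns from a 4-byte encoding."""
    fmt = "<HH" if endian == "little" else ">HH"
    return struct.unpack(fmt, data)
```

With `1.0` as `0x3f80` and `-2.0` as `0xc000`, the value `1 - 2i` is stored as `80 3f 00 c0` in little-endian order and `3f 80 c0 00` in big-endian order.
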
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
{ | ||
"$schema": "https://json-schema.org/draft/2020-12/schema", | ||
"oneOf": [ | ||
{ | ||
"type": "object", | ||
"properties": { | ||
"name": { | ||
"const": "complex_bfloat16" | ||
}, | ||
"configuration": { | ||
"type": "object", | ||
"additionalProperties": false | ||
} | ||
}, | ||
"required": ["name"], | ||
"additionalProperties": false | ||
}, | ||
{ | ||
"const": "complex_bfloat16" | ||
} | ||
] | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# complex_float16 data type | ||
|
||
Defines a complex number data type where the real and imaginary components are | ||
represented by the `float16` data type. | ||
|
||
## Data type name | ||
|
||
The data type is specified as `"complex_float16"`. | ||
|
||
## Configuration | ||
|
||
None. | ||
|
||
## Fill value representation | ||
|
||
The fill value is specified in the same way as the core complex number data types: | ||
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value | ||
|
||
## Codec compatibility | ||
|
||
### bytes | ||
|
||
Encoded as 2 consecutive (real component followed by imaginary component) 2-byte | ||
little-endian or big-endian values. | ||
|
||
## Current maintainers | ||
|
||
* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google |
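
For `complex_float16`, the `bytes` layout described above can be reproduced with Python's `struct` module, whose `e` format character is IEEE 754 binary16. A minimal sketch; the function names are hypothetical, and values outside the `float16` range are not handled.

```python
import struct


def encode_complex_float16(z: complex, endian: str = "little") -> bytes:
    """Pack the real component followed by the imaginary component as two float16 values."""
    prefix = "<" if endian == "little" else ">"
    return struct.pack(prefix + "ee", z.real, z.imag)


def decode_complex_float16(data: bytes, endian: str = "little") -> complex:
    """Unpack a 4-byte encoding back into a Python complex number."""
    prefix = "<" if endian == "little" else ">"
    real, imag = struct.unpack(prefix + "ee", data)
    return complex(real, imag)
```
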
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
{ | ||
"$schema": "https://json-schema.org/draft/2020-12/schema", | ||
"oneOf": [ | ||
{ | ||
"type": "object", | ||
"properties": { | ||
"name": { | ||
"const": "complex_float16" | ||
}, | ||
"configuration": { | ||
"type": "object", | ||
"additionalProperties": false | ||
} | ||
}, | ||
"required": ["name"], | ||
"additionalProperties": false | ||
}, | ||
{ | ||
"const": "complex_float16" | ||
} | ||
] | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# complex_float32 data type | ||
|
||
Defines an alias of the core `complex64` data type, for consistency with the new | ||
complex number data types for which `complexNNN` naming is not sufficient. | ||
|
||
## Data type name | ||
|
||
The data type is specified as `"complex_float32"`. | ||
|
||
## Configuration | ||
|
||
None. | ||
|
||
## Current maintainers | ||
|
||
* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
{ | ||
"$schema": "https://json-schema.org/draft/2020-12/schema", | ||
"oneOf": [ | ||
{ | ||
"type": "object", | ||
"properties": { | ||
"name": { | ||
"const": "complex_float32" | ||
}, | ||
"configuration": { | ||
"type": "object", | ||
"additionalProperties": false | ||
} | ||
}, | ||
"required": ["name"], | ||
"additionalProperties": false | ||
}, | ||
{ | ||
"const": "complex_float32" | ||
} | ||
] | ||
} |
Would suggest we call this `array -> bits`. Packing to bits means we are dropping the byte-for-byte meaning of the data.

Are you saying that we should rename the "array -> bytes" codec concept in the zarr v3 spec to "array -> bits" instead, or are you saying we should introduce a new "array -> bits" codec concept?
"Array -> bytes" codecs convert what is logically an array into a byte sequence. That is exactly what this codec does also --- the input is an array, and the output is a byte sequence. Internally it deals with bits, but it ensures the output is padded up to a whole number of bytes.
Per the previous discussion, I guess an "array -> bits" codec could be a type of "array -> array" codec that produces an array of "rawbits" values where the number of bits is not a multiple of 8, but I'm not sure if that is what you have in mind.