Add lower-precision integer and floating point data types, and packbits codec #3

Merged
merged 2 commits into from
Apr 18, 2025
119 changes: 119 additions & 0 deletions codecs/packbits/README.md
@@ -0,0 +1,119 @@
# packbits codec

Defines an `array -> bytes` codec that packs together values represented by a
fixed number of bits, which need not be a multiple of 8.

## Codec name

The value of the `name` member in the codec object MUST be `packbits`.

## Configuration parameters

### `padding_encoding` (Optional)

Specifies how the number of padding bits is encoded, such that the number of
decoded elements may be determined from the encoded representation alone.

Must be one of:
- `"first_byte"`, indicating that the first byte specifies the number of padding
bits that were added;
- `"last_byte"`, indicating that the final byte specifies the number of padding
bits that were added;
- `"none"` (default), indicating that the number of padding bits is not encoded.
In this case, the number of decoded elements cannot be determined from the
encoded representation alone.

While zarr itself does not need to be able to recover the number of decoded
elements from the encoded representation alone, because this information can be
propagated from the metadata through any prior codecs in the chain, it may still
be useful as an additional sanity check or for non-zarr uses of the codec.

A value of `"first_byte"` provides compatibility with the [numcodecs packbits
codec](https://github.com/zarr-developers/numcodecs/blob/3c933cf19d4d84f2efc5f3a36926d8c569514a90/numcodecs/packbits.py#L7)
defined for zarr v2 (which only supports `bool`).

### `first_bit` (Optional)

Specifies the index (starting from the least-significant bit) of the first bit
to be encoded. If omitted, or specified as `null`, defaults to `0`.

### `last_bit` (Optional)

Specifies the index (starting from the least-significant bit) of the (inclusive)
last bit to be encoded. If omitted, or specified as `null`, defaults to `N - 1`,
where `N` is the total number of bits per component of the data type (specified
below).

It is invalid for `last_bit` to be less than `first_bit`.

Note: for complex number data types, `first_bit` and `last_bit` apply to the
real and imaginary coefficients separately.
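
As a rough illustration of how these parameters combine, the following Python
sketch builds a codec object and resolves the defaults. The helper name
`resolve_bit_range` and the choice of `int16` (so `N = 16`) are assumptions made
for this example, as is rejecting indices outside `[0, N)`, which the text above
does not state explicitly.

```python
import json

# Hypothetical codec object for an int16 array that keeps only bits 4..15.
codec = {
    "name": "packbits",
    "configuration": {
        "padding_encoding": "none",
        "first_bit": 4,
        "last_bit": None,  # serialized as null, so the default N - 1 applies
    },
}


def resolve_bit_range(n_bits, first_bit=None, last_bit=None):
    """Apply the defaults (0 and N - 1) and check the range is valid."""
    first = 0 if first_bit is None else first_bit
    last = n_bits - 1 if last_bit is None else last_bit
    if last < first:
        raise ValueError("last_bit must not be less than first_bit")
    # Assumption: indices outside [0, N) are treated as errors.
    if first < 0 or last >= n_bits:
        raise ValueError("bit indices must lie in [0, N)")
    return first, last


config = codec["configuration"]
first, last = resolve_bit_range(16, config.get("first_bit"), config.get("last_bit"))
print(json.dumps(codec))
print(first, last)  # 4 15
```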

## Format and algorithm

This is an `array -> bytes` codec.
Member

Would suggest we call this `array -> bits`. Packing to bits means we are dropping the byte-for-byte meaning of the data.

Contributor Author

Are you saying that we should rename the "array -> bytes" codec concept in the zarr v3 spec to "array -> bits" instead, or are you saying we should introduce a new "array -> bits" codec concept?

"Array -> bytes" codecs convert what is logically an array into a byte sequence. That is exactly what this codec does also --- the input is an array, and the output is a byte sequence. Internally it deals with bits, but it ensures the output is padded up to a whole number of bytes.

Per the previous discussion, I guess an "array -> bits" codec could be a type of "array -> array" codec that produces an array of "rawbits" values where the number of bits is not a multiple of 8, but I'm not sure if that is what you have in mind.


### Encoding/decoding of individual array elements

Each element of the array is encoded as a fixed number of bits, `k`, where `k`
is determined from the data type, `first_bit`, and `last_bit`. Specifically:

```
b := last_bit - first_bit + 1,
k := num_components * b,
```

where `num_components` is determined by the data type (2 for complex number data
types, 1 for all other data types).

Note: If `first_bit` and `last_bit` are both unspecified, `b == N`.
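
The formulas above can be sketched in Python as follows. The function name
`bits_per_element` is hypothetical, and the values of `N` passed in the examples
(4, 6, and 16) are taken from the names of the data types they stand for.

```python
def bits_per_element(n_bits, num_components=1, first_bit=None, last_bit=None):
    """Return (b, k): bits kept per component and bits per array element."""
    first = 0 if first_bit is None else first_bit
    last = n_bits - 1 if last_bit is None else last_bit
    b = last - first + 1
    k = num_components * b
    return b, k


print(bits_per_element(4))                             # uint4 -> (4, 4)
print(bits_per_element(6, num_components=2))           # complex 6-bit float -> (6, 12)
print(bits_per_element(16, first_bit=4, last_bit=11))  # int16, bits 4..11 -> (8, 8)
```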

Logically, to encode an element of the array, each component is first encoded as
an `N`-bit value (retaining all bits). From this `N`-bit value, the `b` bits
from `first_bit` to `last_bit` are then extracted.

To decode an element of the array, the `b` encoded bits are first shifted to
`first_bit`. Depending on the data type, the value is then:

- sign-extended (for signed integer data types), or
- zero-extended (all other data types)

up to `N` bits.
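
A minimal sketch of the per-component steps described above, assuming each
component is handled as a Python integer holding its `N`-bit pattern (for
floating-point types this would be the raw bit pattern, not the numeric value).
The function names are hypothetical.

```python
def encode_component(value, n_bits, first_bit, last_bit):
    """Keep bits first_bit..last_bit (inclusive) of the N-bit pattern of value."""
    b = last_bit - first_bit + 1
    pattern = value & ((1 << n_bits) - 1)  # N-bit two's complement bit pattern
    return (pattern >> first_bit) & ((1 << b) - 1)


def decode_component(encoded, n_bits, first_bit, last_bit, signed):
    """Shift the b encoded bits back to first_bit, then extend up to N bits."""
    pattern = encoded << first_bit
    if signed and (encoded >> (last_bit - first_bit)) & 1:
        # Sign-extend: set every bit above last_bit, up to bit N - 1.
        pattern |= ((1 << n_bits) - 1) & ~((1 << (last_bit + 1)) - 1)
        return pattern - (1 << n_bits)  # reinterpret the N-bit pattern as negative
    return pattern  # zero-extended: bits above last_bit stay 0


# int8 value -3 stored in its low 4 bits round-trips through sign extension.
print(encode_component(-3, 8, 0, 3))        # 13 (0b1101)
print(decode_component(13, 8, 0, 3, True))  # -3
# uint8 value 0xAB keeping only bits 4..7 loses its low nibble.
print(decode_component(encode_component(0xAB, 8, 4, 7), 8, 4, 7, False))  # 160 (0xA0)
```

Bits below `first_bit` are not stored, so they decode as 0; the round trip is
lossless only when the discarded bits carried no information.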

### Encoding and decoding multiple elements

- Array elements are encoded in lexicographical order, to produce a bit
sequence; element `i` corresponds to bits `[i * k, (i+1) * k)` within the
sequence.
- The bit sequence is padded with 0 bits to ensure its length is a multiple of
8 bits.
- Encoded byte `i` corresponds to bits `[i * 8, (i+1) * 8)` within the sequence.
- If `padding_encoding` is `"first_byte"`, a single byte specifying
the number of padding bits that were added is prepended to the encoded byte
sequence.
- If `padding_encoding` is `"last_byte"`, a single byte specifying the number of
padding bits that were added is appended to the encoded byte sequence.
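
The steps above can be sketched in Python as follows. This is an illustration
under stated assumptions, not a reference implementation: the function names are
hypothetical, each element is given as an unsigned `k`-bit pattern, and bit `j`
of the sequence is assumed to map to bit `j % 8` (counting from the
least-significant bit) of byte `j // 8`. The list above does not spell out the
bit order within a byte, so a conforming implementation may differ.

```python
def pack_elements(values, k, padding_encoding="none"):
    """Pack unsigned k-bit patterns into bytes, padding with 0 bits."""
    acc = 0
    for i, v in enumerate(values):
        acc |= (v & ((1 << k) - 1)) << (i * k)  # element i -> bits [i*k, (i+1)*k)
    total_bits = len(values) * k
    padding = (-total_bits) % 8  # 0 bits needed to reach a multiple of 8
    # Assumed bit order: little-endian, so sequence bit j is bit j % 8 of byte j // 8.
    body = acc.to_bytes((total_bits + padding) // 8, "little")
    if padding_encoding == "first_byte":
        return bytes([padding]) + body
    if padding_encoding == "last_byte":
        return body + bytes([padding])
    if padding_encoding != "none":
        raise ValueError("unknown padding_encoding")
    return body


def unpack_elements(data, k, padding_encoding="none", count=None):
    """Inverse of pack_elements; count is required when padding is not encoded."""
    if padding_encoding == "first_byte":
        padding, body = data[0], data[1:]
    elif padding_encoding == "last_byte":
        padding, body = data[-1], data[:-1]
    else:
        padding, body = None, data
    if padding is not None:
        count = (len(body) * 8 - padding) // k
    elif count is None:
        raise ValueError("count must be supplied when padding_encoding is 'none'")
    acc = int.from_bytes(body, "little")
    mask = (1 << k) - 1
    return [(acc >> (i * k)) & mask for i in range(count)]


packed = pack_elements([1, 2, 3], k=3, padding_encoding="first_byte")
print(packed.hex())  # 07d100: 7 padding bits, then 9 data bits in 2 bytes
print(unpack_elements(packed, k=3, padding_encoding="first_byte"))  # [1, 2, 3]
```

With `"none"`, the element count has to come from elsewhere (for zarr, from the
array metadata), which is why `unpack_elements` takes it as a parameter.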

## Supported data types

- bool (encoded as 1 bit)
- int2, uint2 (encoded as 2 bits)
- int4, uint4, float4_e2m1fn (encoded as 4 bits)
- float6_e2m3fn, float6_e3m2fn (encoded as 6 bits)
- complex_float4_e2m1fn (encoded as 2 4-bit components)
- complex_float6_e2m3fn, complex_float6_e3m2fn (encoded as 2 6-bit components)
- int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, float64,
bfloat16, complex_float32, complex_float64, complex_bfloat16 (encoded using
their little-endian representation as with the "bytes" codec)

Note: For the types in the last item above, if `first_bit` and `last_bit` are
not used to limit the number of bits that are encoded, this codec provides no
benefit over the `"bytes"` codec, but they are supported nonetheless for
uniformity.

## Change log

No changes yet.

## Current maintainers

* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google
39 changes: 39 additions & 0 deletions codecs/packbits/schema.json
@@ -0,0 +1,39 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"name": {
"const": "packbits"
},
"configuration": {
"type": "object",
"properties": {
"padding_encoding": {
"enum": ["first_byte", "last_byte", "none"],
"default": "none"
},
"first_bit": {
"oneOf": [
{
"type": "integer",
"minimum": 0
},
{ "const": null }
]
},
"last_bit": {
"oneOf": [
{
"type": "integer",
"minimum": 0
},
{ "const": null }
]
}
},
"additionalProperties": false
}
},
"required": ["name"],
"additionalProperties": false
}
43 changes: 43 additions & 0 deletions data-types/bfloat16/README.md
@@ -0,0 +1,43 @@
# bfloat16 data type

Defines the `bfloat16` floating-point data type
(https://en.wikipedia.org/wiki/Bfloat16_floating-point_format).

A `bfloat16` number is an IEEE 754 binary32 floating-point number truncated to
16 bits.

- 1 sign bit
- 8 exponent bits, with bias of 127
- 7 mantissa bits
- IEEE 754-compliant, with NaN and +/-inf.
- Subnormal numbers when biased exponent is 0.
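
To make the layout concrete, here is a small Python sketch that derives a
`bfloat16` bit pattern by keeping the upper 16 bits of a binary32 value and then
splits it into its fields. The function names are hypothetical, and plain
truncation (rounding toward zero) is used only because it mirrors the
description above; libraries commonly round to nearest instead.

```python
import struct


def float32_to_bfloat16_bits(x):
    """Keep the upper 16 bits of the binary32 pattern of x (plain truncation)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def bfloat16_bits_to_float(bits16):
    """Widen a bfloat16 pattern back to a Python float via binary32."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x


def split_fields(bits16):
    """Return (sign, biased_exponent, mantissa) of a bfloat16 pattern."""
    return bits16 >> 15, (bits16 >> 7) & 0xFF, bits16 & 0x7F


bits = float32_to_bfloat16_bits(1.0)
print(hex(bits))                       # 0x3f80
print(split_fields(bits))              # (0, 127, 0): exponent 127 - bias 127 = 0
print(bfloat16_bits_to_float(0xC000))  # -2.0
```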

## Data type name

The data type is specified as `"bfloat16"`.

## Configuration

None.

## Fill value representation

The fill value is specified in the same way as the core IEEE 754 floating point
numbers:
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value

The constant `"NaN"` corresponds to a representation of `"0x7fc0"`.

## Codec compatibility

### bytes

Encoded as a 2-byte little-endian or big-endian value `0bSEEEEEEEEMMMMMMM`.
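
As an illustration of the two byte orders, the sketch below serializes a 16-bit
`bfloat16` pattern with Python's `struct` module. The function names and the
`endian` parameter are hypothetical; the example value `0x7fc0` is the `"NaN"`
representation given above.

```python
import struct


def encode_bfloat16(bits16, endian="little"):
    """Serialize a 16-bit bfloat16 pattern as 2 bytes in the given byte order."""
    return struct.pack("<H" if endian == "little" else ">H", bits16)


def decode_bfloat16(data, endian="little"):
    """Read a 2-byte value back into its 16-bit pattern."""
    (bits16,) = struct.unpack("<H" if endian == "little" else ">H", data)
    return bits16


print(encode_bfloat16(0x7FC0, "little").hex())  # c07f
print(encode_bfloat16(0x7FC0, "big").hex())     # 7fc0
```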

## See also

A Python implementation is available at https://pypi.org/project/ml-dtypes/

## Current maintainers

* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google
22 changes: 22 additions & 0 deletions data-types/bfloat16/schema.json
@@ -0,0 +1,22 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"oneOf": [
{
"type": "object",
"properties": {
"name": {
"const": "bfloat16"
},
"configuration": {
"type": "object",
"additionalProperties": false
}
},
"required": ["name"],
"additionalProperties": false
},
{
"const": "bfloat16"
}
]
}
28 changes: 28 additions & 0 deletions data-types/complex_bfloat16/README.md
@@ -0,0 +1,28 @@
# complex_bfloat16 data type

Defines a complex number data type where the real and imaginary components are
represented by the `bfloat16` data type.

## Data type name

The data type is specified as `"complex_bfloat16"`.

## Configuration

None.

## Fill value representation

The fill value is specified in the same way as the core complex number data types:
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value

## Codec compatibility

### bytes

Encoded as 2 consecutive (real component followed by imaginary component) 2-byte
little-endian or big-endian values.
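
A minimal Python sketch of this layout, assuming each component is converted to
`bfloat16` by truncating its binary32 pattern to the upper 16 bits. The helper
names are hypothetical, and a real implementation may round rather than
truncate.

```python
import struct


def bfloat16_bits(x):
    """Upper 16 bits of the binary32 pattern of x (truncation, an assumption)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def encode_complex_bfloat16(z, endian="little"):
    """Real component first, then imaginary, each as a 2-byte value."""
    prefix = "<" if endian == "little" else ">"
    return struct.pack(prefix + "HH", bfloat16_bits(z.real), bfloat16_bits(z.imag))


print(encode_complex_bfloat16(complex(1.0, -2.0), "little").hex())  # 803f00c0
print(encode_complex_bfloat16(complex(1.0, -2.0), "big").hex())     # 3f80c000
```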

## Current maintainers

* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google
22 changes: 22 additions & 0 deletions data-types/complex_bfloat16/schema.json
@@ -0,0 +1,22 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"oneOf": [
{
"type": "object",
"properties": {
"name": {
"const": "complex_bfloat16"
},
"configuration": {
"type": "object",
"additionalProperties": false
}
},
"required": ["name"],
"additionalProperties": false
},
{
"const": "complex_bfloat16"
}
]
}
28 changes: 28 additions & 0 deletions data-types/complex_float16/README.md
@@ -0,0 +1,28 @@
# complex_float16 data type

Defines a complex number data type where the real and imaginary components are
represented by the `float16` data type.

## Data type name

The data type is specified as `"complex_float16"`.

## Configuration

None.

## Fill value representation

The fill value is specified in the same way as the core complex number data types:
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value

## Codec compatibility

### bytes

Encoded as 2 consecutive (real component followed by imaginary component) 2-byte
little-endian or big-endian values.
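
For illustration, Python's `struct` module supports IEEE 754 binary16 through
the `e` format character, so this layout can be sketched directly. The function
names and the `endian` parameter are hypothetical.

```python
import struct


def encode_complex_float16(z, endian="little"):
    """Real component first, then imaginary, each as a 2-byte float16."""
    prefix = "<" if endian == "little" else ">"
    return struct.pack(prefix + "ee", z.real, z.imag)


def decode_complex_float16(data, endian="little"):
    """Read two consecutive float16 values back into a Python complex."""
    prefix = "<" if endian == "little" else ">"
    real, imag = struct.unpack(prefix + "ee", data)
    return complex(real, imag)


encoded = encode_complex_float16(complex(1.0, -2.0))
print(encoded.hex())                    # 003c00c0
print(decode_complex_float16(encoded))  # (1-2j)
```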

## Current maintainers

* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google
22 changes: 22 additions & 0 deletions data-types/complex_float16/schema.json
@@ -0,0 +1,22 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"oneOf": [
{
"type": "object",
"properties": {
"name": {
"const": "complex_float16"
},
"configuration": {
"type": "object",
"additionalProperties": false
}
},
"required": ["name"],
"additionalProperties": false
},
{
"const": "complex_float16"
}
]
}
16 changes: 16 additions & 0 deletions data-types/complex_float32/README.md
@@ -0,0 +1,16 @@
# complex_float32 data type

Defines an alias of the core `complex64` data type, for consistency with the new
complex number data types for which `complexNNN` naming is not sufficient.

## Data type name

The data type is specified as `"complex_float32"`.

## Configuration

None.

## Current maintainers

* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google
22 changes: 22 additions & 0 deletions data-types/complex_float32/schema.json
@@ -0,0 +1,22 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"oneOf": [
{
"type": "object",
"properties": {
"name": {
"const": "complex_float32"
},
"configuration": {
"type": "object",
"additionalProperties": false
}
},
"required": ["name"],
"additionalProperties": false
},
{
"const": "complex_float32"
}
]
}