Skip to content

About

pydantic-zarr#

PyPI

Static typing and runtime validation for Zarr hierachies.

Overview#

pydantic-zarr expresses data stored in the Zarr format with Pydantic. Specifically, pydantic-zarr encodes Zarr groups and arrays as Pydantic models. These models are useful for formalizing the structure of Zarr hierarchies, type-checking Zarr hierarchies, and runtime validation for Zarr-based data.

import zarr
from pydantic_zarr.v2 import GroupSpec

# create a Zarr group
group = zarr.group(path='foo')
# put an array inside the group
array = zarr.create(store = group.store, path='foo/bar', shape=10, dtype='uint8')
array.attrs.put({'metadata': 'hello'})

# create a pydantic model to model the Zarr group
spec = GroupSpec.from_zarr(group)
print(spec.model_dump())
"""
{
    'zarr_version': 2,
    'attributes': {},
    'members': {
        'bar': {
            'zarr_version': 2,
            'attributes': {'metadata': 'hello'},
            'shape': (10,),
            'chunks': (10,),
            'dtype': '|u1',
            'fill_value': 0,
            'order': 'C',
            'filters': None,
            'dimension_separator': '.',
            'compressor': {
                'id': 'blosc',
                'cname': 'lz4',
                'clevel': 5,
                'shuffle': 1,
                'blocksize': 0,
            },
        }
    },
}
"""

More examples can be found in the usage guide.

Installation#

pip install -U pydantic-zarr

Limitations#

No array data operations#

This library only provides tools to represent the layout of Zarr groups and arrays, and the structure of their attributes. pydantic-zarr performs no type checking or runtime validation of the multidimensional array data contained inside Zarr arrays, and pydantic-zarr does not contain any tools for efficiently reading or writing Zarr arrays.

Supported Zarr versions#

This library supports version 2 of the Zarr format, with partial support for Zarr v3. Progress towards complete support for Zarr v3 is tracked by this issue.

Design#

A Zarr group can be modeled as an object with two properties:

  • attributes: A dict-like object, with keys that are strings, values that are JSON-serializable.
  • members: A dict-like object, with keys that strings and values that are other Zarr groups, or Zarr arrays.

A Zarr array can be modeled similarly, but without the members property (because Zarr arrays cannot contain Zarr groups or arrays), and with a set of array-specific properties like shape, dtype, etc.

Note the use of the term "modeled": Zarr arrays are useful because they store N-dimensional array data, but pydantic-zarr does not treat that data as part of the "model" of a Zarr array.

In pydantic-zarr, Zarr groups are modeled by the GroupSpec class, which is a Pydantic model with two fields:

  • attributes: either a Mapping or a pydantic.BaseModel.
  • members: either a mapping with string keys and values that must be GroupSpec or ArraySpec instances, or the value Null. The use of nullability is explained in its own section.

Zarr arrays are represented by the ArraySpec class, which has a similar attributes field, as well as fields for all the Zarr array properties (dtype, shape, chunks, etc).

GroupSpec and ArraySpec are both generic models. GroupSpec takes two type parameters, the first specializing the type of GroupSpec.attributes, and the second specializing the type of the values of GroupSpec.members (the keys of GroupSpec.members are always strings). ArraySpec only takes one type parameter, which specializes the type of ArraySpec.attributes.

Examples using this generic typing functionality can be found in the usage guide.

Nullable members#

When a Zarr group has no members, a GroupSpec model of that Zarr group will have its members attribute set to the empty dict {}. But there are scenarios where the members of a Zarr group are unknown:

  • Some Zarr storage backends do not support directory listing, in which case it is possible to access a Zarr group and inspect its attributes, but impossible to discover its members. So the members of such a Zarr group are unknown.
  • Traversing a deeply nested large Zarr group on high latency storage can be slow. This can be mitigated by only partially traversing the hierarchy, e.g. only inspecting the root group and N subgroups. This defines a sub-hierarchy of the full hierarchy; leaf groups of this subtree by definition did not have their members checked, and so their members are unknown.
  • A Zarr hierarchy can be represented as a mapping M from paths to nodes (array or group). In this case, if M["key"] is a model of a Zarr group G, then M["key/subkey"] would encode a member of G. Since the key structure of the mapping M is doing the work of encoding the members of G, there is no value in G having a members attribute that claims anything about the members of G, and so G.members should be modeled as unknown.

To handle these cases, pydantic-zarr allows the members attribute of a GroupSpec to be Null.

Standardization#

The Zarr specifications do not define a model of the Zarr hierarchy. pydantic-zarr is an implementation of a particular model that can be found formalized in this specification document, which has been proposed for inclusion in the Zarr specifications. You can find the discussion of that proposal in this pull request.