# Structures

Tiled describes data in one of a fixed group of standard *structure families*.
These are *not* Python-specific structures. They can be encoded in standard,
language-agnostic formats and transferred from the service to a client in
potentially any language.

## Supported structure families

The structure families are:

* array --- a strided array, like a [numpy](https://numpy.org) array
* awkward --- nested, variable-sized data (as implemented by [AwkwardArray](https://awkward-array.org/))
* container --- a collection of other structures, akin to a dictionary or a directory
* sparse --- a sparse array (i.e. an array which is mostly zeros)
* table --- tabular data, as in [Apache Arrow](https://arrow.apache.org) or
  [pandas](https://pandas.pydata.org/)

## How structure is encoded

Tiled can describe a structure---its shape, chunking, labels, and so on---for
the client so that the client can intelligently request the pieces that it
wants.

The structure encodings are designed to be as unoriginal as possible, using
established standards and, where some invention is required, using established
names from numpy, pandas/Arrow, and dask.

Some structures are encoded in two parts:

## Examples

These examples were generated by serving the demo tree

```
tiled serve pyobject --public tiled.examples.generated:tree
```

making an HTTP request with [httpie](https://httpie.io/)
and then extracting the portion of interest with
[jq](https://stedolan.github.io/jq/), as shown below.

### Array (single chunk)

An array is described with a shape, chunk sizes, and a data type.
The parameterization and spelling of the data type follows the
[numpy `__array_interface__` protocol](https://numpy.org/doc/stable/reference/arrays.interface.html#object.__array_interface__).
Both built-in data types and
[structured data types](https://numpy.org/doc/stable/user/basics.rec.html) are supported.


An optional field, `dims` ("dimensions"), may contain a list with
a string label for each dimension.

This `(100, 100)`-shaped array fits in a single `(100, 100)`-shaped chunk.

```
$ http :8000/api/v1/metadata/small_image | jq .data.attributes.structure
```

```json
{
  "chunks": [
    [
      100
    ],
    [
      100
    ]
  ],
  "shape": [
    100,
    100
  ],
  "dims": null,
  "resizable": false,
  "data_type": {
    "endianness": "little",
    "kind": "f",
    "itemsize": 8
  }
}
```
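
As a rough illustration of how a client might consume the `data_type` entry, the sketch below turns it into a numpy-style type string such as `<f8`. The helper name `to_typestr` is made up for this example, and treating any endianness other than `little` or `big` as "not relevant" is an assumption, not documented behavior.

```python
# Byte-order characters used by numpy type strings.
BYTE_ORDER = {"little": "<", "big": ">"}


def to_typestr(data_type):
    """Build a numpy-style type string (e.g. '<f8') from a built-in data_type."""
    # Assumption: any endianness other than little/big means byte order
    # is irrelevant, which numpy spells as '|'.
    order = BYTE_ORDER.get(data_type["endianness"], "|")
    kind = data_type["kind"]
    size = data_type["itemsize"]
    if kind == "U":
        # numpy counts unicode strings in characters (4 bytes each).
        size //= 4
    return f"{order}{kind}{size}"


structure = {
    "chunks": [[100], [100]],
    "shape": [100, 100],
    "dims": None,
    "resizable": False,
    "data_type": {"endianness": "little", "kind": "f", "itemsize": 8},
}
typestr = to_typestr(structure["data_type"])
print(typestr)  # <f8
```

If numpy is installed, passing the resulting string to `numpy.dtype` should yield the matching dtype. Kinds that carry extra parameters, such as datetimes with units, are not handled by this sketch.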

### Array (multiple chunks)

This `(10000, 10000)`-shaped array is subdivided into 4 × 4 = 16 chunks,
each of shape `(2500, 2500)`. Chunks do *not* in general have to be equally sized,
which is why the size of each chunk is given explicitly.

```
$ http :8000/api/v1/metadata/big_image | jq .data.attributes.structure
```

```json
{
  "chunks": [
    [
      2500,
      2500,
      2500,
      2500
    ],
    [
      2500,
      2500,
      2500,
      2500
    ]
  ],
  "shape": [
    10000,
    10000
  ],
  "dims": null,
  "resizable": false,
  "data_type": {
    "endianness": "little",
    "kind": "f",
    "itemsize": 8
  }
}
```
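
Because the chunk sizes are listed explicitly, a client can work out which region of the array any block covers without asking the server. The following is a minimal sketch of that arithmetic; `block_slices` is a hypothetical helper name, not a Tiled function.

```python
from itertools import accumulate, product


def block_slices(chunks, block):
    """Return the slice along each dimension covered by one block."""
    slices = []
    for sizes, index in zip(chunks, block):
        # Cumulative sums give the chunk boundaries: 0, 2500, 5000, ...
        edges = [0, *accumulate(sizes)]
        slices.append(slice(edges[index], edges[index + 1]))
    return tuple(slices)


chunks = [[2500, 2500, 2500, 2500], [2500, 2500, 2500, 2500]]

# Every block index: the Cartesian product of per-dimension chunk indices.
all_blocks = list(product(*(range(len(sizes)) for sizes in chunks)))
print(len(all_blocks))  # 16

# Block (1, 2) covers rows 2500-5000 and columns 5000-7500.
print(block_slices(chunks, (1, 2)))
```

The same function works for unequal chunks, since it relies only on the running totals of the listed sizes.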

### Array (with a structured data type)

This is a 1D array where each item has internal structure,
as in numpy's [structured data types](https://numpy.org/doc/stable/user/basics.rec.html).

```
$ http :8000/api/v1/metadata/structured_data/pets | jq .data.attributes.structure
```

```json
{
  "chunks": [
    [
      2
    ]
  ],
  "shape": [
    2
  ],
  "dims": null,
  "resizable": false,
  "data_type": {
    "itemsize": 48,
    "fields": [
      {
        "name": "name",
        "dtype": {
          "endianness": "little",
          "kind": "U",
          "itemsize": 40
        },
        "shape": null
      },
      {
        "name": "age",
        "dtype": {
          "endianness": "little",
          "kind": "i",
          "itemsize": 4
        },
        "shape": null
      },
      {
        "name": "weight",
        "dtype": {
          "endianness": "little",
          "kind": "f",
          "itemsize": 4
        },
        "shape": null
      }
    ]
  }
}
```
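
One way to read this structure is as a record layout: each field occupies a run of bytes inside the 48-byte item. The sketch below computes the byte offset of each field. It assumes a packed layout with no padding and fields whose `shape` is `null`; both are assumptions of this example, and `field_offsets` is a hypothetical helper.

```python
data_type = {
    "itemsize": 48,
    "fields": [
        {"name": "name", "shape": None,
         "dtype": {"endianness": "little", "kind": "U", "itemsize": 40}},
        {"name": "age", "shape": None,
         "dtype": {"endianness": "little", "kind": "i", "itemsize": 4}},
        {"name": "weight", "shape": None,
         "dtype": {"endianness": "little", "kind": "f", "itemsize": 4}},
    ],
}


def field_offsets(data_type):
    """Map each field name to its byte offset, assuming no padding."""
    offsets = {}
    position = 0
    for field in data_type["fields"]:
        offsets[field["name"]] = position
        position += field["dtype"]["itemsize"]
    # In a packed layout the fields exactly fill one item.
    if position != data_type["itemsize"]:
        raise ValueError("fields do not fill the item; layout may be padded")
    return offsets


print(field_offsets(data_type))  # {'name': 0, 'age': 40, 'weight': 44}
```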

### Awkward

[AwkwardArrays](https://awkward-array.org/) express nested, variable-sized
data, including arbitrary-length lists, records, mixed types, and missing data.
This often comes up in the context of event-based data, such as is used in
high-energy physics, neutron experiments, quantum computing, and very high-rate
detectors.

AwkwardArrays are specified by:

* An outer `length` (always an integer)
* A JSON `form` (specified by AwkwardArray, giving the internal layout)
* Named buffers of bytes, whose names match information in the `form`

The first two are included in the structure.

```
$ http :8000/api/v1/metadata/awkward_array | jq .data.attributes.structure
```

```json
{
  "length": 3,
  "form": {
    "class": "ListOffsetArray",
    "offsets": "i64",
    "content": {
      "class": "RecordArray",
      "fields": [
        "x",
        "y"
      ],
      "contents": [
        {
          "class": "NumpyArray",
          "primitive": "float64",
          "inner_shape": [],
          "parameters": {},
          "form_key": "node2"
        },
        {
          "class": "ListOffsetArray",
          "offsets": "i64",
          "content": {
            "class": "NumpyArray",
            "primitive": "int64",
            "inner_shape": [],
            "parameters": {},
            "form_key": "node4"
          },
          "parameters": {},
          "form_key": "node3"
        }
      ],
      "parameters": {},
      "form_key": "node1"
    },
    "parameters": {},
    "form_key": "node0"
  }
}
```
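
The `form` is a tree, and each node carries a `form_key` that ties it to its named buffers. As an illustrative sketch (not part of Tiled or AwkwardArray), the hypothetical helper below walks an abbreviated copy of the form above and collects the keys. How a key maps to actual buffer names is left to AwkwardArray and is not modeled here.

```python
# Abbreviated copy of the form shown above (empty fields omitted).
form = {
    "class": "ListOffsetArray",
    "offsets": "i64",
    "form_key": "node0",
    "content": {
        "class": "RecordArray",
        "fields": ["x", "y"],
        "form_key": "node1",
        "contents": [
            {"class": "NumpyArray", "primitive": "float64", "form_key": "node2"},
            {
                "class": "ListOffsetArray",
                "offsets": "i64",
                "form_key": "node3",
                "content": {
                    "class": "NumpyArray",
                    "primitive": "int64",
                    "form_key": "node4",
                },
            },
        ],
    },
}


def form_keys(node):
    """Yield every form_key in a form, outermost node first."""
    yield node["form_key"]
    # A node holds its children under 'content' (one) or 'contents' (several).
    children = []
    if "content" in node:
        children.append(node["content"])
    children.extend(node.get("contents", []))
    for child in children:
        yield from form_keys(child)


print(list(form_keys(form)))  # ['node0', 'node1', 'node2', 'node3', 'node4']
```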

### Sparse Array

There are a variety of ways to represent
[sparse arrays](https://en.wikipedia.org/wiki/Sparse_matrix).
The [Coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_(COO))
layout consists of writing the coordinate (e.g. row, column, or N-dimensional position)
and value of each nonzero element. An N-dimensional COO array with M nonzero
elements is described by an N×M array of coordinates and a length-M array of values.

Tiled describes this as a chunked array where each chunk contains a table
of coordinates and their values. The data types within each table are not described
at this level; they are self-described by the individual chunk payloads.

The key `layout` is currently always set to `COO`. If other sparse
representations are supported in the future, this key will indicate which one is used.

```json
{
  "shape": [
    100,
    100
  ],
  "chunks": [
    [
      100
    ],
    [
      100
    ]
  ],
  "dims": null,
  "resizable": false
}
```
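
To make the COO layout concrete, here is a minimal pure-Python sketch that converts a small dense 2D array into an N×M list of coordinates and a length-M list of values. The function name `dense_to_coo` is made up for this illustration; it is not how Tiled itself builds or serves sparse chunks.

```python
def dense_to_coo(dense):
    """Convert a 2D nested list into COO coordinates and values."""
    rows, cols, values = [], [], []
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
                values.append(value)
    # coords is N x M: one row per dimension, one column per nonzero element.
    return [rows, cols], values


dense = [
    [0, 0, 3],
    [4, 0, 0],
    [0, 0, 5],
]
coords, values = dense_to_coo(dense)
print(coords)  # [[0, 1, 2], [2, 0, 2]]
print(values)  # [3, 4, 5]
```

Here N = 2 and M = 3: only three of the nine elements need to be stored.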

### Table

With tables, we speak of "partitions" instead of "chunks". There are a
couple of important distinctions. We always know the size of a chunk before we ask
for it, but we will not know the number of rows in a partition until we
actually read it and enumerate them. Therefore, we cannot slice into a
table the same way that we can slice into arrays. We can ask for a
subset of the *columns*, and we can fetch partitions one at a time in any
order, but we cannot make requests like "rows 100-200". (Dask has the same
limitation, for the same reason.)

```
$ http :8000/api/v1/metadata/long_table | jq .data.attributes.structure
```

```json
{
  "npartitions": 5,
  "columns": [
    "A",
    "B",
    "C"
  ],
  "resizable": false,
  "arrow_schema": "data:application/vnd.apache.arrow.file;base64,..."
}
```

The structure contains a base64-encoded Apache Arrow schema. Apache Arrow
is a binary format. It explicitly does not support JSON. (There is a JSON
implementation, but the documentation states that it is intended only for
integration testing and should not be used by external code.) Therefore,
we base64-encode it.
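
The `arrow_schema` value has the shape of a data URI: a media type, a `;base64` marker, and the encoded payload. The sketch below shows one way a client might split it back into bytes using only the standard library. The payload here is a stand-in, not a real Arrow schema, and `decode_data_uri` is a hypothetical helper.

```python
import base64


def decode_data_uri(uri):
    """Split a base64 data URI into its media type and raw bytes."""
    header, _, payload = uri.partition(",")
    if not (header.startswith("data:") and header.endswith(";base64")):
        raise ValueError("expected a base64-encoded data URI")
    media_type = header[len("data:"):-len(";base64")]
    return media_type, base64.b64decode(payload)


# Stand-in payload: NOT a real Arrow schema, just bytes for demonstration.
placeholder = base64.b64encode(b"schema-bytes").decode("ascii")
uri = "data:application/vnd.apache.arrow.file;base64," + placeholder

media_type, raw = decode_data_uri(uri)
print(media_type)  # application/vnd.apache.arrow.file
print(raw)         # b'schema-bytes'
```

In practice, the decoded bytes would then be handed to an Arrow library to reconstruct the schema.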

### Container

This structure is a container for other structures. It may be compared to a
directory, a JSON object, a Python dictionary, or an HDF5 Group. Containers may
contain other containers, any other structure, or a mixture.

Some containers hold a small number of nodes, easy to list in a single request,
while others hold so many that they must be listed via multiple paginated requests. Some Tiled
deployments currently in use have containers with up to hundreds of thousands
of nodes.

Typically, a container's structure tells us only how many nodes it contains (`count`).
The `contents` key is typically set to `null`, which indicates that we will need
a separate request to fetch information about each child node.

```json
{
  "contents": null,
  "count": 2
}
```
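
A client can use these two keys to decide how to enumerate children. The following sketch shows the control flow only: `fetch_page(offset, limit)` is a hypothetical callable standing in for whatever paginated request a real client would make, and the page size is arbitrary.

```python
def iter_children(structure, fetch_page, page_size=100):
    """Yield every child of a container, whether in-lined or paginated.

    `fetch_page(offset, limit)` is a hypothetical callable that returns
    a list of child nodes; a real client would make an HTTP request here.
    """
    if structure["contents"] is not None:
        # Contents were in-lined in the structure; no extra requests needed.
        yield from structure["contents"].values()
        return
    for offset in range(0, structure["count"], page_size):
        yield from fetch_page(offset, page_size)


# Stand-in for a server holding 5 children.
children = [f"node_{i}" for i in range(5)]


def fake_fetch_page(offset, limit):
    return children[offset:offset + limit]


structure = {"contents": None, "count": 5}
listed = list(iter_children(structure, fake_fetch_page, page_size=2))
print(listed)  # ['node_0', 'node_1', 'node_2', 'node_3', 'node_4']
```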

The representation of Containers, like other Tiled structures, can be customized
using the mechanism of specs.

#### Special Case 1. Xarray

In certain cases, it is efficient to in-line all the information about the
container's contents (their metadata, structure, and more) in a single
response.


```json
{
  "contents": {
    "lat": {
      "attributes": {
        "ancestors": [
          "structured_data",
          "xarray_dataset"
        ],
        "metadata": {},
        "sorting": null,
        "specs": [
          "xarray_coord"
        ],
        "structure": {
          "chunks": [
            [
              2
            ],
            [
              2
            ]
          ],
          "dims": [
            "x",
            "y"
          ],
          "resizable": false,
          "shape": [
            2,
            2
          ],
          "data_type": {
            "endianness": "little",
            "itemsize": 8,
            "kind": "f"
          }
        },
        "structure_family": "array"
      },
      "id": "lat",
      "links": {
        "block": "http://localhost:8000/api/v1/array/block/structured_data/xarray_dataset/lat?block={index_0},{index_1}",
        "full": "http://localhost:8000/api/v1/array/full/structured_data/xarray_dataset/lat",
        "self": "http://localhost:8000/api/v1/metadata/structured_data/xarray_dataset/lat"
      },
      "meta": null
    },
    "lon": {
      "attributes": {
        "ancestors": [
          "structured_data",
          "xarray_dataset"
        ],
        "metadata": {},
        "sorting": null,
        "specs": [
          "xarray_coord"
        ],
        "structure": {
          "chunks": [
            [
              2
            ],
            [
              2
            ]
          ],
          "dims": [
            "x",
            "y"
          ],
          "resizable": false,
          "shape": [
            2,
            2
          ],
          "data_type": {
            "endianness": "little",
            "itemsize": 8,
            "kind": "f"
          }
        },
        "structure_family": "array"
      },
      "id": "lon",
      "links": {
        "block": "http://localhost:8000/api/v1/array/block/structured_data/xarray_dataset/lon?block={index_0},{index_1}",
        "full": "http://localhost:8000/api/v1/array/full/structured_data/xarray_dataset/lon",
        "self": "http://localhost:8000/api/v1/metadata/structured_data/xarray_dataset/lon"
      },
      "meta": null
    },
    "precipitation": {
      "attributes": {
        "ancestors": [
          "structured_data",
          "xarray_dataset"
        ],
        "metadata": {},
        "sorting": null,
        "specs": [
          "xarray_data_var"
        ],
        "structure": {
          "chunks": [
            [
              2
            ],
            [
              2
            ],
            [
              3
            ]
          ],
          "dims": [
            "x",
            "y",
            "time"
          ],
          "resizable": false,
          "shape": [
            2,
            2,
            3
          ],
          "data_type": {
            "endianness": "little",
            "itemsize": 8,
            "kind": "f"
          }
        },
        "structure_family": "array"
      },
      "id": "precipitation",
      "links": {
        "block": "http://localhost:8000/api/v1/array/block/structured_data/xarray_dataset/precipitation?block={index_0},{index_1},{index_2}",
        "full": "http://localhost:8000/api/v1/array/full/structured_data/xarray_dataset/precipitation",
        "self": "http://localhost:8000/api/v1/metadata/structured_data/xarray_dataset/precipitation"
      },
      "meta": null
    },
    "temperature": {
      "attributes": {
        "ancestors": [
          "structured_data",
          "xarray_dataset"
        ],
        "metadata": {},
        "sorting": null,
        "specs": [
          "xarray_data_var"
        ],
        "structure": {
          "chunks": [
            [
              2
            ],
            [
              2
            ],
            [
              3
            ]
          ],
          "dims": [
            "x",
            "y",
            "time"
          ],
          "resizable": false,
          "shape": [
            2,
            2,
            3
          ],
          "data_type": {
            "endianness": "little",
            "itemsize": 8,
            "kind": "f"
          }
        },
        "structure_family": "array"
      },
      "id": "temperature",
      "links": {
        "block": "http://localhost:8000/api/v1/array/block/structured_data/xarray_dataset/temperature?block={index_0},{index_1},{index_2}",
        "full": "http://localhost:8000/api/v1/array/full/structured_data/xarray_dataset/temperature",
        "self": "http://localhost:8000/api/v1/metadata/structured_data/xarray_dataset/temperature"
      },
      "meta": null
    },
    "time": {
      "attributes": {
        "ancestors": [
          "structured_data",
          "xarray_dataset"
        ],
        "metadata": {},
        "sorting": null,
        "specs": [
          "xarray_coord"
        ],
        "structure": {
          "chunks": [
            [
              3
            ]
          ],
          "dims": [
            "time"
          ],
          "resizable": false,
          "shape": [
            3
          ],
          "data_type": {
            "endianness": "little",
            "itemsize": 8,
            "kind": "M"
          }
        },
        "structure_family": "array"
      },
      "id": "time",
      "links": {
        "block": "http://localhost:8000/api/v1/array/block/structured_data/xarray_dataset/time?block={index_0}",
        "full": "http://localhost:8000/api/v1/array/full/structured_data/xarray_dataset/time",
        "self": "http://localhost:8000/api/v1/metadata/structured_data/xarray_dataset/time"
      },
      "meta": null
    }
  },
  "count": 5
}
```
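
With everything in-lined, a client has enough information to sort the children into coordinates and data variables and to build block URLs without further metadata requests. The sketch below works on a trimmed-down copy of the response above (two of the five nodes, only the keys it uses); the grouping logic is illustrative and not taken from Tiled's client.

```python
# Trimmed-down copy of the response above: two nodes, only the keys used here.
contents = {
    "lat": {
        "attributes": {
            "specs": ["xarray_coord"],
            "structure": {"dims": ["x", "y"], "shape": [2, 2]},
        },
        "links": {
            "block": "http://localhost:8000/api/v1/array/block/structured_data/xarray_dataset/lat?block={index_0},{index_1}",
        },
    },
    "temperature": {
        "attributes": {
            "specs": ["xarray_data_var"],
            "structure": {"dims": ["x", "y", "time"], "shape": [2, 2, 3]},
        },
        "links": {
            "block": "http://localhost:8000/api/v1/array/block/structured_data/xarray_dataset/temperature?block={index_0},{index_1},{index_2}",
        },
    },
}

# Sort children by spec, recording the dimension labels of each.
coords, data_vars = {}, {}
for name, node in contents.items():
    attrs = node["attributes"]
    target = coords if "xarray_coord" in attrs["specs"] else data_vars
    target[name] = tuple(attrs["structure"]["dims"])

print(coords)     # {'lat': ('x', 'y')}
print(data_vars)  # {'temperature': ('x', 'y', 'time')}

# The 'block' link is a template; fill in one index per dimension.
template = contents["temperature"]["links"]["block"]
url = template.format(index_0=0, index_1=0, index_2=0)
print(url)
```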

#### Special Case 2. Composite

Composite is a specialized container-like structure designed to link together multiple tables and arrays that
store related scientific data. It does not support nesting but provides a common namespace across all columns of
the contained tables and arrays (thus, name collisions are forbidden). This makes it possible to abstract away
the disparate internal storage mechanisms (e.g. Parquet for tables and zarr for arrays) and present the user with a
smooth, homogeneous interface for data access. Composite structures do not support pagination and are not
recommended for "wide" datasets with more than ~1000 items (columns and arrays) in the namespace.
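
The shared-namespace rule can be pictured as merging the names contributed by each part and rejecting duplicates. The sketch below is purely illustrative: `build_namespace` and the part names are hypothetical, and this is not how Tiled implements Composite.

```python
def build_namespace(parts):
    """Merge the names exposed by each part into one flat namespace.

    `parts` maps a part name to the list of names it contributes:
    column names for a table, or the array's own name for an array.
    """
    namespace = {}
    for part, names in parts.items():
        for name in names:
            if name in namespace:
                raise ValueError(
                    f"{name!r} appears in both {namespace[name]!r} and {part!r}"
                )
            namespace[name] = part
    return namespace


parts = {
    "readings": ["time", "temperature"],  # columns of a table
    "image": ["image"],                   # a single array
}
namespace = build_namespace(parts)
print(namespace["temperature"])  # readings
```

Adding a second table that also has a `time` column would raise `ValueError`, mirroring the rule that name collisions are forbidden.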
