Metadata-Version: 2.3
Name: lamindb
Version: 1.12.1
Summary: A data lakehouse for biology.
Author-email: Lamin Labs <open-source@lamin.ai>
Requires-Python: >=3.10,<3.14
Description-Content-Type: text/markdown
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: lamin_utils==0.15.0
Requires-Dist: lamin_cli==1.8.0
Requires-Dist: lamindb_setup[aws]==1.11.0
Requires-Dist: bionty==1.8.1
Requires-Dist: wetlab==1.6.1
Requires-Dist: nbproject==0.11.1
Requires-Dist: jupytext
Requires-Dist: nbconvert>=7.2.1
Requires-Dist: pyyaml
Requires-Dist: pyarrow
Requires-Dist: pandera>=0.24.0
Requires-Dist: typing_extensions!=4.6.0
Requires-Dist: python-dateutil
Requires-Dist: pandas>=2.0.0
Requires-Dist: anndata>=0.8.0,<=0.12.2
Requires-Dist: fsspec
Requires-Dist: graphviz
Requires-Dist: psycopg2-binary
Requires-Dist: tomlkit ; extra == "dev"
Requires-Dist: line_profiler ; extra == "dev"
Requires-Dist: pre-commit ; extra == "dev"
Requires-Dist: nox ; extra == "dev"
Requires-Dist: laminci>=0.3 ; extra == "dev"
Requires-Dist: pytest>=6.0 ; extra == "dev"
Requires-Dist: coverage ; extra == "dev"
Requires-Dist: pytest-cov<7.0.0 ; extra == "dev"
Requires-Dist: mudata ; extra == "dev"
Requires-Dist: nbproject_test>=0.6.0 ; extra == "dev"
Requires-Dist: faker-biology ; extra == "dev"
Requires-Dist: pronto ; extra == "dev"
Requires-Dist: readfcs>=2.0.1 ; extra == "fcs"
Requires-Dist: lamindb_setup[gcp] ; extra == "gcp"
Requires-Dist: numcodecs<0.16.0 ; extra == "zarr"
Requires-Dist: zarr>=2.16.0,<3.0.0a0 ; extra == "zarr"
Project-URL: Home, https://github.com/laminlabs/lamindb
Provides-Extra: dev
Provides-Extra: fcs
Provides-Extra: gcp
Provides-Extra: zarr

[![Stars](https://img.shields.io/github/stars/laminlabs/lamindb?logo=GitHub)](https://github.com/laminlabs/lamindb)
[![codecov](https://codecov.io/gh/laminlabs/lamindb/branch/main/graph/badge.svg?token=VKMRJ7OWR3)](https://codecov.io/gh/laminlabs/lamindb)
[![Docs](https://img.shields.io/badge/docs-humans-yellow)](https://docs.lamin.ai)
[![DocsLLMs](https://img.shields.io/badge/docs-LLMs-yellow)](https://docs.lamin.ai/summary.md)
[![pypi](https://img.shields.io/pypi/v/lamindb?color=blue&label=pypi%20package)](https://pypi.org/project/lamindb)
[![PyPI Downloads](https://img.shields.io/pepy/dt/lamindb?logo=pypi)](https://pepy.tech/project/lamindb)

# LaminDB - A data lakehouse for biology

LaminDB organizes datasets through validation & annotation and provides data lineage, queryability & reproducibility on top of [FAIR](https://en.wikipedia.org/wiki/FAIR_data) data.

<details>
<summary>Why?</summary>

Reproducing analytical results or understanding how a dataset or model was created can be a pain, let alone training models on historical data, LIMS & ELN systems, orthogonal assays, or datasets generated by other teams.
Even maintaining a mere overview of a project's or team's datasets & analyses is harder than it sounds.

Biological datasets are typically managed with versioned storage systems, GUI-focused community or SaaS platforms, structureless data lakes, rigid data warehouses (SQL, monolithic arrays), and data lakehouses for tabular data.

LaminDB extends the lakehouse architecture to biological registries & datasets beyond tables (`DataFrame`, `AnnData`, `.zarr`, `.tiledbsoma`, ...) with enough structure to enable queries and enough freedom to keep the pace of R&D high.
Moreover, it provides context through data lineage -- tracing data and code, scientists and models -- and abstractions for biological domain knowledge and experimental metadata.

</details>

**Highlights.**

- **data lineage:** track inputs & outputs of notebooks, scripts, functions & pipelines with a single line of code
- **unified access:** storage locations (local, S3, GCP, ...), SQL databases (Postgres, SQLite) & ontologies
- **lakehouse:** manage, monitor & validate features, labels & dataset schemas; distributed queries and batch loading
- **biological formats:** validate & annotate `DataFrame`, `AnnData`, `SpatialData`, ... backed by `parquet`, `zarr`, HDF5, LanceDB, ...
- **biological entities:** manage experimental metadata & ontologies based on the Django ORM
- **reproducible & auditable:** auto-version & timestamp execution reports, source code & compute environments, attribute records to users
- **zero lock-in & scalable:** runs in your infrastructure; is _not_ a client for a rate-limited REST API
- **extendable:** create custom plug-ins for your own applications based on the Django ecosystem
- **integrations:** visualization tools like [vitessce](https://docs.lamin.ai/vitessce), workflow managers like [nextflow](https://docs.lamin.ai/nextflow) & [redun](https://docs.lamin.ai/redun), and [other tools](https://docs.lamin.ai/integrations)
- **production-ready:** used in BigPharma, BioTech, hospitals & top labs

LaminDB can be connected to LaminHub to serve as a [LIMS](https://en.wikipedia.org/wiki/Laboratory_information_management_system) for wetlab scientists, closing the drylab-wetlab feedback loop: [lamin.ai](https://lamin.ai).

## Docs

Copy [summary.md](https://docs.lamin.ai/summary.md) into an LLM chat and let AI explain, or read the [docs](https://docs.lamin.ai) yourself.

## Setup

<!-- copied from quick-setup-lamindb.md -->

Install the `lamindb` Python package:

```shell
pip install lamindb
```

Create a LaminDB instance:

```shell
lamin init --storage ./quickstart-data  # or s3://my-bucket, gs://my-bucket
```

Or if you have write access to an instance, connect to it:

```shell
lamin connect account/name
```

## Quickstart

<!-- copied from preface.md -->

Track a script or notebook run with source code, inputs, outputs, logs, and environment.

<!-- copied from py-quickstart.py -->

```python
import lamindb as ln

ln.track()  # track a run
with open("sample.fasta", "w") as f:  # write an example FASTA file
    f.write(">seq1\nACGT\n")
ln.Artifact("sample.fasta", key="sample.fasta").save()  # create an artifact
ln.finish()  # finish the run
```

<!-- from here on, slight deviation from preface.md, where all this is treated in the walk through in more depth -->

This code snippet creates an artifact, which can store a dataset or model as a file or folder in various formats.
Running the snippet as a script (`python create-fasta.py`) produces the following data lineage.

```python
artifact = ln.Artifact.get(key="sample.fasta")  # query artifact by key
artifact.view_lineage()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/EkQATsQL5wqC95Wj0005.png" width="250">

You'll know how the artifact was created and what it's used for ([interactive visualization](https://lamin.ai/laminlabs/lamindata/artifact/8incOOgjn6F0K1TS)). Basic metadata is captured as well:

```python
artifact.describe()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BOTCBgHDAvwglN3U0002.png" width="550">

You can validate & annotate datasets with any kind of metadata and then access them via queries & search. Here is a more [comprehensive example](https://lamin.ai/laminlabs/lamindata/artifact/9K1dteZ6Qx0EXK8g):

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/6sofuDVvTANB0f480002.png" width="850">

To annotate an artifact with a label, use:

```python
my_experiment = ln.Record(name="My experiment").save()  # create a label in the universal label ontology
artifact.records.add(my_experiment)  # annotate the artifact with the label
```

To query for a set of artifacts, use the `filter()` method:

```python
ln.Artifact.filter(records=my_experiment, suffix=".fasta").to_dataframe()  # query by suffix and the record we just created
ln.Artifact.filter(transform__key="create-fasta.py").to_dataframe()  # query by the name of the script we just ran
```

If you have a structured dataset like a `DataFrame`, an `AnnData`, or another array, you can validate the content of the dataset (and parse annotations).
Here is [an example for a dataframe](https://docs.lamin.ai/tutorial#validate-an-artifact).

With a large body of validated datasets, you can access data through distributed queries & batch streaming; see [docs.lamin.ai/arrays](https://docs.lamin.ai/arrays).

