Module: dbs/chroma.py

Purpose:

This module provides a localised wrapper and specialised functionality around the langchain_community.vectorstores.Chroma class, for interacting with a Chroma database.

Platform:

Linux/Windows | Python 3.10+

Developer:

J Berendt

Email:

development@s3dev.uk

Comments:

This module uses the langchain_community.vectorstores.Chroma wrapper class rather than the base chromadb library, because the wrapper provides the add_texts method, which supports GPU processing and parallelisation. The add_texts method is called by this module’s add_documents() method.

class ChromaDB(*args: Any, **kwargs: Any)

Bases: Chroma

Wrapper class around the langchain_community.vectorstores.Chroma class, which itself wraps the chromadb library.

Parameters:
  • path (str) – Path to the chroma database’s directory.

  • collection (str) – Collection name.

  • offline (bool, optional) – Remain offline and use the cached embedding function model, rather than obtaining one online. Defaults to False.

property client

Accessor to the chromadb.PersistentClient class.

property collection

Accessor to the chromadb client’s collection object.

property embedding_function

Accessor to the embedding function used.

property path: str

Accessor to the database’s path.

add_documents(docs: list[langchain_core.documents.base.Document])

Add multiple documents to the collection.

This method overrides the base class’ add_documents method to enable local ID derivation. Knowing how the IDs are derived makes the documents in the database easier to understand and to query. Each ID is derived locally by the _preproc() method from the file’s basename, page number and page content.

Additionally, this method wraps the langchain_community.vectorstores.Chroma.add_texts() method which supports GPU processing and parallelisation.

Parameters:

docs (list) – A list of langchain_core.documents.base.Document objects.
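
The override pattern described above can be sketched as follows. This is a minimal illustration rather than the module's implementation: Document and StubStore are stand-ins for the langchain classes, LocalIDStore and _make_id are hypothetical names, the metadata keys 'source' and 'page' are assumed, and the placeholder ID is a plain string rather than the hash the module derives. Returning the list of IDs is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Minimal stand-in for langchain_core.documents.base.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class StubStore:
    """Stand-in for the base vector store; records what add_texts receives."""

    def __init__(self) -> None:
        self.rows: dict[str, tuple[str, dict]] = {}

    def add_texts(self, texts, metadatas=None, ids=None) -> list[str]:
        # This stub simply overwrites a row when an ID is repeated.
        for id_, text, meta in zip(ids, texts, metadatas):
            self.rows[id_] = (text, meta)
        return list(ids)


class LocalIDStore(StubStore):
    """Hypothetical subclass: derive IDs locally, then defer to add_texts."""

    def add_documents(self, docs: list[Document]) -> list[str]:
        # Split each document into the parallel lists add_texts expects.
        ids = [self._make_id(d) for d in docs]
        texts = [d.page_content for d in docs]
        metas = [d.metadata for d in docs]
        return self.add_texts(texts=texts, metadatas=metas, ids=ids)

    @staticmethod
    def _make_id(doc: Document) -> str:
        # Placeholder ID (assumption); the module derives a hash instead.
        return f"{doc.metadata.get('source', '')}:{doc.metadata.get('page', 0)}"
```

Because the IDs are computed by the caller rather than generated by the store, adding the same documents twice targets the same rows instead of creating duplicates.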

show_all()

Return the entire contents of the collection.

This is an alias for .collection.get().

_get_embedding_function_model() → str

Derive the path to the embedding function model.

Note:

If offline=True was passed into the class constructor, the model cache is used if available; otherwise, the user is warned.

If online usage is allowed, the model is obtained by the means defined by the embedding function constructor.

Returns:

The name of the model or, if offline, the path to the model’s cache. The returned value is passed into the embedding function constructor.

Return type:

str
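
The offline/online resolution described above might look something like the sketch below. The model name, cache directory and function name are all hypothetical, chosen only for illustration. The documentation also does not say what is returned when the cache is missing, so returning the cache path anyway is an assumption.

```python
import os
import warnings

# Hypothetical values, for illustration only.
MODEL_NAME = 'all-MiniLM-L6-v2'
CACHE_DIR = os.path.join(os.path.expanduser('~'), '.cache', 'models')


def get_embedding_function_model(offline: bool,
                                 model_name: str = MODEL_NAME,
                                 cache_dir: str = CACHE_DIR) -> str:
    """Return the model name (online) or the cached model's path (offline)."""
    if not offline:
        # Online: the embedding function constructor resolves the name.
        return model_name
    cached = os.path.join(cache_dir, model_name)
    if not os.path.isdir(cached):
        # Assumption: warn only, and still return the expected cache path.
        warnings.warn(f'Offline mode requested, but no cached model found at: {cached}')
    return cached
```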

static _preproc(docs: list)

Pre-process the document objects to create the IDs.

Parse the Document object into its parts for storage. Additionally, create the ID as a hash of the source document’s basename, page number and content.
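
One way such an ID could be derived is sketched below. The hash algorithm (SHA-256), the metadata keys ('source' and 'page'), the field separator and the name derive_id are all assumptions; the documentation states only that the ID is a hash of the basename, page number and content. Document is a minimal stand-in for the langchain class.

```python
import hashlib
import os
from dataclasses import dataclass, field


@dataclass
class Document:
    """Minimal stand-in for langchain_core.documents.base.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def derive_id(doc: Document) -> str:
    """Hash the source basename, page number and page content into an ID."""
    # Only the basename is used, so the ID ignores the file's directory.
    basename = os.path.basename(doc.metadata.get('source', ''))
    page = doc.metadata.get('page', 0)
    payload = f'{basename}|{page}|{doc.page_content}'
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()
```

With this scheme, the same page of the same file always maps to the same ID, whichever directory the file was loaded from, while a change to the page number or the content produces a different ID.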

_set_client()

Set the database client object.

_set_collection()

Set the database collection object.

_set_embedding_fn()

Set the embedding function object.