Base (Private) Module: loaders/_chromabaseloader.py

Purpose:

This module provides the base functionality for parsing and storing a document’s data into a Chroma vector database.

Platform:

Linux/Windows | Python 3.10+

Developer:

J Berendt

Email:

development@s3dev.uk

Comments:

n/a

Attention

This module is not designed to be interacted with directly, only via the appropriate interface class(es).

Rather, please create an instance of a Chroma document-type-specific loader object using one of the following classes:

class _ChromaBaseLoader(dbpath: str | ChromaDB, collection: str = None, *, split_text: bool = True, load_keywords: bool = False, llm: object = None, offline: bool = False)[source]

Bases: object

Base class for loading documents into a Chroma vector database.

Parameters:
  • dbpath (str | ChromaDB) – Either the full path to the Chroma database directory, or an instance of a ChromaDB class. If the instance is passed, the collection argument is ignored.

  • collection (str, optional) – Name of the Chroma database collection. Only required if the db parameter is a path. Defaults to None.

  • split_text (bool, optional) – Split the document into chunks, before loading it into the database. Defaults to True.

  • load_keywords (bool, optional) – Derive keywords from the document and load these into the sister keywords collection. Defaults to False.

  • llm (object, optional) – If deriving keywords, this is the LLM which will do the derivation. Defaults to None.

  • offline (bool, optional) – Remain offline and use the locally cached embedding function model. Defaults to False.

property chroma

Accessor to the database client object.

property parser

Accessor to the document parser object.

_already_loaded() bool[source]

Test if the file has already been loaded into the collection.

Logic:

This test is performed by querying the collection for a ‘source’ metadata field equal to the filename. As this uses a chromadb ‘filter’ (i.e. $eq), testing for partial matches is not possible at this time.

If the filename is different (in any way) from the source’s filename in the database, the file will be loaded again.

Returns:

True if the exact filename was found in the collection’s metadata, otherwise False.

Return type:

bool
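The exact-match behaviour described above can be sketched as follows. `FakeCollection` is a hypothetical in-memory stand-in for a chromadb collection; the real collection exposes the same `get(where=...)` interface with `$eq` filters.

```python
class FakeCollection:
    """In-memory stand-in mimicking chromadb's metadata filtering."""

    def __init__(self, metadatas: list[dict]):
        # e.g. [{'source': '/data/report.pdf'}, ...]
        self._metadatas = metadatas

    def get(self, where: dict) -> dict:
        # Only the {'field': {'$eq': value}} filter form is modelled here.
        (field, cond), = where.items()
        value = cond["$eq"]
        ids = [str(i) for i, m in enumerate(self._metadatas) if m.get(field) == value]
        return {"ids": ids}


def already_loaded(collection, filename: str) -> bool:
    """True if the exact filename exists in the collection's 'source' metadata."""
    # An $eq filter means partial-path matches are *not* detected; any
    # difference in the filename causes the file to be loaded again.
    result = collection.get(where={"source": {"$eq": filename}})
    return len(result["ids"]) > 0


coll = FakeCollection([{"source": "/data/report.pdf"}])
print(already_loaded(coll, "/data/report.pdf"))  # exact match -> True
print(already_loaded(coll, "report.pdf"))        # partial match -> False
```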

_check_parameters() None[source]

Verify the class parameters are viable.

Raises:

ValueError – If the load_keywords argument is True and the llm argument is None, or the inverse. The two arguments must be provided together, or not at all.
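The validation reduces to a both-or-neither check, which can be sketched as below. This is a hypothetical standalone version; the real method presumably reads these values from the instance.

```python
def check_keyword_parameters(load_keywords: bool, llm: object) -> None:
    """Raise ValueError unless load_keywords and llm are provided together."""
    # The pair must 'sum' to 0 (neither provided) or 2 (both provided):
    # deriving keywords requires an LLM, and an LLM without the flag is
    # most likely a caller error.
    if sum((bool(load_keywords), llm is not None)) == 1:
        raise ValueError(
            "The load_keywords and llm arguments must be provided together."
        )
```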

_create_documents() bool[source]

Stub method; overridden by the child class.

_get_keywords() str[source]

Query the document (using the LLM) to extract the keywords.

_load(path: str, **kwargs)[source]

Load the provided file into the vector store.

Parameters:

path (str) – Full path to the file to be loaded.

Keyword Arguments:

Those passed from the document-type-specific loader’s load() method.

_load_worker() bool[source]

Load the split documents into the database collection.

Returns:

True if loaded successfully, otherwise False. Success is based on the number of records after the load being greater than the number of records before the load, and no exceptions being raised.

Return type:

bool

_parse_text(**kwargs) bool[source]

Stub method, overridden by the child class.

static _print_summary(success: bool)[source]

Print an end-of-processing summary.

Parameters:

success (bool) – Success flag from the processor.

_set_db_client() bool[source]

Set the database client object.

If the _db object is a string, this is inferred as the path to the database. Otherwise, it is inferred as the database object itself.

Returns:

True if the database object is set without error. Otherwise False.

Return type:

bool

_set_parser()[source]

Stub method, overridden by the child class.

_set_text_splitter() bool[source]

Define the text splitter to be used.

Returns:

True, always.

Return type:

bool

_split_texts() bool[source]

Split the document text using a recursive text splitter.

Note

If the split_text parameter was passed as False on instantiation, the texts will not be split. Rather, the _docs list is simply copied to the _docss attribute.

Returns:

True if the text was split (or copied) successfully, otherwise False.

Return type:

bool
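A minimal, self-contained sketch of recursive splitting is shown below. The real implementation likely wraps a library splitter (such as LangChain’s RecursiveCharacterTextSplitter); the chunk size and separator hierarchy here are illustrative.

```python
def split_text(text: str,
               chunk_size: int = 100,
               separators: tuple = ("\n\n", "\n", " ")) -> list[str]:
    """Recursively split text on progressively finer separators."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            parts = text.split(sep)
            chunks, buf = [], ""
            for part in parts:
                # Greedily merge parts back together while under the limit.
                candidate = f"{buf}{sep}{part}" if buf else part
                if len(candidate) <= chunk_size:
                    buf = candidate
                else:
                    if buf:
                        chunks.append(buf)
                    buf = part
            if buf:
                chunks.append(buf)
            # Recurse into any chunk still over the limit (next separator).
            return [c for chunk in chunks
                    for c in split_text(chunk, chunk_size, separators)]
    # No separator present: fall back to a hard character cut.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```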

_store_keywords(kwds: str) bool[source]

Store the extracted keywords into the keywords collection.

Parameters:

kwds (str) – A string containing the keywords extracted from the document.

Returns:

True if loaded successfully, otherwise False.

Return type:

bool

_test_load(nrecs_b: int, nrecs_a: int) bool[source]

Test the document was loaded successfully.

Test:
  • Given a count of records before the load, verify the number of records after the load is equal to the number of records before, plus the number of split documents.

Parameters:
  • nrecs_b (int) – Number of records before the load.

  • nrecs_a (int) – Number of records after the load.

Returns:

True if the number of records before the load, plus the number of splits, equals the number of records after the load.

Return type:

bool