Base (Private) Module: loaders/_chromabaseloader.py
- Purpose:
This module provides the base functionality for parsing and storing a document’s data into a Chroma vector database.
- Platform:
Linux/Windows | Python 3.10+
- Developer:
J Berendt
- Email:
- Comments:
n/a
Attention
This module is not designed to be interacted with directly, only via the appropriate interface class(es).
Rather, please create an instance of a Chroma document-type-specific loader object using one of the following classes:
- class _ChromaBaseLoader(dbpath: str | ChromaDB, collection: str = None, *, split_text: bool = True, load_keywords: bool = False, llm: object = None, offline: bool = False)
Bases:
object
Base class for loading documents into a Chroma vector database.
- Parameters:
dbpath (str | ChromaDB) – Either the full path to the Chroma database directory, or an instance of a ChromaDB class. If the instance is passed, the collection argument is ignored.
collection (str, optional) – Name of the Chroma database collection. Only required if the dbpath parameter is a path. Defaults to None.
split_text (bool, optional) – Split the document into chunks before loading it into the database. Defaults to True.
load_keywords (bool, optional) – Derive keywords from the document and load these into the sister keywords collection. Defaults to False.
llm (object, optional) – If deriving keywords, this is the LLM which will do the derivation. Defaults to None.
offline (bool, optional) – Remain offline and use the locally cached embedding function model. Defaults to False.
- property chroma
Accessor to the database client object.
- property parser
Accessor to the document parser object.
- _already_loaded() -> bool
Test if the file has already been loaded into the collection.
- Logic:
This test is performed by querying the collection for a metadata ‘source’ which equals the filename. As this uses a chromadb ‘filter’ (i.e. $eq), testing for partial matches is not possible at this time.
If the filename is different (in any way) from the source’s filename in the database, the file will be loaded again.
- Returns:
True if the exact filename was found in the collection’s metadata, otherwise False.
- Return type:
bool
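The exact-match behaviour described above can be illustrated with a small stand-in. This is a sketch only, not the module's implementation: `matches_where` and `already_loaded` are hypothetical helpers, and a plain list of dictionaries stands in for the collection's metadata. It mimics how an equality-only filter on the ‘source’ key behaves.

```python
def matches_where(metadata: dict, where: dict) -> bool:
    """Mimic an equality-only ($eq) metadata filter (hypothetical helper)."""
    for key, cond in where.items():
        if metadata.get(key) != cond["$eq"]:
            return False
    return True


def already_loaded(metadatas: list, filename: str) -> bool:
    """Return True only if a record's 'source' equals the filename exactly."""
    # Same shape as a Chroma-style metadata filter using the $eq operator.
    where = {"source": {"$eq": filename}}
    return any(matches_where(m, where) for m in metadatas)


# Stand-in for the metadata already stored in the collection.
stored = [{"source": "report-2024.pdf"}]

print(already_loaded(stored, "report-2024.pdf"))  # True
print(already_loaded(stored, "Report-2024.pdf"))  # False
```

Because the comparison is strict equality, a change in case, extension, or directory prefix is enough for the file to be treated as new.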
- _check_parameters() -> None
Verify the class parameters are viable.
- Raises:
ValueError – If the load_keywords argument is True and the llm argument is None, or the inverse. The two arguments must be supplied together or not at all (counted as flags, they must sum to 0 or 2).
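A minimal sketch of this both-or-neither check, assuming the rule is exactly as documented. The standalone function `check_parameters` and the error message are illustrative and are not taken from the module.

```python
def check_parameters(load_keywords: bool, llm: object = None) -> None:
    """Raise ValueError unless both keyword arguments are set, or neither is."""
    # Count how many of the two keyword-related arguments were supplied.
    supplied = int(bool(load_keywords)) + int(llm is not None)
    # A count of 0 or 2 is consistent; a count of 1 is not.
    if supplied == 1:
        raise ValueError(
            "load_keywords and llm must be provided together, or not at all."
        )


check_parameters(False, None)      # OK: neither supplied
check_parameters(True, object())   # OK: both supplied

try:
    check_parameters(True, None)   # keywords requested, but no LLM given
except ValueError as err:
    print(f"Rejected: {err}")
```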
- _load(path: str, **kwargs)
Load the provided file into the vector store.
- Parameters:
path (str) – Full path to the file to be loaded.
- Keyword Arguments:
Those passed from the document-type-specific loader’s load() method.
- _load_worker() -> bool
Load the split documents into the database collection.
- Returns:
True if loaded successfully, otherwise False. Success is based on the number of records after the load being greater than the number of records before the load, or no exceptions being raised.
- Return type:
bool
- static _print_summary(success: bool)
Print an end of processing summary.
- Parameters:
success (bool) – Success flag from the processor.
- _set_db_client() -> bool
Set the database client object.
If the _db object is a string, this is inferred as the path to the database. Otherwise, it is inferred as the database object itself.
- Returns:
True if the database object is set without error. Otherwise False.
- Return type:
bool
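The path-or-object inference can be sketched as a simple type check. Everything here is an assumption made for illustration: `FakeChromaDB` is a hypothetical stand-in for the ChromaDB class, and `set_db_client` returns the client rather than a success flag so the result is easy to inspect.

```python
class FakeChromaDB:
    """Hypothetical stand-in for the ChromaDB wrapper class."""

    def __init__(self, path: str, collection: str = None):
        self.path = path
        self.collection = collection


def set_db_client(db, collection: str = None):
    """Return a database object from either a path or an existing object."""
    # A string is inferred to be the path to the database directory.
    if isinstance(db, str):
        return FakeChromaDB(path=db, collection=collection)
    # Anything else is inferred to be the database object itself;
    # the collection argument is ignored in this case.
    return db


from_path = set_db_client("/tmp/vectordb", collection="docs")
from_obj = set_db_client(from_path, collection="ignored")
print(from_obj is from_path)  # True
```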
- _set_text_splitter() -> bool
Define the text splitter to be used.
- Returns:
True, always.
- Return type:
bool
- _split_texts() -> bool
Split the document text using a recursive text splitter.
Note
If the split_text parameter was passed as False on instantiation, the texts will not be split. Rather, the _docs list is simply copied to the _docss attribute.
- Returns:
True if the text was split (or copied) successfully, otherwise False.
- Return type:
bool
- _store_keywords(kwds: str) -> bool
Store the extracted keywords into the keywords collection.
- Parameters:
kwds (str) – A string containing the keywords extracted from the document.
- Returns:
True if loaded successfully, otherwise False.
- Return type:
bool
- _test_load(nrecs_b: int, nrecs_a: int) -> bool
Test the document was loaded successfully.
- Test:
Given a count of records before the load, verify the number of records after the load is equal to the number of records before, plus the number of split documents.
- Parameters:
nrecs_b (int) – Number of records before the load.
nrecs_a (int) – Number of records after the load.
- Returns:
True if the number of records before the load plus the number of splits is equal to the number of records after the load.
- Return type:
bool
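The record-count test amounts to a single equality. The sketch below assumes the number of split documents is passed in explicitly as `nsplits`; in the class itself this count presumably comes from the loader's own state. The function name `load_succeeded` is illustrative.

```python
def load_succeeded(nrecs_b: int, nrecs_a: int, nsplits: int) -> bool:
    """Check that the load added exactly one record per split document."""
    # Records after the load must equal records before, plus the splits.
    return nrecs_a == nrecs_b + nsplits


# 120 records before the load, 8 split documents loaded.
print(load_succeeded(nrecs_b=120, nrecs_a=128, nsplits=8))  # True
print(load_succeeded(nrecs_b=120, nrecs_a=127, nsplits=8))  # False
```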