Base (Private) Module: loaders/_chromabasepdfloader.py

Purpose:

This module provides the mid-level functionality to parse PDF files and store them in a Chroma vector database.

Platform:

Linux/Windows | Python 3.10+

Developer:

J Berendt

Email:

development@s3dev.uk

Comments:

n/a

Attention

This module is not designed to be interacted with directly, only via the appropriate interface class(es).

Rather, please create an instance of a Chroma PDF document loading object using the following class:

class _ChromaBasePDFLoader(dbpath: str | ChromaDB, collection: str = None, *, split_text: bool = True, load_keywords: bool = False, llm: object = None, offline: bool = False)

Bases: _ChromaBaseLoader

Base class for loading PDF documents into a Chroma vector database.

This class is a specialised version of the _ChromaBaseLoader class, designed to handle PDF documents.

Parameters:
  • dbpath (str | ChromaDB) – Either the full path to the Chroma database directory, or an instance of a ChromaDB class. If an instance is passed, the collection argument is ignored.

  • collection (str, optional) – Name of the Chroma database collection. Only required if the dbpath parameter is a path. Defaults to None.

  • split_text (bool, optional) – Split the document into chunks before loading it into the database. Defaults to True.

  • load_keywords (bool, optional) – Derive keywords from the document and load these into the sister keywords collection. Defaults to False.

  • llm (object, optional) – If deriving keywords, this is the LLM which will do the derivation. Defaults to None.

  • offline (bool, optional) – Remain offline and use the locally cached embedding function model. Defaults to False.
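The relationship between dbpath and collection described above can be illustrated with a minimal sketch. FakeChromaDB and resolve_db are hypothetical stand-ins invented for this illustration; they are not part of the library, and raising ValueError for a missing collection is an assumption rather than documented behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FakeChromaDB:
    """Hypothetical stand-in for a ChromaDB instance (not the real class)."""
    path: str
    collection: str


def resolve_db(dbpath: str | FakeChromaDB,
               collection: str | None = None) -> FakeChromaDB:
    """Return a database object from either a path or an existing instance."""
    if isinstance(dbpath, FakeChromaDB):
        # An instance already carries its collection, so the
        # collection argument is ignored.
        return dbpath
    if collection is None:
        # Assumption: a path without a collection name is treated as an error.
        raise ValueError('collection is required when dbpath is a path')
    return FakeChromaDB(path=dbpath, collection=collection)
```

Passing a path requires a collection name, whereas passing an existing instance returns that instance unchanged, whatever collection value accompanies it.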

_create_documents() → bool

Convert each extracted page into a Document object.

Returns:

True if the pages are loaded as Document objects successfully. Otherwise False.

Return type:

bool
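As a rough sketch of the page-to-Document conversion, the following assumes a simple Document container with page_content and metadata fields. The Document dataclass, the create_documents function and the metadata keys (source, pageno) are all hypothetical names chosen for this illustration, not the library's own.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for the Document type used by the loader."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def create_documents(pages: list[str], source: str) -> tuple[bool, list[Document]]:
    """Wrap each extracted page in a Document, numbering pages from 1."""
    docs = [
        Document(page_content=text, metadata={'source': source, 'pageno': i})
        for i, text in enumerate(pages, start=1)
    ]
    # Success is reported as True only if at least one Document was created.
    return bool(docs), docs
```

The boolean mirrors the documented return value: True when documents were created, otherwise False.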

_parse_text(**kwargs) → bool

Parse text from the document.

Keyword Arguments:

Those to be passed into the text extraction method.

Returns:

True if the parser’s ‘text’ object is populated, otherwise False.

Return type:

bool
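The keyword pass-through and the populated-text check can be sketched as follows. FakeParser, its extract_text method and the remove_blank option are hypothetical names used only to make the example self-contained; the real parser's interface may differ.

```python
class FakeParser:
    """Hypothetical parser stand-in holding raw page strings."""

    def __init__(self, pages: list[str]):
        self._pages = pages
        self.text: list[str] = []

    def extract_text(self, *, remove_blank: bool = False) -> None:
        """Populate self.text, optionally dropping whitespace-only pages."""
        self.text = [p for p in self._pages if p.strip() or not remove_blank]


def parse_text(parser: FakeParser, **kwargs) -> bool:
    """Forward keyword arguments to the extraction method."""
    parser.extract_text(**kwargs)
    # True only if the parser's text object is populated.
    return bool(parser.text)
```

Because the keyword arguments are forwarded untouched, any option the extraction method accepts can be set by the caller without changing the wrapper.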

_set_parser()

Set the appropriate document parser.

Setting the parser creates a parser instance as an attribute of this class. When the parser instance is created, various file verification checks are made. For detail, refer to the following parser method: