Module: loaders/chromapptxloader.py
- Purpose:
This module provides the entry point for loading PPTX files into a Chroma database.
- Platform:
Linux/Windows | Python 3.10+
- Developer:
J Berendt
- Email:
- Comments:
n/a
- Examples:
Parse and load a single PPTX file into a Chroma database collection:
>>> from docp.loaders import ChromaPPTXLoader
>>> l = ChromaPPTXLoader(dbpath='/path/to/chroma', collection='spam', split_text=False)
>>> l.load(path='/path/to/directory/myfile.pptx')
Parse and load a directory of PPTX files into a Chroma database collection:
>>> from docp.loaders import ChromaPPTXLoader
>>> l = ChromaPPTXLoader(dbpath='/path/to/chroma', collection='spam', split_text=False)
>>> l.load(path='/path/to/directory', ext='pptx')
For further usage examples, please refer to the ChromaPPTXLoader class docstring.
- class ChromaPPTXLoader(dbpath: str | ChromaDB, collection: str = None, *, split_text: bool = True, load_keywords: bool = False, llm: object = None, offline: bool = False)[source]
Bases:
_ChromaBasePPTXLoader
Chroma database PPTX-specific document loader.
- Parameters:
dbpath (str | ChromaDB) – Either the full path to the Chroma database directory, or an instance of a ChromaDB class. If an instance is passed, the collection argument is ignored.
collection (str, optional) – Name of the Chroma database collection. Only required if the dbpath parameter is a path. Defaults to None.
split_text (bool, optional) – Split the document into chunks before loading it into the database. Defaults to True.
load_keywords (bool, optional) – Derive keywords from the document and load these into the sister keywords collection. Defaults to False.
llm (object, optional) – If deriving keywords, this is the LLM which will do the derivation. Defaults to None.
offline (bool, optional) – Remain offline and use the locally cached embedding function model. Defaults to False.
Important
The deriving and loading of keywords is only recommended for GPU-bound processing, as the LLM is invoked to infer the keywords for each given document.
If called on a ‘standard’ PC, this will take a long time to complete, if it completes at all.
Tip
It is recommended to pass split_text=False into the ChromaPPTXLoader constructor.
Often, PowerPoint presentations are structured such that related text is found in the same ‘shape’ (textbox) on a slide. Splitting the text in these shapes may have undesired results.
- Examples:
Parse and load a single PPTX file into a Chroma database collection:
>>> from docp.loaders import ChromaPPTXLoader
>>> l = ChromaPPTXLoader(dbpath='/path/to/chroma', collection='spam', split_text=False)  # <-- Note this
>>> l.load(path='/path/to/directory/myfile.pptx')
Parse and load a directory of PPTX files into a Chroma database collection:
>>> from docp.loaders import ChromaPPTXLoader
>>> l = ChromaPPTXLoader(dbpath='/path/to/chroma', collection='spam', split_text=False)  # <-- Note this
>>> l.load(path='/path/to/directory', ext='pptx')
- load(path: str, *, ext: str = '**', recursive: bool = True, remove_newlines: bool = True, convert_to_ascii: bool = True, **unused) None[source]
Load a PPTX file (or files) into a Chroma database.
- Parameters:
path (str) – Full path to the file (or directory) to be parsed and loaded. Note: If this is a directory, a specific file extension can be passed into the load() method using the ext argument.
ext (str, optional) – If the path argument refers to a directory, a specific file extension can be specified here. For example: ext = 'pptx'.
If anything other than '**' is provided, all alpha-characters are parsed from the string and prefixed with '*.'. That is, if '.pptx' is passed, the characters 'pptx' are parsed and prefixed with '*.' to create '*.pptx'. However, if 'things.foo' is passed, the derived extension will be '*.thingsfoo'. Defaults to '**', for a recursive search.
recursive (bool, optional) – If True, subdirectories are searched. Defaults to True.
remove_newlines (bool, optional) – Replace newline characters with a space. Defaults to True, as this helps with document chunk splitting.
convert_to_ascii (bool, optional) – Convert all characters to ASCII. Defaults to True.
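The ext handling described above can be sketched as follows. This is a minimal illustration, not the docp implementation: the helper name derive_glob_pattern is hypothetical, and it assumes 'alpha-characters' means ASCII letters only.

```python
import re


def derive_glob_pattern(ext: str = '**') -> str:
    """Hypothetical helper mirroring the documented ``ext`` behaviour."""
    # The default value is passed through untouched (recursive search).
    if ext == '**':
        return ext
    # Assumption: only ASCII letters count as 'alpha-characters'.
    letters = re.sub(r'[^A-Za-z]', '', ext)
    # Prefix the surviving letters with '*.' to form a glob pattern.
    return '*.' + letters


print(derive_glob_pattern('.pptx'))       # *.pptx
print(derive_glob_pattern('things.foo'))  # *.thingsfoo
print(derive_glob_pattern())              # **
```

The second call shows why it is safest to pass a bare extension such as 'pptx': any dots or other non-letter characters are dropped rather than treated as separators.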
- Keyword Args:
- unused (dict): This enables keywords such as remove_header and remove_footer (for example) to be passed into a loader-agnostic .load() function without raising an ‘unexpected keyword argument’ TypeError.
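To illustrate why the catch-all keyword arguments are useful, here is a small sketch using two hypothetical stand-in functions (load_pptx and load_strict are not part of docp). It shows the same options dictionary being accepted by a signature with **unused and rejected by one without it.

```python
def load_pptx(path: str, *, ext: str = '**', **unused) -> str:
    """Hypothetical PPTX-style loader: extra keywords are silently absorbed."""
    return f'pptx:{path}:{ext}'


def load_strict(path: str, *, ext: str = '**') -> str:
    """Hypothetical loader without a catch-all: extra keywords are rejected."""
    return f'strict:{path}:{ext}'


# Options written once by loader-agnostic calling code.
options = {'ext': 'pptx', 'remove_header': True, 'remove_footer': True}

# Accepted: remove_header and remove_footer land in ``unused``.
result = load_pptx('/path/to/directory', **options)

# Rejected: the strict signature raises a TypeError.
try:
    load_strict('/path/to/directory', **options)
    error = None
except TypeError as err:
    error = str(err)

print(result)
print(error)
```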
- static _set_kwargs(locals_: dict) dict[source]
Prepare the arguments which are sent to the doc parser.
As locals() is used to capture the load() method’s arguments for passing into the doc parser, some arguments must be removed first.
- Parameters:
locals_ (dict) – The return value from a locals() call.
- Returns:
A copy of the provided dictionary with specific key/value pairs removed.
- Return type:
dict
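The locals()-capture pattern described for _set_kwargs might look roughly like the sketch below. DemoLoader is a hypothetical stand-in, and the set of keys removed ('self', 'path', 'unused') is an assumption made for illustration; the source does not state which key/value pairs docp removes.

```python
class DemoLoader:
    """Hypothetical stand-in class; not the docp implementation."""

    @staticmethod
    def _set_kwargs(locals_: dict) -> dict:
        # Work on a copy so the caller's dictionary is left unchanged.
        kwargs = locals_.copy()
        # Assumed keys to drop; the real list may differ.
        for key in ('self', 'path', 'unused'):
            kwargs.pop(key, None)
        return kwargs

    def load(self, path: str, *, ext: str = '**', recursive: bool = True, **unused) -> dict:
        # Capture every argument of this method in one dictionary.
        captured = locals()
        # Return what would be forwarded to the document parser.
        return self._set_kwargs(captured)


forwarded = DemoLoader().load('/path/to/directory', ext='pptx', remove_header=True)
print(forwarded)
```

Capturing locals() as the first statement of the method matters in this pattern, since any local variable defined earlier would also end up in the dictionary.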