docp Library Documentation
Overview
In its simplest form, the docp project is a (doc)ument (p)arsing library.
Written in CPython, the project wraps various lower-level libraries, helping to consolidate binary document structure parsing functionality into a single library. Additional functionality includes document loaders which load a parsed document’s embeddings into a Chroma vector database, for RAG-enabled LLM use.
The Toolset
Parsers
As of this release, parsers for the following binary document types are supported:
PDF
MS PowerPoint (PPTX)
more coming soon …
Loaders
In addition to document parsing, document loading functionality is built in. Specifically, documents can be loaded into a Chroma vector database for RAG-enabled LLM ingestion.
For example, you may wish to load a series of PDF files into a vector
database which serves as the backend for a RAG-enabled LLM chatbot. The
ChromaPDFLoader and ChromaPPTXLoader classes have been specifically
designed for this. A single call to the class’ loader method results in
file retrieval, parsing, splitting, embedding and storage.
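As a rough illustration of the stages named above, the following sketch chains them together using stand-in callables. It is not docp's implementation: `split_text`, `load_documents` and the `parse`, `embed` and `store` parameters are hypothetical names chosen for this example, and the fixed-size character splitter is only one of many possible splitting strategies.

```python
def split_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks which overlap slightly."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    if not 0 <= overlap < chunk_size:
        raise ValueError('overlap must be in the range [0, chunk_size)')
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def load_documents(paths, parse, embed, store, chunk_size=500, overlap=50):
    """Run each file through parse -> split -> embed -> store.

    The parse, embed and store arguments are caller-supplied callables
    (hypothetical stand-ins for a parser, an embedding model and a
    vector database). Returns the total number of chunks stored.
    """
    total = 0
    for path in paths:                                   # file retrieval
        text = parse(path)                               # parsing
        chunks = split_text(text, chunk_size, overlap)   # splitting
        vectors = [embed(chunk) for chunk in chunks]     # embedding
        store(path, chunks, vectors)                     # storage
        total += len(chunks)
    return total
```

In real use, the loader classes perform all of these steps behind a single method call; the sketch only makes the order of operations explicit.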
For further detail and usage examples, please refer to the documentation for the loader modules.
Installation
The easiest way to install docp is using pip after activating
your virtual environment:
pip install docp
Additional (older) releases can be found either at PyPI or in GitHub Releases.
Note
To keep the installation dependencies to a minimum, only core libraries are required for installation. This means that the parser-specific and loader libraries are not installed automatically.
If a parser is imported and a library is required but not installed, you’ll be notified with an easy-to-read message, listing the required dependenc(y|ies).
The rationale behind this design decision is that not all users will
need the document loading capability, so torch, langchain,
etc. should not be installed automatically. For example, if your project
requires a simple PDF parser, you don’t need to (and likely don’t want to)
‘clutter’ your environment with something as heavy as torch.
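The missing-dependency notification described above can be approximated with the standard library's `importlib.util.find_spec`. This is a minimal sketch only; the helper names `missing_dependencies` and `require` are hypothetical, and docp's own check and message format may differ.

```python
import importlib.util


def missing_dependencies(required):
    """Return the top-level module names in `required` which are not importable."""
    return [name for name in required if importlib.util.find_spec(name) is None]


def require(required):
    """Raise an ImportError listing any missing dependencies."""
    missing = missing_dependencies(required)
    if missing:
        noun = 'dependency' if len(missing) == 1 else 'dependencies'
        raise ImportError(
            f"Missing required {noun}: {', '.join(missing)}. "
            f"Install with: pip install {' '.join(missing)}"
        )
```

A check of this kind lets the heavy packages stay optional: they are looked up only when a feature that needs them is actually used.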
Using the Library
This documentation suite contains detailed explanations and example usage for each of the library’s importable modules. For detailed documentation, usage examples and links to the source code itself, please refer to the Library API Documentation page.
If there is a specific module or method which you cannot find, a search field is built into the navigation bar to the left.
Quickstart
For convenience, here are a couple of examples of how to parse the supported document types.
Extract text from a PDF file:
>>> from docp import PDFParser
>>> pdf = PDFParser(path='/path/to/myfile.pdf')
>>> pdf.extract_text()
# Access the content of page 1.
>>> pg1 = pdf.doc.pages[1].content
Extracting text from a PowerPoint presentation is very similar:
>>> from docp import PPTXParser
>>> pptx = PPTXParser(path='/path/to/myfile.pptx')
>>> pptx.extract_text()
# Access the text on slide 1.
>>> pg1 = pptx.doc.slides[1].content
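Because both parsers shown above are constructed with a `path` argument and expose `extract_text()`, a small dispatch on file extension can select the appropriate class. This is a sketch under assumptions: `parser_for` and its `registry` argument are hypothetical helpers written for this example and are not part of docp.

```python
from pathlib import Path


def parser_for(path, registry):
    """Return the parser class registered for the file extension of `path`.

    `registry` maps lower-case extensions (e.g. '.pdf') to parser classes.
    """
    ext = Path(path).suffix.lower()
    if ext not in registry:
        supported = ', '.join(sorted(registry))
        raise ValueError(f'Unsupported file type {ext!r}; supported: {supported}')
    return registry[ext]
```

With docp installed, the registry could be `{'.pdf': PDFParser, '.pptx': PPTXParser}`, in which case `parser_for(path, registry)(path=path)` builds the matching parser.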
Troubleshooting
No guidance at this time.
If you have any questions that are not covered by this documentation, or if you spot any bugs, issues or have any recommendations, please feel free to contact us or raise an issue on GitHub.
Footnotes
Last updated: 12 Feb 2025