Module: parsers/pdfparser.py

Purpose:

This module serves as the public interface for interacting with PDF files and parsing their contents.

Platform:

Linux/Windows | Python 3.10+

Developer:

J Berendt

Email:

development@s3dev.uk

Comments:

n/a

Example:

For usage examples, refer to the PDFParser class docstring.

class PDFParser(path: str)

Bases: _PDFTableParser, _PDFTextParser

PDF document parser.

Parameters:

path (str) – Full path to the PDF document to be parsed.

Example:

Extract text from a PDF file:

>>> from docp import PDFParser

>>> pdf = PDFParser(path='/path/to/myfile.pdf')
>>> pdf.extract_text()

# Access the content of page 1.
>>> pg1 = pdf.doc.pages[1].content
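
The text-extraction steps above can be wrapped in a small helper that checks the path before the parser is constructed. This is only a sketch: `read_first_page` is a hypothetical name and not part of docp, and the early `FileNotFoundError` is an assumption made here because the parser's own handling of a missing file is not described. The only docp calls used are those shown in the example above.

```python
from pathlib import Path


def read_first_page(path: str):
    """Return the content of page 1 of the PDF at ``path``.

    Hypothetical helper for illustration; not part of docp.
    """
    pdf_path = Path(path)
    # Assumption: fail early with a clear message, since the parser's
    # behaviour for a missing file is not documented here.
    if not pdf_path.is_file():
        raise FileNotFoundError(f'PDF not found: {pdf_path}')
    # Imported lazily so the path check runs even without docp installed.
    from docp import PDFParser
    pdf = PDFParser(path=str(pdf_path))
    pdf.extract_text()
    # Page access mirrors the docstring example: pages[1] is page 1.
    return pdf.doc.pages[1].content
```

Calling `read_first_page('/path/to/myfile.pdf')` would then return the same value as `pg1` in the example above, provided the file exists.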

Extract tables from a PDF file:

>>> from docp import PDFParser

>>> pdf = PDFParser('/path/to/myfile.pdf')
>>> pdf.extract_tables()

# Access the first table on page 1.
>>> tbl1 = pdf.doc.pages[1].tables[1]