Module: objects/pdfobject.py
- Purpose:
This module provides the ‘PDF Document’ object structure into which PDF documents are parsed for transport and onward use.
- Platform:
Linux/Windows | Python 3.10+
- Developer:
J Berendt
- Email:
- Comments:
n/a
- class DocPDF
Bases: _DocBase
Container class for storing data parsed from a PDF file.
- property pages: list[PageObject]
A list containing an object for each page in the document.
Tip
The page number index aligns to the page number in the PDF file.
For example, to access the PageObject for page 42, use: pages[42]
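The index alignment described in the tip can be illustrated with a minimal sketch. The classes `PageStub` and `DocStub` and the helper `build_doc` are hypothetical stand-ins, not part of this module, and the placeholder stored at index 0 is an assumption about how such alignment might be achieved; the source does not say how the real `DocPDF` does it.

```python
from dataclasses import dataclass, field


@dataclass
class PageStub:
    """Hypothetical stand-in for PageObject; holds only a page number."""
    pageno: int


@dataclass
class DocStub:
    """Hypothetical stand-in for DocPDF, exposing only a ``pages`` list."""
    _pages: list = field(default_factory=list)

    @property
    def pages(self) -> list:
        return self._pages


def build_doc(npages: int) -> DocStub:
    # Assumption: index 0 holds a placeholder so that index N maps to page N.
    pages = [PageStub(pageno=0)]
    pages += [PageStub(pageno=i) for i in range(1, npages + 1)]
    return DocStub(pages)


doc = build_doc(50)
page = doc.pages[42]   # No off-by-one adjustment needed for page 42.
print(page.pageno)
```

Under this assumption the list is one element longer than the page count, so code that iterates over every page may need to skip the first element.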
- property parsed_using_tags: bool
Flag indicating if the document was parsed using tags.
PDF documents can be created with ‘marked content’ tags. When a PDF document is parsed using tags, as this flag indicates, the parser respects columns and other page formatting schemes. If a multi-column page is parsed without tags, the parser reads straight across each line, interleaving text from adjacent columns and thus corrupting the text.
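The effect of reading a multi-column page without tags can be shown with a small, simplified model. This is only an illustration of the reading-order problem, not the parser used by this module; the `rows` data and both functions are hypothetical.

```python
# Each row is one physical line of a two-column page: (left column, right column).
rows = [
    ("The quick brown", "Pack my box with"),
    ("fox jumps over", "five dozen"),
    ("the lazy dog.", "liquor jugs."),
]


def read_across(rows):
    """Untagged reading: take each physical line from left to right."""
    return " ".join(f"{left} {right}" for left, right in rows)


def read_by_column(rows):
    """Column-aware reading: finish the left column, then the right column."""
    left = " ".join(l for l, _ in rows)
    right = " ".join(r for _, r in rows)
    return f"{left} {right}"


print(read_across(rows))
print(read_by_column(rows))
```

Reading across mixes the two sentences together line by line, whereas reading by column keeps each sentence intact. In practice, checking `parsed_using_tags` before relying on the extracted text is one way to detect which situation applies.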