Base (Private) Module: parsers/_pdftextparser.py

Purpose:

This module provides the logic for parsing text from a PDF document.

Platform:

Linux/Windows | Python 3.10+

Developer:

J Berendt

Email:

development@s3dev.uk

Attention

This module is not designed to be interacted with directly, only via the appropriate interface class(es).

Rather, please create an instance of a PDF document parsing object using the PDFParser class.

Note

Multi-processing

Text extraction through multi-processing has been tested and is not feasible, as it fails with an error indicating that the pdfplumber.page.Page object cannot be pickled. This object was passed into the extraction method because it provides the extract_text() function.
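The pickling constraint can be illustrated with a minimal sketch. Multi-processing pickles the arguments it sends to worker processes, so any object that cannot be pickled cannot be passed to a worker. FakePage and can_pickle are hypothetical names used only for illustration, and the open file handle is an assumed stand-in for whatever makes the real page object unpicklable; the actual cause inside pdfplumber is not stated here.

```python
import os
import pickle


class FakePage:
    """Hypothetical stand-in for a page object which holds an open stream."""

    def __init__(self, stream):
        self.stream = stream

    def extract_text(self) -> str:
        return "text"


def can_pickle(obj) -> bool:
    """Return True if obj can be pickled, as multi-processing requires."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


# An object holding an open file handle cannot be pickled.
with open(os.devnull, "rb") as stream:
    page_ok = can_pickle(FakePage(stream))

# Plain data, such as a path and a page index, can be pickled.
path_ok = can_pickle(("/path/to/myfile.pdf", 0))
```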

Additionally, multi-threading has been tested and was found to be too complex and inefficient. The test used the concurrent.futures.ThreadPoolExecutor class on two documents, of 14 and 92 pages; the timings are shown below. The multi-threaded approach was no faster for the small document, slower for the large document, and added unnecessary complexity to the code base. As a side effect, the pages are processed and stored out of order, which would require a re-ordering step, adding more complexity.

It has therefore been determined that this module will remain single-threaded.

Multi-Thread Timings

  • Single-threaded:

    • 14 page document: ~2 seconds

    • 92 page document: ~32 seconds

  • Multi-threaded:

    • 14 page document: ~2 seconds

    • 92 page document: ~35 seconds
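The out-of-order side effect mentioned in this note can be sketched as follows. This is an illustration only, not the code used for the timings above: extract_page and extract_all are hypothetical names, and the per-page work is replaced by a trivial stand-in. It shows the bookkeeping a thread pool needs to put the pages back in order.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_page(page_number: int) -> str:
    """Hypothetical stand-in for per-page text extraction."""
    return f"text of page {page_number}"


def extract_all(page_numbers: list[int], workers: int = 4) -> list[str]:
    """Extract pages in a thread pool, then restore the page order."""
    results: dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract_page, n): n for n in page_numbers}
        # Futures complete in arbitrary order, so key each result by page.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    # The extra re-ordering step which a single-threaded loop never needs.
    return [results[n] for n in sorted(results)]


pages = extract_all(list(range(1, 6)))
```

A plain single-threaded loop over the pages produces the same list with none of the future tracking or sorting.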

class _PDFTextParser(path: str)

Bases: _PDFBaseParser

Private PDF document text parser intermediate class.

Parameters:

path (str) – Full path to the PDF document.

Example:

Extract text from a PDF file:

>>> from docp import PDFParser

>>> pdf = PDFParser(path='/path/to/myfile.pdf')
>>> pdf.extract_text()

# Access the content of page 1.
>>> pg1 = pdf.doc.pages[1].content

extract_text(*, remove_header: bool = False, remove_footer: bool = False, remove_newlines: bool = False, ignore_tags: set = None, convert_to_ascii: bool = True, **kwargs)

Extract text from the document.

If the PDF document contains ‘marked content’ tags, these tags are used to extract the text as this is a more accurate approach and respects the structure of the page(s). Otherwise, a bounding box method is used to extract the text. If instructed, the header and/or footer regions can be excluded.

A list of pages, with their extracted content, can be accessed using the self.doc.pages attribute.

Parameters:
  • remove_header (bool, optional) – If True, the header is cropped (skipped) from text extraction. This only applies to the bounding box extraction method. Defaults to False.

  • remove_footer (bool, optional) – If True, the footer is cropped (skipped) from text extraction. This only applies to the bounding box extraction method. Defaults to False.

  • remove_newlines (bool, optional) – If True, the newline characters are replaced with a space. Defaults to False.

  • ignore_tags (set, optional) – If provided, these are the PDF ‘marked content’ tags which will be ignored. Note that the PDF document must contain tags, otherwise the bounding box method is used and this argument is ignored. Defaults to {'Artifact'}, as these generally relate to a header and/or footer. To include all tags (i.e. skip none), pass this argument as 'na'.

  • convert_to_ascii (bool, optional) – When a non-ASCII character is found, an attempt is made to convert it to an associated ASCII character. If a character cannot be converted, it is replaced with a '?'. Defaults to True.

Keyword Args:
  • None

Returns:

None.
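The convert_to_ascii behaviour described above can be approximated with a minimal sketch. The function name to_ascii and the use of Unicode NFKD decomposition are assumptions made for illustration; the conversion table actually used by this module is not documented here and may map more characters than this sketch does.

```python
import unicodedata


def to_ascii(text: str, placeholder: str = "?") -> str:
    """Map non-ASCII characters to an ASCII equivalent, else a placeholder."""
    out = []
    for char in text:
        if char.isascii():
            out.append(char)
            continue
        # Decompose the character (e.g. an accented letter becomes the base
        # letter plus a combining mark) and keep only the ASCII part.
        decomposed = unicodedata.normalize("NFKD", char)
        ascii_part = "".join(c for c in decomposed if c.isascii())
        # Characters with no ASCII part are replaced with the placeholder.
        out.append(ascii_part or placeholder)
    return "".join(out)


converted = to_ascii("caf\u00e9 \u20ac5")
```

Here the accented letter is reduced to its base letter, while the euro sign has no decomposition and is replaced with '?'.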

_extract_text_using_bbox(**kwargs)

Extract text using a bounding box (bbox) to locate the header and footer.

Keyword Arguments:

Those passed by the caller, extract_text().

_extract_text_using_tags(**kwargs)

Extract text using tags.

The tags defined by the ignore_tags argument are skipped.

Keyword Arguments:

Those passed by the caller, extract_text().

static _text_from_tags(page: pdfplumber.page.Page, ignored: set) → str

Generate a page of text extracted from tags.

When extracting text from tags, newlines are not encoded and must be derived. For each character on the page, the top and bottom coordinates are compared with those of the previous character to determine when a newline should be inserted. If both the top and bottom of the current character are greater than those of the previous character, a newline is inserted into the text stream.

Parameters:
  • page (pdfplumber.page.Page) – Page to be parsed.

  • ignored (set) – A set containing the tags to be ignored.

Yields:

str – Each character on the page, provided its tag is not to be ignored. Or, a newline character if the current character’s coordinates are greater than (lower on the page than) those of the previous character.
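The newline derivation described above can be sketched with plain dictionaries standing in for the page's characters. The function name text_from_chars and the dictionary keys ('text', 'top', 'bottom', 'tag') are assumptions made for illustration, as is the choice to exclude ignored characters from the position comparison; this is not the method's actual source.

```python
from collections.abc import Iterable, Iterator


def text_from_chars(chars: Iterable[dict], ignored: set) -> Iterator[str]:
    """Yield each non-ignored character, inserting derived newlines."""
    prev = None
    for char in chars:
        if char["tag"] in ignored:
            continue  # Skip characters whose tag is to be ignored.
        # A character whose top and bottom are both greater than those of
        # the previous character sits lower on the page: start a new line.
        if (prev is not None
                and char["top"] > prev["top"]
                and char["bottom"] > prev["bottom"]):
            yield "\n"
        yield char["text"]
        prev = char


sample = [
    {"text": "X", "top": 2, "bottom": 8, "tag": "Artifact"},
    {"text": "a", "top": 10, "bottom": 20, "tag": "P"},
    {"text": "b", "top": 10, "bottom": 20, "tag": "P"},
    {"text": "c", "top": 30, "bottom": 40, "tag": "P"},
]
page_text = "".join(text_from_chars(sample, ignored={"Artifact"}))
```

The 'Artifact' character is dropped, and a newline is derived between 'b' and 'c' because 'c' sits lower on the page.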

_uses_marked_content() → bool

Test whether the document can be parsed using tags.

Marked content allows us to parse the PDF using tags (rather than OCR), which is more accurate not only in terms of character recognition but also with regard to the structure of the text on a page.

Logic:

If the document’s catalog shows Marked: True, then True is returned immediately.

Otherwise, a second attempt is made which detects marked content tags on the first three pages. If no tags are found, a third attempt is made by searching the first 10 pages. If tags are found during either of these attempts, True is returned immediately.

Finally, if no marked content or tags were found, False is returned.

Returns:

Returns True if the document can be parsed using marked content tags, otherwise False.

Return type:

bool
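The staged detection logic described above can be sketched as follows. The function name uses_marked_content, the marked flag and the per-page sets of tags are hypothetical stand-ins for the document catalog and the tags found on each page; how the real method reads these from the PDF is not shown here.

```python
def uses_marked_content(marked: bool, page_tags: list[set[str]]) -> bool:
    """Return True if the document can be parsed using marked content tags."""
    # First attempt: the catalog declares the document as marked.
    if marked:
        return True
    # Second and third attempts: look for tags on the first three pages,
    # then widen the search to the first ten pages.
    for limit in (3, 10):
        if any(page_tags[:limit]):
            return True
    # No marked content or tags were found.
    return False


# Tags first appear on page 5, so only the ten-page search finds them.
late_tags = [set()] * 4 + [{"P"}] + [set()] * 5
found = uses_marked_content(False, late_tags)
```

The cheap catalog check runs first, and the wider page scan is only paid for when the narrower one finds nothing.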