Base (Private) Module: parsers/_pptxtextparser.py

Purpose:

This module provides the logic for parsing text from a PPTX document.

Platform:

Linux/Windows | Python 3.10+

Developer:

J Berendt

Email:

development@s3dev.uk

Attention

This module is not designed to be interacted with directly, only via the appropriate interface class(es).

Rather, please create an instance of a PPTX document parsing object using the PPTXParser class, as shown in the example below.

class _PPTXTextParser(path: str)

Bases: _PPTXBaseParser

Private PPTX document text parser intermediate class.

Parameters:

path (str) – Full path to the PPTX document.

Example:

Extract text from a PPTX file:

>>> from docp import PPTXParser

>>> pptx = PPTXParser(path='/path/to/myfile.pptx')
>>> pptx.extract_text()

# Access the text on slide 1.
>>> pg1 = pptx.doc.slides[1].content

extract_text(*, remove_newlines: bool = False, convert_to_ascii: bool = True, **kwargs) → None

Extract text from the document.

A list of slides, with their extracted content, can be accessed through the self.doc.slides attribute.

Parameters:
  • remove_newlines (bool, optional) – If True, each newline character is replaced with a space. Defaults to False.

  • convert_to_ascii (bool, optional) – When a non-ASCII character is found, an attempt is made to convert it to an associated ASCII character. If a character cannot be converted, it is replaced with a '?'. Defaults to True.

Keyword Args:
  • None

Returns:

None.
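
The effect of the two parameters described above can be illustrated with a minimal sketch. This is not docp's implementation: normalise_text is a hypothetical helper, and the use of Unicode NFKD decomposition to find an "associated ASCII character" is an assumption made for illustration only.

```python
import unicodedata


def normalise_text(text: str, *,
                   remove_newlines: bool = False,
                   convert_to_ascii: bool = True) -> str:
    """Hypothetical helper (not part of docp) mimicking the documented options."""
    if remove_newlines:
        # Replace each newline character with a single space.
        text = text.replace('\n', ' ')
    if convert_to_ascii:
        # Assumption: decompose accented characters into base + combining mark.
        decomposed = unicodedata.normalize('NFKD', text)
        # Drop the combining marks left behind by the decomposition.
        stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
        # Any character still outside ASCII is replaced with '?'.
        text = stripped.encode('ascii', errors='replace').decode('ascii')
    return text
```

With this sketch, normalise_text('café') returns 'cafe', while characters with no ASCII counterpart, such as those in '日本', each become '?'.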

_extract_text(remove_newlines: bool, convert_to_ascii: bool) → None

Extract the text from all shapes on all slides.

Parameters:
  • remove_newlines (bool) – Replace the newline characters with a space.

  • convert_to_ascii (bool) – Attempt to convert any non-ASCII characters to their ASCII equivalent.

The text extracted from each slide is stored as a TextObject which is appended to the slide’s texts attribute.
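
The flow described above (visit every shape on every slide, then store the result on the slide's texts attribute) might look roughly like the following sketch. The Shape, Slide and TextObject classes here are hypothetical stand-ins written for this example, not docp's or python-pptx's actual classes, and joining the shape texts with a newline is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextObject:
    """Hypothetical container for the text extracted from one slide."""
    content: str


@dataclass
class Shape:
    """Hypothetical stand-in for a slide shape; text is None for non-text shapes."""
    text: Optional[str] = None


@dataclass
class Slide:
    """Hypothetical stand-in for a slide holding shapes and extracted texts."""
    shapes: List[Shape] = field(default_factory=list)
    texts: List[TextObject] = field(default_factory=list)


def extract_all(slides: List[Slide]) -> None:
    """Append one TextObject per slide, built from the slide's text shapes."""
    for slide in slides:
        # Keep only the shapes which actually carry text.
        parts = [shape.text for shape in slide.shapes if shape.text]
        # Assumption: the texts from a slide's shapes are joined with newlines.
        slide.texts.append(TextObject(content='\n'.join(parts)))
```

As with the documented method, the sketch returns None and the results are read back from each slide afterwards.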