Base (Private) Module: parsers/_pptxtextparser.py
- Purpose:
This module provides the logic for parsing text from a PPTX document.
- Platform:
Linux/Windows | Python 3.10+
- Developer:
J Berendt
- Email:
Attention
This module is not designed to be interacted with directly, only via the appropriate interface class(es).
Rather, please create an instance of a PPTX document parsing object using the following:
- class _PPTXTextParser(path: str)
Bases:
_PPTXBaseParser
Private PPTX document text parser intermediate class.
- Parameters:
path (str) – Full path to the PPTX document.
- Example:
Extract text from a PPTX file:
>>> from docp import PPTXParser
>>> pptx = PPTXParser(path='/path/to/myfile.pptx')
>>> pptx.extract_text()
>>> # Access the text on slide 1.
>>> pg1 = pptx.doc.slides[1].content
- extract_text(*, remove_newlines: bool = False, convert_to_ascii: bool = True, **kwargs) -> None
Extract text from the document.
A list of slides, with extracted content, can be accessed using the
self.doc.slides attribute.
- Parameters:
remove_newlines (bool, optional) – If True, the newline characters are replaced with a space. Defaults to False.
convert_to_ascii (bool, optional) – When a non-ASCII character is found, an attempt is made to convert it to an associated ASCII character. If a character cannot be converted, it is replaced with a
'?'. Defaults to True.
- Keyword Args:
None
- Returns:
None.
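The effect of the two flags can be illustrated with a minimal, standalone sketch. This is not docp's implementation: the helper names clean_text and _to_ascii are hypothetical, and the use of Unicode NFKD decomposition to find an "associated ASCII character" is an assumption made for illustration only.

```python
import unicodedata


def _to_ascii(text: str) -> str:
    """Hypothetical helper: map each non-ASCII character to an ASCII stand-in."""
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
            continue
        # Assumption: decompose the character (e.g. 'é' -> 'e' + accent)
        # and keep only the ASCII part of the decomposition.
        decomposed = unicodedata.normalize('NFKD', ch)
        ascii_part = decomposed.encode('ascii', 'ignore').decode('ascii')
        # If nothing ASCII survives, fall back to '?', as documented above.
        out.append(ascii_part or '?')
    return ''.join(out)


def clean_text(text: str,
               *,
               remove_newlines: bool = False,
               convert_to_ascii: bool = True) -> str:
    """Hypothetical helper mirroring the documented flag behaviour."""
    if remove_newlines:
        # Each newline character is replaced with a single space.
        text = text.replace('\n', ' ')
    if convert_to_ascii:
        text = _to_ascii(text)
    return text
```

With the defaults, newlines are kept and non-ASCII characters are converted; passing remove_newlines=True additionally joins the lines with spaces.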
- _extract_text(remove_newlines: bool, convert_to_ascii: bool) -> None
Extract the text from all shapes on all slides.
- Parameters:
remove_newlines (bool) – Replace the newline characters with a space.
convert_to_ascii (bool) – Attempt to convert any non-ASCII characters to their ASCII equivalent.
The text extracted from each slide is stored as a
TextObject which is appended to the slide’s texts attribute.
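The storage pattern described above might look roughly like the following sketch. The TextObject and Slide classes here are simplified stand-ins rather than docp's actual classes: the content and pageno fields, the collect_slide_texts function, the choice of one TextObject per shape, and the use of plain strings in place of real PPTX shapes are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TextObject:
    """Hypothetical stand-in for docp's TextObject."""
    content: str


@dataclass
class Slide:
    """Hypothetical stand-in for a parsed slide."""
    pageno: int
    texts: list[TextObject] = field(default_factory=list)


def collect_slide_texts(raw_slides: list[list[str]]) -> list[Slide]:
    """Walk all shapes on all slides and store any text found.

    Each inner list holds the text of the shapes on one slide
    (an assumption standing in for real PPTX shape objects).
    """
    slides = []
    for pageno, shape_texts in enumerate(raw_slides, start=1):
        slide = Slide(pageno=pageno)
        for text in shape_texts:
            # Skip shapes that carry no text.
            if text.strip():
                slide.texts.append(TextObject(content=text))
        slides.append(slide)
    return slides
```

For example, collect_slide_texts([['Title', 'Body'], ['', 'Only']]) yields two Slide objects, the first holding two TextObject entries and the second holding one.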