Base (Private) Module: parsers/_pdftableparser.py

Purpose:

This module provides the logic for parsing tables from a PDF document.

Platform:

Linux

Developer:

J Berendt

Email:

jeremy.berendt@rolls-royce.com

Attention

This module is not designed to be interacted with directly, only via the appropriate interface class(es).

Rather, please create an instance of a PDF document parsing object using the PDFParser class, as shown in the example below.

class _PDFTableParser(path: str)[source]

Bases: _PDFBaseParser

Private PDF document table parser intermediate class.

Parameters:

path (str) – Full path to the PDF document.

Example:

Extract tables from a PDF file:

>>> from docp import PDFParser

>>> pdf = PDFParser(path='/path/to/myfile.pdf')
>>> pdf.extract_tables()

>>> tables = pdf.doc.tables

extract_tables(table_settings: dict = None, as_dataframe: bool = False, to_csv: bool = True, verbose: bool = False)[source]

Extract tables from the document.

Before a table is extracted, a number of validation tests are performed to verify that what has been identified as a ‘table’ is actually a table that is likely to be useful to the user.

If to_csv is True, each ‘valid’ table is written as a CSV file on the user’s desktop.

Additionally, the extracted table data is stored to the class’ self.tables attribute.

Parameters:
  • table_settings (dict, optional) – Table settings to be used for the table extraction. Defaults to None, which is replaced by the value in the config.

  • as_dataframe (bool, optional) – By default, the extracted tables are returned as a list of tables, where each table is a list of rows and each row is a list of cell values, for example: all_tables[table[rows[data]]]. If this argument is True, the table data is instead returned as a list of pandas.DataFrame objects. In this case, the first row of each table is used as the header, and all remaining rows are treated as data. Note: This will not work properly for all tables. Defaults to False.

  • to_csv (bool, optional) – Dump extracted table data to a CSV file, one per table. Defaults to True.

  • verbose (bool, optional) – Display how many tables were extracted, and the path to their location. Defaults to False.
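
The nested list structure and the header/data split described for as_dataframe can be illustrated without pandas. This is a minimal sketch only: the sample data and the split_header helper are hypothetical and are not part of this module.

```python
# Hypothetical data in the nested shape described above:
# all_tables -> table -> row -> cell value
all_tables = [
    [["Part", "Qty"], ["Bolt", "4"], ["Nut", "8"]],
    [["Name", "Mass (kg)"], ["Casing", "12.5"]],
]


def split_header(table):
    """Return (header, rows), treating the first row as the header."""
    header, *rows = table
    return header, rows


for table in all_tables:
    header, rows = split_header(table)
    # Pair each data row with the header, as a DataFrame would label columns.
    records = [dict(zip(header, row)) for row in rows]
    print(records)
```

A table whose first row is not a true header (for example, a table continued from a previous page) would be mislabelled by this split, which is likely the kind of case the note above warns about.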

_create_table_directory_path()[source]

Create the output directory for table data.

If the directory does not exist, it is created.

_create_table_file_path(pageno: int, tblno: int) → str[source]

Create the filename for the table.

Parameters:
  • pageno (int) – Page from which the table was extracted.

  • tblno (int) – Number of the table on the page, starting at 1.

Returns:

Explicit path to the file to be written.

Return type:

str
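
The directory and file path helpers above might be sketched as follows. The function name, its extra arguments, and the file naming pattern are all assumptions made for illustration; the module's actual naming scheme is not documented here.

```python
import os
import tempfile


def create_table_file_path(outdir: str, stem: str, pageno: int, tblno: int) -> str:
    """Build an explicit CSV path (the naming pattern is an assumption)."""
    # Create the output directory; no error is raised if it already exists.
    os.makedirs(outdir, exist_ok=True)
    fname = f"{stem}_page-{pageno:03d}_table-{tblno:02d}.csv"
    return os.path.join(outdir, fname)


with tempfile.TemporaryDirectory() as tmp:
    outdir = os.path.join(tmp, "tables")
    path = create_table_file_path(outdir, "myfile", pageno=3, tblno=1)
    created = os.path.isdir(outdir)
    basename = os.path.basename(path)

print(basename)  # myfile_page-003_table-01.csv
```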

static _filter_tables(tables: list, threshold: int = 5000) → list[source]

Remove tables from the passed list which are deemed invalid.

Parameters:
  • tables (list) – A list of tables as detected by the Page.find_table() method.

  • threshold (int, optional) – Minimum pixel area for a detected table to be returned. Defaults to 5000.

Rationale:

An ‘invalid’ table is determined by the number of pixels which the table covers. Any table whose area is smaller than threshold pixels is likely a block of text which has been categorised as a ‘table’, but is not.

Returns:

A list of tables whose pixel area is greater than threshold.

Return type:

list
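
The area filter described above can be sketched as follows. FakeTable is a hypothetical stand-in for a detected table object, and the assumption that each table exposes a bbox of (x0, top, x1, bottom) is not confirmed by this page.

```python
from dataclasses import dataclass


@dataclass
class FakeTable:
    """Hypothetical detected table; bbox is (x0, top, x1, bottom)."""
    bbox: tuple


def table_area(table) -> float:
    """Return the area covered by the table's bounding box."""
    x0, top, x1, bottom = table.bbox
    return (x1 - x0) * (bottom - top)


def filter_tables(tables: list, threshold: int = 5000) -> list:
    """Keep only the tables whose area is greater than the threshold."""
    return [t for t in tables if table_area(t) > threshold]


tables = [
    FakeTable((0, 0, 50, 20)),    # area 1000: likely a stray text block
    FakeTable((0, 0, 200, 100)),  # area 20000: kept
]
kept = filter_tables(tables)
print(len(kept))  # 1
```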

Verify a table is not a header or footer.

Parameters:

table (list[list]) – Table (a list of lists) to be analysed.

Rationale:

A table is determined to be a header or footer if any of the lines contained in the ‘common lines list’ is found in the table.

If any of these lines is found, the table is determined to be a header/footer and True is returned.

Returns:

False if the table is not a header/footer, otherwise True.

Return type:

bool
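
The header/footer check described above could look roughly like this. The function name and the matching rule (an exact match against a stripped cell value) are assumptions; the module may compare lines differently.

```python
def is_header_footer(table, common_lines) -> bool:
    """Return True if any common line appears as a cell in the table."""
    # Flatten the table into a set of stripped cell values, skipping
    # empty cells (None or '').
    cells = {cell.strip() for row in table for cell in row if cell}
    return any(line in cells for line in common_lines)


# Hypothetical list of lines which repeat on every page.
common_lines = ["Page 1 of 9", "Company Confidential"]

print(is_header_footer([["Company Confidential", None]], common_lines))  # True
print(is_header_footer([["Part", "Qty"], ["Bolt", "4"]], common_lines))  # False
```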

_to_buffer(data: list[list]) → StringIO[source]

Write the table data into a string buffer.

Parameters:

data (list[list]) – The table data as a list of lists to be written to a buffer.

Returns:

A string buffer as an io.StringIO object.

Return type:

io.StringIO
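
Writing a list of lists into a string buffer can be done with the standard library alone. The sketch below assumes CSV formatting via csv.writer; the function name and the choice of line terminator are illustrative rather than taken from the module.

```python
import csv
import io


def to_buffer(data):
    """Write the rows of a table into an in-memory CSV buffer."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(data)
    buffer.seek(0)  # rewind so the next reader starts at the beginning
    return buffer


buf = to_buffer([["Part", "Qty"], ["Bolt", "4"]])
print(buf.getvalue())
```

Rewinding the buffer before returning it means it can be handed straight to a CSV reader or written to disk without the caller having to reposition it.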

_to_csv(buffer: StringIO, pageno: int, tableno: int) → int[source]

Write a table (from the buffer) to CSV.

Parameters:
  • buffer (io.StringIO) – A pre-processed StringIO object containing table data to be written.

  • pageno (int) – Page number from the Page object.

  • tableno (int) – Number of the table on the page, starting at 1.

Returns:

1 if the file was written, otherwise 0. This is used by the caller to track the number of CSV files written.

Return type:

int
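
The 1/0 return value lets the caller count written files by summing the results. In the sketch below, the function takes a ready-made path instead of pageno and tableno, and the rule that an empty buffer is skipped is an assumption; the module's actual condition for returning 0 is not stated here.

```python
import io
import os
import tempfile


def to_csv(buffer: io.StringIO, path: str) -> int:
    """Write the buffer to path; return 1 if written, otherwise 0."""
    text = buffer.getvalue()
    if not text.strip():
        return 0  # assumption: nothing to write, so no file is created
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return 1


with tempfile.TemporaryDirectory() as tmp:
    buffers = [io.StringIO("a,b\n1,2\n"), io.StringIO("")]
    # Sum the return values to track the number of CSV files written.
    written = sum(
        to_csv(buf, os.path.join(tmp, f"table-{i}.csv"))
        for i, buf in enumerate(buffers, 1)
    )

print(written)  # 1
```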

_to_df(buffer: StringIO)[source]

Write a table (from the buffer) to a DataFrame.

Once written, the DataFrame is appended to the self._doc._tables list.

Parameters:

buffer (io.StringIO) – A pre-processed StringIO object containing table data to be written.