PDFContentConverter 0.7
The PDF Content Converter is a tool for converting PDF text as well as structural features into a pandas dataframe, written natively in Python.
It retrieves information about textual content, fonts, positions, character frequencies and surrounding visual PDF elements.
How-to
Pass the path of the PDF file you want to convert to PDFContentConverter.
Call the function pdf2pandas(). The PDF content is then returned as a pandas dataframe.
The media boxes of a PDF can be accessed using get_media_boxes(), the page count using get_page_count() and the document text using pdf2text().
Using the convert() function, the pandas dataframe, textual document content, media boxes and page count are returned as a dictionary.
Example call:
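pdf = "example.pdf"  # placeholder: path to the PDF file to convert
converter = PDFContentConverter(pdf)
result = converter.pdf2pandas()  # pandas dataframe, one row per PDF element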
Output Format
The converted PDF data is returned as a pandas dataframe.
The different PDF elements are stored as rows.
The dataframe contains the following columns:
id: unique identifier of the PDF element
page: page number, starting with 0
text: text of the PDF element
x_0: left x coordinate
x_1: right x coordinate
y_0: top y coordinate
y_1: bottom y coordinate
pos_x: center x coordinate
pos_y: center y coordinate
abs_pos: tuple containing a page independent representation of (pos_x,pos_y) coordinates
original_font: font as extracted by pdfminer
font_name: name of the font extracted from original_font
code: font code as provided by pdfminer
bold: 1 if the text is bold, 0 otherwise
italic: 1 if the text is italic, 0 otherwise
font_size: size of the text in points
masked: text with numeric content replaced by #
frequency_hist: histogram of character type frequencies in a text, stored as a tuple containing percentages of textual, numerical, text symbolic and other symbols
len_text: number of characters
n_tokens: number of words
tag: tag for key-value pair extractions, indicating keys or values based on simple heuristics
box: box extracted by pdfminer Layout Analysis
in_element_ids: IDs of surrounding visual elements such as rectangles or lists, stored as a list [left, right, top, bottom]. A value of -1 indicates that there is no adjacent visual element.
in_element: indicates, based on in_element_ids, whether an element lies inside a visual rectangle (stored as “rectangle”) or not (stored as “none”).
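Several of the columns above are derived from the text and coordinates of an element. The following is a minimal sketch of how such values could be computed, not the package's actual implementation. The function names, the digit-by-digit masking rule, the set of "text symbols" and the use of fractions rather than percentages are all assumptions made for illustration.

```python
import re

# Assumed set of "text symbolic" characters; the package may use a different one.
TEXT_SYMBOLS = set(".,;:!?-'\"()")


def center(low: float, high: float) -> float:
    """Midpoint of two coordinates, e.g. pos_x from x_0 and x_1."""
    return (low + high) / 2.0


def mask_numbers(text: str) -> str:
    """Replace every digit with '#' (assumed masking rule)."""
    return re.sub(r"\d", "#", text)


def frequency_hist(text: str) -> tuple:
    """Fractions of letters, digits, text symbols and other characters."""
    if not text:
        return (0.0, 0.0, 0.0, 0.0)
    counts = [0, 0, 0, 0]
    for ch in text:
        if ch.isalpha():
            counts[0] += 1
        elif ch.isdigit():
            counts[1] += 1
        elif ch in TEXT_SYMBOLS:
            counts[2] += 1
        else:
            counts[3] += 1  # whitespace and everything else
    return tuple(c / len(text) for c in counts)


def describe_element(text: str, x_0: float, x_1: float,
                     y_0: float, y_1: float) -> dict:
    """Build a dict with a subset of the columns listed above."""
    return {
        "text": text,
        "pos_x": center(x_0, x_1),
        "pos_y": center(y_0, y_1),
        "masked": mask_numbers(text),
        "frequency_hist": frequency_hist(text),
        "len_text": len(text),
        "n_tokens": len(text.split()),  # whitespace-separated words
    }


row = describe_element("Total: 42 EUR", 10.0, 30.0, 100.0, 120.0)
print(row["masked"], row["len_text"], row["n_tokens"])
```

Each returned dict mirrors one row of the dataframe, restricted to the columns that can be reproduced from the text and its bounding coordinates alone.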
Additionally, a dictionary is returned containing the following entries,
which can be used to transform the absolute CSV coordinates:
x0: Left x page crop box coordinate
x1: Right x page crop box coordinate
y0: Top y page crop box coordinate
y1: Bottom y page crop box coordinate
x0page: Left x page coordinate
x1page: Right x page coordinate
y0page: Top y page coordinate
y1page: Bottom y page coordinate
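One way these entries could be used is to map an absolute position into a range relative to the page crop box. The sketch below assumes a single dictionary with the keys x0, x1, y0 and y1 for the page in question. The function name to_relative and the normalization to [0, 1] are illustrative choices, not part of the package.

```python
def to_relative(pos_x: float, pos_y: float, box: dict) -> tuple:
    """Map an absolute position into the [0, 1] range of the crop box.

    0 corresponds to the left/top edge (x0, y0) and 1 to the right/bottom
    edge (x1, y1), regardless of whether y grows upwards or downwards.
    """
    width = box["x1"] - box["x0"]
    height = box["y1"] - box["y0"]
    if width == 0 or height == 0:
        raise ValueError("crop box has zero width or height")
    return ((pos_x - box["x0"]) / width, (pos_y - box["y0"]) / height)


# Hypothetical crop box values, chosen only for illustration.
box = {"x0": 0.0, "x1": 200.0, "y0": 0.0, "y1": 400.0}
print(to_relative(50.0, 100.0, box))
```

Because the offset and the extent are both measured from (x0, y0), the result does not depend on the direction of the y axis.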
The dataframe and the dictionary of page characteristics are both returned in a single dictionary when using convert().
The dataframe is stored as “content”, the page characteristics as “media_boxes”, the textual content as “text” and the number of pages as “page_count”.
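As a rough sketch of how the dictionary returned by convert() might be consumed, the helper below checks for the four keys named above and builds a short summary. The helper summarize and the stand-in result dictionary are hypothetical. In real use, result would come from converter.convert() and "content" would be a pandas dataframe, for which len() gives the number of rows.

```python
EXPECTED_KEYS = ("content", "media_boxes", "text", "page_count")


def summarize(result: dict) -> str:
    """Return a one-line summary of a convert()-style result dictionary."""
    missing = [key for key in EXPECTED_KEYS if key not in result]
    if missing:
        raise KeyError(f"missing entries: {missing}")
    n_elements = len(result["content"])  # rows of the dataframe
    n_chars = len(result["text"])
    return (f"{result['page_count']} page(s), "
            f"{n_elements} element(s), {n_chars} character(s) of text")


# Stand-in for the output of convert(); a list replaces the dataframe here.
result = {
    "content": [{"id": 0, "page": 0, "text": "Hello"}],
    "media_boxes": {"x0": 0, "x1": 200, "y0": 0, "y1": 400},
    "text": "Hello",
    "page_count": 1,
}
print(summarize(result))
```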
Acknowledgements
This work is built on top of the pdfminer project https://github.com/euske/pdfminer.
For personal and professional use. You cannot resell or redistribute these repositories in their original state.