PDF Documents

Actions

Aiviro provides several actions for working with PDF documents.

Cut

class aiviro.actions.documents.pdf.cut.PDFFindAndCutPages

Cuts the PDF file at the last valid page, which is found by searching for constant texts (e.g. page 1/2). This avoids parsing unrelated and unnecessary information.

Returns: Path to the cut PDF file, either the original path or a new one with the pages cut.
Example:

>>> from aiviro.actions.documents.pdf import PDFFindAndCutPages
>>> from aiviro.modules.pdf import create_pdf_robot
>>> r = create_pdf_robot()
>>> r.parse(pdf_file="path/to/file.pdf")
>>> cut_pdf_path = PDFFindAndCutPages()(r)
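
The idea of locating the last valid page from page markers can be illustrated with a small standalone sketch. It is not Aiviro's implementation: the marker pattern, the function name `count_pages_to_keep` and the rule "keep pages while the markers continue the sequence 1/N, 2/N, ..." are all assumptions made for illustration.

```python
import re

# Hypothetical marker pattern such as "page 1/2" or "Page 3 / 4".
PAGE_MARKER = re.compile(r"page\s*(\d+)\s*/\s*(\d+)", re.IGNORECASE)


def count_pages_to_keep(page_texts: list[str]) -> int:
    """Return how many leading pages belong to the first document.

    Assumption: the first page carries a marker "page 1/N" and the following
    pages are kept while their markers continue the sequence 2/N, 3/N, ...
    If no usable marker is found, all pages are kept.
    """
    if not page_texts:
        return 0
    first = PAGE_MARKER.search(page_texts[0])
    if first is None or int(first.group(1)) != 1:
        return len(page_texts)  # nothing to go on, keep the file as it is
    total = int(first.group(2))
    keep = 1
    for expected, text in enumerate(page_texts[1:total], start=2):
        match = PAGE_MARKER.search(text)
        if match is None:
            break
        if (int(match.group(1)), int(match.group(2))) != (expected, total):
            break
        keep += 1
    return keep
```

With the page texts `["Invoice page 1/2", "Totals page 2/2", "Terms and conditions"]` the sketch keeps the first two pages and drops the unrelated third one.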

Extract

These actions extract information from a PDF file.

class aiviro.actions.documents.pdf.extract.PDFExtractTextBoxes(force_ocr: bool = False)

Extracts bounding boxes with texts from the provided PDF file. By default, it tries digital extraction first and falls back to OCR. If force_ocr is set to True, only OCR extraction is used.

Parameters: force_ocr – If True, forces OCR extraction only.
Returns: A dictionary mapping each page index to the list of bounding boxes extracted from that page.
Example:

>>> from aiviro.actions.documents.pdf import PDFExtractTextBoxes
>>> from aiviro.modules.pdf import create_pdf_robot
>>> r = create_pdf_robot()
>>> r.parse(pdf_file="path/to/file.pdf")
>>> # this action will try digital extraction first and OCR as a fallback
>>> page_boxes = PDFExtractTextBoxes()(r)

>>> from aiviro.actions.documents.pdf import PDFExtractTextBoxes
>>> from aiviro.modules.pdf import create_pdf_robot
>>> # we want to process pdf via OCR
>>> r = create_pdf_robot(pdf_path="path/to/file.pdf")
>>> page_boxes = PDFExtractTextBoxes(force_ocr=True)(r)
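
The "digital first, OCR as a fallback" behaviour can be pictured with the following standalone sketch. It only models the control flow; the `extract_boxes` helper, the `PageBoxes` stand-in type and the rule that an empty digital result triggers the fallback are assumptions, not Aiviro's actual code.

```python
from typing import Callable

# Stand-in type: page index -> list of extracted text snippets.
PageBoxes = dict[int, list[str]]


def extract_boxes(
    digital: Callable[[], PageBoxes],
    ocr: Callable[[], PageBoxes],
    force_ocr: bool = False,
) -> PageBoxes:
    """Try digital extraction first and fall back to OCR when it yields nothing."""
    if force_ocr:
        return ocr()
    boxes = digital()
    # Assumption: "no boxes on any page" is what triggers the OCR fallback.
    if any(boxes.values()):
        return boxes
    return ocr()
```

Passing the two extraction strategies as callables keeps the slower OCR step from running unless it is needed.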

class aiviro.actions.documents.pdf.extract.PDFExtractComponents(pdf_path: Path | str)

Returns texts, comments, file attachments, images and number of pages from the provided pdf file.

Parameters: pdf_path – Path to the PDF file.
Returns: Extracted data of type PDFComponents.
Example:

>>> from aiviro.modules.pdf import create_pdf_robot
>>> from aiviro.actions.documents.pdf import PDFExtractComponents
>>> # we want to see for example comments and attachments from the provided pdf file
>>> r = create_pdf_robot()
>>> extracted_data = PDFExtractComponents(pdf_path="path/to/file.pdf")(r)
>>> print(extracted_data.comments, extracted_data.file_attachments)
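
Extracted file attachments carry a name and raw bytes (see PDFAttachment under Schemas), so a common next step is writing them to disk. The sketch below uses a stand-in `Attachment` dataclass instead of the real model, and the `save_attachments` helper is hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Attachment:
    """Stand-in mirroring the documented PDFAttachment fields (name, content)."""

    name: str
    content: bytes


def save_attachments(attachments: list[Attachment], target_dir: Path) -> list[Path]:
    """Write each attachment's bytes into target_dir and return the written paths."""
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for attachment in attachments:
        # Keep only the final path component so a name cannot escape target_dir.
        destination = target_dir / Path(attachment.name).name
        destination.write_bytes(attachment.content)
        written.append(destination)
    return written
```

Stripping directory components from the attachment name is a precaution, since names embedded in a PDF come from whoever created the file.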

class aiviro.actions.documents.pdf.extract.PDFConvertToImage(pdf_path: Path | str, pdf_config: PDFConfig | None = None)

Converts the pages of the provided PDF file to images.

Parameters:

pdf_path – Path to the pdf file.
pdf_config – Optional config used for the pdf file processing.

Returns:

List containing numpy.ndarray images converted from the pdf pages.

Example:

>>> from aiviro.modules.pdf import create_pdf_robot
>>> from aiviro.actions.documents.pdf import PDFConvertToImage
>>> r = create_pdf_robot()
>>> pages = PDFConvertToImage(pdf_path="path/to/file.pdf")(r)

Modify

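class aiviro.actions.documents.pdf.modify.PDFAddImage(pdf_path, image_path, coordinates, width, height, page_index, preserve_ratio, output_path, pdf_config)

Adds an image to the provided PDF file on the required page.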

Parameters:

pdf_path – Path to the PDF file.
image_path – Path to the image file which should be added to the PDF.
coordinates – Coordinates where to add the image (x, y).
width – Width to which the image should be resized.
height – Height to which the image should be resized.
page_index – Which page to add the image to.
preserve_ratio – Whether to keep the aspect ratio of the image. If both width and height are given (not None), the method creates a frame of the given size and places the image in its center while maintaining the aspect ratio.
output_path – Optional path where to save the modified PDF file (if not provided, the original file will be overwritten). If path refers to a directory, the modified PDF will be saved to the directory, otherwise it will be saved to the provided filepath.
pdf_config – Optional config used for the pdf file processing.

Returns:

Path to the modified PDF file.

Raises:

PDFImageAdditionError – If there was an error during the image addition.

Example:

>>> from aiviro.actions.documents.pdf import PDFAddImage
>>> from aiviro.modules.pdf import create_pdf_robot
>>> r = create_pdf_robot()
>>> # adds image on the provided coordinates into the supplied pdf
>>> modified_pdf = PDFAddImage(
...     pdf_path="path/to/file.pdf",
...     image_path="path/to/img.png",
...     coordinates=(150, 150),
...     preserve_ratio=True,
... )(robot=r)
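
The preserve_ratio behaviour, fitting an image into a width × height frame and centering it, comes down to a little arithmetic. The sketch below shows that arithmetic only; the `fit_in_frame` helper is hypothetical and the library may round or position slightly differently.

```python
def fit_in_frame(
    img_w: float, img_h: float, frame_w: float, frame_h: float
) -> tuple[float, float, float, float]:
    """Scale an image to fit inside a frame, keeping its aspect ratio.

    Returns (offset_x, offset_y, new_w, new_h), where the offsets position the
    scaled image in the center of the frame.
    """
    # The smaller of the two scale factors guarantees the image fits both ways.
    scale = min(frame_w / img_w, frame_h / img_h)
    new_w, new_h = img_w * scale, img_h * scale
    return (frame_w - new_w) / 2, (frame_h - new_h) / 2, new_w, new_h
```

For a 400 × 200 image and a 100 × 100 frame, the image is scaled to 100 × 50 and shifted 25 units along the vertical axis so that it sits in the middle of the frame.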

class aiviro.actions.documents.pdf.modify.PDFAddText(pdf_path: str | Path, text_to_add: str, coordinates: tuple[int, int], font: PDFTextFonts = PDFTextFonts.TIMES_ROMAN, font_size: int = 12, page_index: int = 0, output_path: str | Path | None = None, pdf_config: PDFConfig | None = None)

Adds text to the provided PDF file on the required page.

Parameters:

pdf_path – Path to the PDF file.
text_to_add – Text that should be added to the PDF file.
coordinates – Coordinates where to add the text (x, y).
font – Which font to use.
font_size – Size of the text.
page_index – Which page to add the text to.
output_path – Optional path where to save the modified PDF file (if not provided, a new filename is generated). If path refers to a directory, the modified PDF will be saved to the directory, otherwise it will be saved to the provided filepath.
pdf_config – Optional config used for the pdf file processing.

Returns:

Path to the modified PDF file.

Example:

>>> from aiviro.actions.documents.pdf import PDFAddText
>>> from aiviro.modules.pdf import create_pdf_robot
>>> # add text to the top left corner of the second page
>>> r = create_pdf_robot()
>>> modified_pdf = PDFAddText("path_to_file.pdf", "My Special Text", (20, 20), page_index=1)(r)

>>> from aiviro.actions.documents.pdf import PDFAddText
>>> from aiviro.modules.pdf import create_pdf_robot
>>> # add text to the middle right of the first page, in Helvetica at size 30
>>> r = create_pdf_robot()
>>> modified_pdf = PDFAddText(
...     pdf_path="another.pdf",
...     text_to_add="TimeStamp",
...     coordinates=(400, 400),
...     font="Helvetica",
...     font_size=30,
...     page_index=0,
... )(r)
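
The output_path rule shared by PDFAddImage and PDFAddText (a directory means "save into it", anything else is taken as the target file path) can be sketched with pathlib. The `resolve_output_path` helper is hypothetical, and keeping the original file name inside a directory is an assumption, not documented behaviour.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_output_path(
    pdf_path: Union[str, Path], output_path: Optional[Union[str, Path]]
) -> Optional[Path]:
    """Illustrate how a directory or file output_path could be resolved."""
    if output_path is None:
        return None  # the action's own documented default applies
    output_path = Path(output_path)
    if output_path.is_dir():
        # Assumption: the original file name is reused inside the directory.
        return output_path / Path(pdf_path).name
    return output_path
```

Note that `Path.is_dir()` is only true for directories that already exist, so a not-yet-created directory would be treated as a file path in this sketch.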

class aiviro.actions.documents.pdf.modify.PDFMergeFiles(files: list[str | Path])

Merges the given PDF and image files into a single PDF file, in the order in which they appear in the list.

Parameters: files – List of PDF and image files which should be merged. Supported file formats are PDF, PNG, JPG and JPEG.
Raises: ValueError – If the list of files is empty or an unsupported file format is given.
Returns: Path to the merged PDF file. The file is temporary and will be deleted after the robot is closed.
Example:

>>> from aiviro.actions.documents.pdf import PDFMergeFiles
>>> from aiviro.modules.pdf import create_pdf_robot
>>> r = create_pdf_robot()
>>> # merge multiple files into a single PDF file
>>> merged_pdf_path = PDFMergeFiles(
...     files=["path/to/file_1.jpg", "path/to/file_2.pdf", "path/to/file_3.png", "path/to/file_4.pdf"]
... )(robot=r)
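
Because PDFMergeFiles raises ValueError for an empty list or an unsupported format, it can be handy to validate the input before building the list. The sketch below is a hypothetical pre-check, not the library's own validation; in particular, detecting the format from the file suffix, case-insensitively, is an assumption.

```python
from pathlib import Path
from typing import Union

# Formats listed as supported in the PDFMergeFiles documentation.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def check_mergeable(files: list[Union[str, Path]]) -> list[Path]:
    """Return the files as Path objects, or raise ValueError if they cannot be merged."""
    if not files:
        raise ValueError("No files to merge.")
    paths = [Path(f) for f in files]
    unsupported = [p for p in paths if p.suffix.lower() not in SUPPORTED_SUFFIXES]
    if unsupported:
        names = ", ".join(str(p) for p in unsupported)
        raise ValueError(f"Unsupported file format: {names}")
    return paths
```

The returned list preserves the input order, which matters because the merged PDF follows the order of the list.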

Rotate

class aiviro.actions.documents.pdf.rotate.PDFAutoRotatePages(pdf_path: str | Path, pdf_config: PDFConfig | None = None)

Automatically rotates the pages of a PDF file. Works only with non-digital (e.g. scanned) files.

Parameters:

pdf_path – Path to the pdf file.
pdf_config – Optional config used for the pdf file processing.

Returns:

List of images with the rotated pdf pages.

Example:

>>> from aiviro.actions.documents.pdf import PDFAutoRotatePages
>>> from aiviro.modules.pdf import create_pdf_robot
>>> r = create_pdf_robot()
>>> rotated_pages = PDFAutoRotatePages(pdf_path="path/to/file.pdf")(r)
>>> # use these 'rotated_pages' for further processing

Schemas

pydantic model aiviro.actions.documents.pdf.extract.schemas.PDFComment

Represents a PDF comment, containing the text and author.

field author: str [Required]

field text: str [Required]

pydantic model aiviro.actions.documents.pdf.extract.schemas.PDFAttachment

Represents a PDF attachment, containing the attachment name and its content as bytes.

field content: bytes [Required]

field name: str [Required]

pydantic model aiviro.actions.documents.pdf.extract.schemas.PDFExtractionResult

Represents the result of extracting bounding boxes, comments and file attachments from a PDF file.

field comments: list[PDFComment] [Optional]

field file_attachments: list[PDFAttachment] [Optional]

field page_boxes: dict[int, list[BoundBox]] [Optional]

pydantic model aiviro.actions.documents.pdf.extract.schemas.PDFComponents

Represents the result of extracting texts, comments, file attachments, images and number of pages from a PDF file.

field comments: list[PDFComment] [Optional]

field file_attachments: list[PDFAttachment] [Optional]

field images: dict[int, list[ndarray]] [Optional]

field number_of_pages: int = 0

Constraints:

ge = 0

field texts: dict[int, list[str]] [Optional]

pydantic model aiviro.modules.pdf.convertor.PDFConfig

Some of the PDF actions (see PDF Documents) accept this configuration object, which adjusts how the PDF file is processed.

field dpi: int = 72

field poppler_path: str | Path | None = None

field timeout: int = 30
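
As a rough guide to choosing dpi: PDF page sizes are measured in points (1/72 inch), so a page rendered at a given resolution is approximately its size in points scaled by dpi / 72. The sketch below shows only this arithmetic; it assumes dpi is the resolution used when pages are converted to images, the `rendered_size` helper is hypothetical, and a real renderer may differ by a pixel due to rounding.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points


def rendered_size(width_pt: float, height_pt: float, dpi: int = 72) -> tuple[int, int]:
    """Approximate pixel size of a page rendered at the given dpi."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)
```

An A4 page of 595 × 842 points comes out at about 595 × 842 pixels with the default dpi of 72 and about 2479 × 3508 pixels at 300 dpi, which is why higher values tend to help OCR at the cost of memory and processing time.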