PDF Documents

Actions

Aiviro provides multiple actions for working with PDF documents.

Cut

class aiviro.actions.documents.pdf.cut.PDFSplitPages(pdf_path: str | Path, split_page_ranges: list[str], output_dir: str | Path | None = None)

Splits the pdf file into multiple files based on the provided page ranges.

Parameters:
  • pdf_path – Path to the pdf file.

  • split_page_ranges – List of page ranges to split the PDF file into.

  • output_dir – Output folder in which to save the split PDF files; if not provided, a temporary folder is used.

Returns:

List of paths to the split PDF files.
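To make the range strings concrete, the following is a minimal sketch of how a list such as ["1-2", "3-5", "8"] could be expanded into page numbers, one group per output file. The helper expand_page_range is hypothetical and not part of Aiviro; it assumes that both bounds of a range are inclusive, which the description above does not state explicitly.

```python
def expand_page_range(page_range: str) -> list[int]:
    """Expand "3-5" into [3, 4, 5] and "8" into [8] (hypothetical helper)."""
    start_text, sep, end_text = page_range.partition("-")
    start = int(start_text)
    # Assumption: a single number denotes a one-page range.
    end = int(end_text) if sep else start
    if end < start:
        raise ValueError(f"invalid page range: {page_range!r}")
    return list(range(start, end + 1))


split_page_ranges = ["1-2", "3-5", "8"]
# One list of page numbers per output file.
groups = [expand_page_range(r) for r in split_page_ranges]
print(groups)  # [[1, 2], [3, 4, 5], [8]]
```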

Example:

>>> from aiviro.actions.documents.pdf import PDFSplitPages
>>> from aiviro.modules.pdf import create_pdf_robot
>>>
>>> r = create_pdf_robot()
>>> split_pdf_paths = PDFSplitPages(
...     pdf_path="path/to/file.pdf",
...     split_page_ranges=["1-2", "3-5", "8"]  # split into 3 files
... )(r)

class aiviro.actions.documents.pdf.cut.PDFFindAndCutPages(output_dir: str | Path | None = None)

Cuts the PDF file at the last valid page, which is found by searching for constant texts (e.g. page 1/2), to avoid parsing unrelated and unnecessary information.

Parameters:

output_dir – Output folder in which to save the cut PDF file; if not provided, a temporary folder is used.

Returns:

Path to the cut PDF file - either original path or a new one with cut pages.
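As an illustration of the idea only, the sketch below scans per-page texts for a "page X/Y" counter and reports the last page that still belongs to the document. The helper last_valid_page, the regular expression, and the stopping rules are all assumptions made for this example; the source does not describe how Aiviro detects the markers.

```python
import re

# Assumed marker format, modelled on the "page 1/2" example.
PAGE_MARKER = re.compile(r"page\s+(\d+)\s*/\s*(\d+)", re.IGNORECASE)


def last_valid_page(page_texts: list[str]) -> int:
    """Return the 1-based number of the last valid page (hypothetical helper).

    If the first page carries no marker, all pages are kept.
    """
    valid = 0
    for index, text in enumerate(page_texts):
        match = PAGE_MARKER.search(text)
        if match is None:
            break
        current, total = int(match.group(1)), int(match.group(2))
        # Stop at the first page whose counter breaks the expected sequence.
        if current != index + 1 or current > total:
            break
        valid = current
        if current == total:
            break
    return valid or len(page_texts)


pages = ["Invoice ... page 1/2", "Totals ... page 2/2", "Terms and conditions"]
print(last_valid_page(pages))  # 2
```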

Example:

>>> from aiviro.actions.documents.pdf import PDFFindAndCutPages
>>> from aiviro.modules.pdf import create_pdf_robot
>>>
>>> r = create_pdf_robot()
>>> r.parse(pdf_file="path/to/file.pdf")
>>> cut_pdf_path = PDFFindAndCutPages()(r)

Extract

These actions allow you to obtain information from a PDF file.

class aiviro.actions.documents.pdf.extract.PDFExtractTextBoxes(pdf_path: str | Path, force_ocr: bool = False, dpi: int = 200)

Extracts bounding boxes with texts from the provided PDF file. By default, it first tries digital extraction and then OCR. If force_ocr is set to True, only OCR extraction is used.

Parameters:
  • pdf_path – Path to the pdf file.

  • force_ocr – If True, forces OCR extraction only.

  • dpi – DPI used for the text extraction.

Returns:

A dictionary of page index with a list of its extracted bounding boxes.
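The digital-first, OCR-as-fallback behaviour can be pictured with the small sketch below. It is not Aiviro's implementation: extract_boxes and the two stand-in extractor functions are hypothetical, and treating an empty digital result as the trigger for OCR is an assumption.

```python
from typing import Callable

PageBoxes = dict[int, list[str]]  # page index -> extracted items (simplified)


def extract_boxes(
    digital: Callable[[], PageBoxes],
    ocr: Callable[[], PageBoxes],
    force_ocr: bool = False,
) -> PageBoxes:
    """Try digital extraction first and fall back to OCR (hypothetical helper)."""
    if not force_ocr:
        boxes = digital()
        # Assumption: an empty result means the PDF has no digital text layer.
        if any(boxes.values()):
            return boxes
    return ocr()


def digital_stub() -> PageBoxes:
    return {0: []}  # stand-in: nothing found digitally


def ocr_stub() -> PageBoxes:
    return {0: ["Invoice 42"]}  # stand-in: OCR recognises the text


print(extract_boxes(digital_stub, ocr_stub))  # {0: ['Invoice 42']}
```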

Example:

>>> from aiviro.actions.documents.pdf import PDFExtractTextBoxes
>>> from aiviro.modules.pdf import create_pdf_robot
>>>
>>> r = create_pdf_robot()
>>> # this action will try digital extraction first and OCR as a fallback
>>> page_boxes = PDFExtractTextBoxes("path/to/file.pdf")(r)
>>> # we want to process pdf via OCR
>>> page_boxes = PDFExtractTextBoxes(
...     pdf_path="path/to/file.pdf",
...     force_ocr=True
... )(r)

class aiviro.actions.documents.pdf.extract.PDFExtractComponents(pdf_path: Path | str)

Returns texts, comments, file attachments, images, and the number of pages from the provided PDF file.

Parameters:

pdf_path – Path to the pdf file.

Returns:

Extracted data of type PDFComponents.

Example:

>>> from aiviro.modules.pdf import create_pdf_robot
>>> from aiviro.actions.documents.pdf import PDFExtractComponents
>>>
>>> # we want to see, for example, comments and attachments from the provided pdf file
>>> r = create_pdf_robot()
>>> extracted_data = PDFExtractComponents(pdf_path="path/to/file.pdf")(r)
>>> print(extracted_data.comments, extracted_data.file_attachments)

class aiviro.actions.documents.pdf.extract.PDFConvertToImage(pdf_path: Path | str, max_size: int = 3840000, max_dpi: int = 200)

Converts the pages of the provided PDF file to images.

Parameters:
  • pdf_path – Path to the PDF file.

  • max_size – Maximum size of the generated image (combination of width and height).

  • max_dpi – Maximum DPI to use for the conversion.

Returns:

List containing numpy.ndarray images converted from the pdf pages.
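How the two limits might interact is sketched below. The helper choose_dpi is hypothetical, and the sketch assumes that max_size is a pixel budget (width multiplied by height); the parameter description only says "combination of width and height", so this reading is not confirmed by the source.

```python
import math


def choose_dpi(width_pt: float, height_pt: float,
               max_size: int = 3_840_000, max_dpi: int = 200) -> float:
    """Pick the largest DPI that respects both limits (hypothetical helper).

    Assumption: max_size is a pixel budget, i.e. width * height in pixels.
    Page dimensions are given in PDF points (72 points per inch).
    """
    width_in, height_in = width_pt / 72, height_pt / 72
    # pixels = (width_in * dpi) * (height_in * dpi) must not exceed max_size
    dpi_for_size = math.sqrt(max_size / (width_in * height_in))
    return min(max_dpi, dpi_for_size)


# A4 page: 595 x 842 points.
dpi = choose_dpi(595, 842)
print(round(dpi, 1))
```

Under this assumption, an A4 page rendered at the full 200 DPI would slightly exceed the default pixel budget, so the sketch returns a value just below 200.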

Example:

>>> from aiviro.modules.pdf import create_pdf_robot
>>> from aiviro.actions.documents.pdf import PDFConvertToImage
>>>
>>> r = create_pdf_robot()
>>> pages = PDFConvertToImage(pdf_path="path/to/file.pdf")(r)

Modify

class aiviro.actions.documents.pdf.modify.PDFAddText(pdf_path: str | Path, text_to_add: str, coordinates: tuple[int, int], font: PDFTextFonts = PDFTextFonts.TIMES_ROMAN, font_size: int = 12, page_index: int | str = 0, output_path: str | Path | None = None, as_underlay: bool = False, compensate_page_rotation: bool = True)

Adds text to the provided PDF file on the required page.

Parameters:
  • pdf_path – Path to the PDF file.

  • text_to_add – Text that should be added to the PDF file.

  • coordinates – Coordinates where to add the text (x, y).

  • font – Which font to use.

  • font_size – How large should the text be.

  • page_index – Which page to add the text to; can be a range of pages, e.g. 0, "1-3", "1-", "1,3,5".

  • output_path – Optional path where to save the modified PDF file (if not provided, a new filename is generated). If path refers to a directory, the modified PDF will be saved to the directory, otherwise it will be saved to the provided filepath.

  • as_underlay – If True, the text is added as an underlay, otherwise as an overlay.

  • compensate_page_rotation – If True, the coordinates are adjusted according to the page rotation. Visually, the text will be added in the same position regardless of the page rotation.

Returns:

Path to the modified PDF file.
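The accepted page_index forms (0, "1-3", "1-", "1,3,5") could be expanded into explicit indices roughly as in the sketch below. The helper resolve_page_indices is hypothetical and not part of Aiviro; it assumes zero-based indices, inclusive ranges, and that an open range such as "1-" runs to the last page.

```python
from typing import Union


def resolve_page_indices(page_index: Union[int, str], page_count: int) -> list[int]:
    """Expand a page_index value into explicit page indices (hypothetical helper)."""
    if isinstance(page_index, int):
        return [page_index]
    indices: list[int] = []
    for part in page_index.split(","):
        start_text, sep, end_text = part.strip().partition("-")
        start = int(start_text)
        if not sep:
            end = start            # single page, e.g. "3"
        elif end_text:
            end = int(end_text)    # closed range, e.g. "1-3"
        else:
            end = page_count - 1   # open range, e.g. "1-"
        indices.extend(range(start, end + 1))
    return indices


for spec in (0, "1-3", "1-", "1,3,5"):
    print(spec, resolve_page_indices(spec, page_count=6))
```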

Example:

>>> from aiviro.actions.documents.pdf import PDFAddText
>>> from aiviro.actions.documents.pdf.constants import PDFTextFonts
>>> from aiviro.modules.pdf import create_pdf_robot
>>>
>>> # add text to the top left corner of the second page
>>> r = create_pdf_robot()
>>> modified_pdf = PDFAddText("path_to_file.pdf", "My Special Text", (20, 20), page_index=1)(r)
>>> # add text to the middle right of the first page in Helvetica, size 30
>>> modified_pdf = PDFAddText(
...     pdf_path="another.pdf",
...     text_to_add="TimeStamp",
...     coordinates=(400, 400),
...     font=PDFTextFonts.HELVETICA,
...     font_size=30,
...     page_index=0
... )(r)
>>> # add text to multiple pages
>>> modified_pdf = PDFAddText(
...     pdf_path="another.pdf",
...     text_to_add="TimeStamp",
...     coordinates=(400, 400),
...     font=PDFTextFonts.HELVETICA,
...     font_size=30,
...     page_index="1-3"  # add text to pages 1, 2 and 3
... )(r)

class aiviro.actions.documents.pdf.modify.PDFAddImage(pdf_path: str | Path, image_path: str | Path, coordinates: tuple[int, int], width: int | None = None, height: int | None = None, page_index: int | str = 0, preserve_ratio: bool = True, output_path: str | Path | None = None, as_underlay: bool = False, compensate_page_rotation: bool = True)

Adds an image to the provided PDF file on the required page.

Parameters:
  • pdf_path – Path to the PDF file.

  • image_path – Path to the image file which should be added to the PDF.

  • coordinates – Coordinates where to add the image (x, y).

  • width – Width to which the image should be resized.

  • height – Height to which the image should be resized.

  • page_index – Which page to add the image to; can be a range of pages, e.g. 0, "1-3", "1-", "1,3,5".

  • preserve_ratio – Whether to keep the aspect ratio of the image. If both width and height are given, the method creates a frame of the given size and adds the image to its center while maintaining the aspect ratio.

  • output_path – Optional path where to save the modified PDF file (if not provided, the original file will be overwritten). If path refers to a directory, the modified PDF will be saved to the directory, otherwise it will be saved to the provided filepath.

  • as_underlay – If True, the image is added as an underlay, otherwise as an overlay.

  • compensate_page_rotation – If True, the coordinates are adjusted according to the page rotation. Visually, the image will be added in the same position regardless of the page rotation.

Returns:

Path to the modified PDF file.

Raises:

PDFImageAdditionError – If there was an error during the image addition.
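The preserve_ratio behaviour with both width and height set (a frame of the given size with the image centered inside it) can be worked through numerically. The helper fit_in_frame below is a hypothetical sketch of that geometry and is not taken from Aiviro.

```python
def fit_in_frame(img_w: int, img_h: int,
                 frame_w: int, frame_h: int) -> tuple[float, float, float, float]:
    """Scale an image to fit a frame, keep its aspect ratio and center it.

    Returns (width, height, x_offset, y_offset) relative to the frame origin.
    Hypothetical helper, not part of Aiviro.
    """
    # The smaller of the two ratios guarantees the image fits in both directions.
    scale = min(frame_w / img_w, frame_h / img_h)
    new_w, new_h = img_w * scale, img_h * scale
    return new_w, new_h, (frame_w - new_w) / 2, (frame_h - new_h) / 2


# A 200x100 image placed into a 100x100 frame is halved and centered vertically.
print(fit_in_frame(200, 100, 100, 100))  # (100.0, 50.0, 0.0, 25.0)
```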

Example:

>>> from aiviro.actions.documents.pdf import PDFAddImage
>>> from aiviro.modules.pdf import create_pdf_robot
>>>
>>> r = create_pdf_robot()
>>> # adds image on the provided coordinates into the supplied pdf
>>> modified_pdf = PDFAddImage(
...    pdf_path="path/to/file.pdf",
...    image_path="path/to/img.png",
...    coordinates=(150, 150),
...    preserve_ratio=True,
... )(robot=r)

class aiviro.actions.documents.pdf.modify.PDFMergeFiles(files: list[str | Path], output_path: str | Path | None = None)

Merges the given PDF and image files into a single PDF file, in the order of the list.

Parameters:
  • files – List of PDF and Image files which should be merged. Supported file formats are PDF, PNG, JPG and JPEG.

  • output_path – Optional path where to save the merged PDF file; if not provided, a temporary folder is used.

Raises:

ValueError – If the list of files is empty or an unsupported file format is given.
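The two failure conditions can be checked up front before calling the action. The sketch below is a hypothetical pre-check, not Aiviro code; it assumes the format is judged by the file extension, compared case-insensitively.

```python
from pathlib import Path

# Formats listed as supported for merging.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def validate_merge_inputs(files: list) -> list[Path]:
    """Check a file list before merging (hypothetical helper)."""
    if not files:
        raise ValueError("the list of files is empty")
    paths = [Path(f) for f in files]
    for path in paths:
        # Assumption: the format is determined by the file extension.
        if path.suffix.lower() not in SUPPORTED_SUFFIXES:
            raise ValueError(f"unsupported file format: {path.name}")
    return paths


checked = validate_merge_inputs(["scan.JPG", "report.pdf"])
print([p.name for p in checked])  # ['scan.JPG', 'report.pdf']
```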

Returns:

Path to the merged PDF file. The path is temporary and will be deleted after the robot is closed.

Example:

>>> from aiviro.actions.documents.pdf import PDFMergeFiles
>>> from aiviro.modules.pdf import create_pdf_robot
>>>
>>> r = create_pdf_robot()
>>> # merge multiple files into a single PDF file
>>> merged_pdf_path = PDFMergeFiles(
...    files=["path/to/file_1.jpg", "path/to/file_2.pdf", "path/to/file_3.png", "path/to/file_4.pdf"]
... )(robot=r)

Rotate

class aiviro.actions.documents.pdf.rotate.PDFAutoRotatePages(pdf_path: str | Path)

Automatically rotates the pages of a PDF file. Works only with non-digital files.

Parameters:

pdf_path – Path to the pdf file.

Returns:

List of images with the rotated pdf pages.

Example:

>>> from aiviro.actions.documents.pdf import PDFAutoRotatePages
>>> from aiviro.modules.pdf import create_pdf_robot
>>> r = create_pdf_robot()
>>> rotated_pages = PDFAutoRotatePages(pdf_path="path/to/file.pdf")(r)
>>> # use these 'rotated_pages' for further processing

Data Schemas

pydantic model aiviro.actions.documents.pdf.extract.schemas.PDFComment

Represents a PDF comment, containing the text and author.

field author: str [Required]
field text: str [Required]

pydantic model aiviro.actions.documents.pdf.extract.schemas.PDFAttachment

Represents a PDF attachment, containing the attachment name and its content as bytes.

field content: bytes [Required]
field name: str [Required]

pydantic model aiviro.actions.documents.pdf.extract.schemas.PDFComponents

Represents the result of extracting texts, comments, file attachments, images and number of pages from a PDF file.

field comments: list[PDFComment] [Optional]
field file_attachments: list[PDFAttachment] [Optional]
field images: dict[int, list[ndarray]] [Optional]
field is_digital: bool = False
field number_of_pages: int = 0
Constraints:
  • ge = 0

field page_boxes: dict[int, list[BoundBox]] [Optional]
property texts: dict[int, list[str]]
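Since each item in file_attachments carries a name and its raw bytes, a typical follow-up step is writing the attachments to disk. The sketch below uses a stand-in dataclass with the same two fields instead of the real PDFAttachment model, so that it runs without Aiviro; save_attachments is a hypothetical helper.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AttachmentStandIn:
    """Stand-in mirroring the name and content fields of PDFAttachment."""
    name: str
    content: bytes


def save_attachments(attachments, target_dir: Path) -> list[Path]:
    """Write each attachment's bytes to target_dir under its own name."""
    saved = []
    for attachment in attachments:
        # Keep only the final path component so a name cannot escape target_dir.
        path = target_dir / Path(attachment.name).name
        path.write_bytes(attachment.content)
        saved.append(path)
    return saved


with tempfile.TemporaryDirectory() as tmp:
    saved = save_attachments([AttachmentStandIn("note.txt", b"hello")], Path(tmp))
    print([p.name for p in saved], saved[0].read_bytes())  # ['note.txt'] b'hello'
```
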
class aiviro.actions.documents.pdf.constants.PDFTextFonts(value)

Available fonts for the PDF text addition, based on the standard PDF Type 1 fonts.

COURIER = 'Courier'

Courier font

COURIER_BOLD = 'Courier-Bold'

Courier bold font

COURIER_BOLD_OBLIQUE = 'Courier-BoldOblique'

Courier bold oblique font

COURIER_OBLIQUE = 'Courier-Oblique'

Courier oblique font

HELVETICA = 'Helvetica'

Helvetica font

HELVETICA_BOLD = 'Helvetica-Bold'

Helvetica bold font

HELVETICA_BOLD_OBLIQUE = 'Helvetica-BoldOblique'

Helvetica bold oblique font

HELVETICA_OBLIQUE = 'Helvetica-Oblique'

Helvetica oblique font

SYMBOL = 'Symbol'

Symbol font

TIMES_ROMAN = 'Times-Roman'

Times-Roman font

TIMES_BOLD = 'Times-Bold'

Times-Bold font

TIMES_BOLD_ITALIC = 'Times-BoldItalic'

Times-BoldItalic font

TIMES_ITALIC = 'Times-Italic'

Times-Italic font

ZAPF_DINGBATS = 'ZapfDingbats'

ZapfDingbats font
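When a font name arrives as a plain string, for instance from a configuration file, it can be mapped to an enumeration member by value. The sketch below uses a stand-in enum with three of the members listed above and assumes that PDFTextFonts behaves like a standard Python Enum, which the class signature suggests but the source does not state.

```python
from enum import Enum


class FontStandIn(Enum):
    """Stand-in with three of the members listed above."""
    COURIER = "Courier"
    HELVETICA = "Helvetica"
    TIMES_ROMAN = "Times-Roman"


# Lookup by value returns the matching member.
font = FontStandIn("Helvetica")
print(font.name, font.value)  # HELVETICA Helvetica

# An unknown value raises ValueError, which makes typos easy to catch early.
try:
    FontStandIn("Comic Sans")
except ValueError as error:
    print(error)
```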