
archaeo_super_prompt.modeling.pdf_to_text.document_division

module archaeo_super_prompt.modeling.pdf_to_text.document_division

Utility functions to divide the pages of a PDF document into slices.

Functions

get_page_ranges(doc_page_number: int, page_batch_size: int, border_page_nb: int | None = None) -> list[PageRange]

Divide a number of pages into batch intervals.

If only the beginning and the end of the document are needed, only the first and the last pages are divided into batch intervals. Set the argument border_page_nb to enable this behaviour.

The number of pages in a batch should match the number of pages the remote LLM is able to process in parallel.

Parameters

  • doc_page_number : int the total number of pages in the document

  • page_batch_size : int the number of pages in a slice

  • border_page_nb : int | None if given, keep only this number of pages from the start and from the end of the document, so that the output ranges cover 2*border_page_nb pages in total
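
The behaviour described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: sketch_page_ranges is a hypothetical stand-in for get_page_ranges, and because the definition of PageRange is not shown here, ranges are represented as 0-based, half-open (start, end) tuples. Falling back to the whole document when the two borders would overlap is also an assumption.

```python
from __future__ import annotations


def sketch_page_ranges(
    doc_page_number: int,
    page_batch_size: int,
    border_page_nb: int | None = None,
) -> list[tuple[int, int]]:
    """Hypothetical stand-in for get_page_ranges.

    Returns 0-based, half-open (start, end) tuples instead of PageRange
    objects, whose real definition is not shown in this page.
    """
    if page_batch_size < 1:
        raise ValueError("page_batch_size must be at least 1")

    def batches(start: int, stop: int) -> list[tuple[int, int]]:
        # Cut [start, stop) into consecutive slices of at most page_batch_size pages.
        return [
            (i, min(i + page_batch_size, stop))
            for i in range(start, stop, page_batch_size)
        ]

    # Assumption: if the two borders would overlap, process the whole document.
    if border_page_nb is None or 2 * border_page_nb >= doc_page_number:
        return batches(0, doc_page_number)

    # Otherwise keep only the first and the last border_page_nb pages.
    return batches(0, border_page_nb) + batches(
        doc_page_number - border_page_nb, doc_page_number
    )


print(sketch_page_ranges(10, 4))                    # [(0, 4), (4, 8), (8, 10)]
print(sketch_page_ranges(20, 2, border_page_nb=3))  # [(0, 2), (2, 3), (17, 19), (19, 20)]
```

In the second call only 2*3 = 6 of the 20 pages are covered, split into batches of at most 2 pages on each side.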