archaeo_super_prompt.modeling.pdf_to_text.document_division
Utility functions to divide the pages of a PDF document into slices.
Functions
- get_page_ranges: Divide a number of pages into batch intervals.
get_page_ranges(doc_page_number: int, page_batch_size: int, border_page_nb: int | None = None) → list[PageRange]
Divide a number of pages into batch intervals.
If only the header and the footer of the document are wanted, set the argument border_page_nb: only the first pages and the last pages of the document are then divided into batch intervals.
The number of pages in a batch is set according to the number of pages the remote LLM is able to process in parallel.
Parameters
- doc_page_number (int): the total number of pages in the document
- page_batch_size (int): the number of pages in a slice
- border_page_nb (int | None): if given, only this number of pages from the start and from the end of the document (so 2*border_page_nb pages in total) are covered by the output ranges
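The behaviour described above can be illustrated with a minimal sketch. This is not the module's actual implementation: the definition of PageRange is not shown on this page, so the sketch assumes it is a 1-based, inclusive (first_page, last_page) pair, and the names get_page_ranges_sketch and _batches are hypothetical. It also assumes that when the two borders would cover or overlap the whole document, every page is batched once.

```python
from __future__ import annotations

# Assumption: the real PageRange type is not documented here; it is modelled
# as a 1-based, inclusive (first_page, last_page) pair for illustration.
PageRange = tuple[int, int]


def _batches(first: int, last: int, size: int) -> list[PageRange]:
    """Split the inclusive interval [first, last] into chunks of at most `size` pages."""
    return [
        (start, min(start + size - 1, last))
        for start in range(first, last + 1, size)
    ]


def get_page_ranges_sketch(
    doc_page_number: int,
    page_batch_size: int,
    border_page_nb: int | None = None,
) -> list[PageRange]:
    """Hypothetical re-creation of the documented behaviour, not the library code."""
    if page_batch_size < 1:
        raise ValueError("page_batch_size must be at least 1")

    # No border requested, or the borders would cover (or overlap across)
    # the whole document: batch every page exactly once.
    if border_page_nb is None or 2 * border_page_nb >= doc_page_number:
        return _batches(1, doc_page_number, page_batch_size)

    # Otherwise batch the first and the last `border_page_nb` pages separately.
    head = _batches(1, border_page_nb, page_batch_size)
    tail = _batches(
        doc_page_number - border_page_nb + 1, doc_page_number, page_batch_size
    )
    return head + tail
```

Under these assumptions, get_page_ranges_sketch(10, 4) returns [(1, 4), (5, 8), (9, 10)], and get_page_ranges_sketch(20, 2, border_page_nb=3) returns [(1, 2), (3, 3), (18, 19), (20, 20)].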