archaeo_super_prompt.modeling.pdf_to_text.chunking

Splits scanned documents into text chunks with layout metadata.

Functions

get_chunker — Return a Docling chunker configured with the tokenizer of the given embedding model.
get_chunks — Extract a list of labeled chunks across all pages of the document.
chunk_to_ds — Gather the labeled chunks of every document in the batch into a single dataframe.

get_chunker(embed_model_id: str, max_chunk_size: int)

Return a Docling chunker configured with the tokenizer of the given embedding model.

This tokenizer is fast even on the CPU, but it must be fetched from the Hugging Face repositories.
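The chunker depends on the embedding model's tokenizer because the chunk size is counted in that model's tokens, not in characters. The sketch below illustrates the idea only: `split_by_token_budget` is a hypothetical helper, and the whitespace tokenizer is an assumed stand-in for the real tokenizer. It is not the Docling implementation, which also takes the document layout into account.

```python
from typing import Callable


def split_by_token_budget(
    text: str,
    max_chunk_size: int,
    tokenize: Callable[[str], list[str]] = str.split,
) -> list[str]:
    """Cut text into pieces of at most max_chunk_size tokens.

    `tokenize` is a stand-in for the embedding model's tokenizer;
    whitespace splitting is used here only to keep the sketch offline.
    """
    if max_chunk_size < 1:
        raise ValueError("max_chunk_size must be at least 1")
    tokens = tokenize(text)
    # Walk the token list in steps of max_chunk_size and rejoin each slice.
    return [
        " ".join(tokens[start:start + max_chunk_size])
        for start in range(0, len(tokens), max_chunk_size)
    ]


chunks = split_by_token_budget("one two three four five", max_chunk_size=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Swapping in a different `tokenize` callable changes where the cuts fall, which is why the chunker has to be built from the same tokenizer as the embedding model.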

get_chunks(chunker: HybridChunker, document: Iterator[tuple[PageRange, CorrectlyConvertedDocument]]) → list[tuple[PageRange, BaseChunk]]

Extract a list of labeled chunks across all pages of the document.

Parameters

chunker : HybridChunker — the chunker that splits the text according to its layout and token count
document : Iterator[tuple[PageRange, CorrectlyConvertedDocument]] — the converted document, supplied as (page range, converted document) pairs
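The signature suggests that each chunk is labeled with the page range of the converted part it came from. A minimal sketch of that pattern follows, under loud assumptions: `label_chunks` is a hypothetical name, `PageRange` is modeled as a `(first_page, last_page)` tuple, plain strings stand in for `CorrectlyConvertedDocument` and `BaseChunk`, and an arbitrary callable plays the role of the chunker.

```python
from typing import Callable, Iterable, Iterator

# Stand-in type: the real PageRange may be defined differently.
PageRange = tuple[int, int]


def label_chunks(
    chunk: Callable[[str], Iterable[str]],
    document: Iterator[tuple[PageRange, str]],
) -> list[tuple[PageRange, str]]:
    """Chunk every part of the document and tag each chunk with its page range."""
    labeled: list[tuple[PageRange, str]] = []
    for page_range, part in document:
        # Every chunk produced from this part inherits the part's page range.
        for piece in chunk(part):
            labeled.append((page_range, piece))
    return labeled


parts = iter([((1, 2), "alpha\n\nbeta"), ((3, 3), "gamma")])
labeled = label_chunks(lambda text: text.split("\n\n"), parts)
print(labeled)  # [((1, 2), 'alpha'), ((1, 2), 'beta'), ((3, 3), 'gamma')]
```

The result is a flat list, so a chunk never spans two page ranges in this sketch; whether the real function behaves the same way is not stated in this page.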

chunk_to_ds(pairs: Iterator[tuple[tuple[InterventionId, Path], list[tuple[PageRange, BaseChunk]]]], chunker: HybridChunker) → PDFChunkDataset

Gather the labeled chunks of every document in the batch into a single dataframe.
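The nested `pairs` input can be pictured as being flattened into one row per chunk. The sketch below shows only that flattening step with standard-library types: `flatten_chunks` and the column names are hypothetical, an integer stands in for `InterventionId`, a `(first_page, last_page)` tuple for `PageRange`, and a string for `BaseChunk`. The actual columns of `PDFChunkDataset` and the role of the `chunker` argument are not documented here, so both are left out.

```python
from pathlib import Path
from typing import Any, Iterable

# Stand-in types, assumed for illustration only.
PageRange = tuple[int, int]
LabeledChunk = tuple[PageRange, str]
DocumentKey = tuple[int, Path]


def flatten_chunks(
    pairs: Iterable[tuple[DocumentKey, list[LabeledChunk]]],
) -> list[dict[str, Any]]:
    """Turn per-document chunk lists into one flat list of row dictionaries."""
    rows: list[dict[str, Any]] = []
    for (intervention_id, pdf_path), chunks in pairs:
        for chunk_index, ((first_page, last_page), text) in enumerate(chunks):
            rows.append(
                {
                    "intervention_id": intervention_id,
                    "filename": pdf_path.name,
                    "chunk_index": chunk_index,
                    "first_page": first_page,
                    "last_page": last_page,
                    "text": text,
                }
            )
    return rows


batch = [
    ((7, Path("reports/a.pdf")), [((1, 1), "intro"), ((2, 3), "findings")]),
    ((8, Path("reports/b.pdf")), [((1, 2), "summary")]),
]
rows = flatten_chunks(batch)
print(len(rows))  # 3
```

A list of dictionaries like `rows` can be passed directly to a dataframe constructor, which is one plausible way to obtain the tabular form described above.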