archaeo_super_prompt.modeling.pdf_to_text.chunking
Splits scanned documents into text chunks with layout metadata.
Functions
- get_chunker — Return a Docling Chunker model according to the tokenizer of one embedding model.
- get_chunks — Extract a list of labeled chunks from all the pages of the document.
- chunk_to_ds — Gather the lists of labeled chunks into one dataframe for the whole document batch.
get_chunker(embed_model_id: str, max_chunk_size: int)
Return a Docling Chunker model according to the tokenizer of one embedding model.
This tokenizer is fast even on CPU, but it must be fetched from the Hugging Face repositories.
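A minimal sketch of how such a chunker could be built, assuming a recent Docling release in which HybridChunker takes a HuggingFaceTokenizer wrapper. The name make_chunker_sketch is hypothetical and this is not necessarily how get_chunker is implemented. The imports are deferred to call time, so the definition loads even where Docling is not installed; calling it needs docling, transformers, and network access on first use.

```python
def make_chunker_sketch(embed_model_id: str, max_chunk_size: int):
    """Build a HybridChunker whose token budget follows one embedding model.

    Hypothetical helper for illustration, not the module's own code.
    """
    # Deferred imports: only needed when the function is actually called.
    from docling.chunking import HybridChunker
    from docling_core.transforms.chunker.tokenizer.huggingface import (
        HuggingFaceTokenizer,
    )
    from transformers import AutoTokenizer

    # Downloads the tokenizer files on first use, then reads the local cache.
    hf_tokenizer = AutoTokenizer.from_pretrained(embed_model_id)

    # max_tokens caps the size of each chunk, counted in model tokens.
    tokenizer = HuggingFaceTokenizer(
        tokenizer=hf_tokenizer,
        max_tokens=max_chunk_size,
    )
    return HybridChunker(tokenizer=tokenizer, merge_peers=True)
```

A call could look like make_chunker_sketch("sentence-transformers/all-MiniLM-L6-v2", 256), where the model id is only an example.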
get_chunks(chunker: HybridChunker, document: Iterator[tuple[PageRange, CorrectlyConvertedDocument]]) → list[tuple[PageRange, BaseChunk]]
Extract a list of labeled chunks from all the pages of the document.
Parameters
- chunker : HybridChunker — the chunker model, which splits the text according to the layout and the tokenization
- document : Iterator[tuple[PageRange, CorrectlyConvertedDocument]] — the document, given as (page range, converted document) pairs, one per page or group of pages
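The labeling step can be sketched with plain Python stand-ins. Everything here is an assumption made for illustration: PageRange is modeled as a (first, last) tuple, the converted documents are plain strings, and WordPairChunker is a toy object exposing the same chunk(dl_doc=...) method that Docling chunkers provide. The actual get_chunks may differ.

```python
from typing import Any, Iterable, Iterator, Protocol

# Assumption: a page range is a (first_page, last_page) pair.
PageRange = tuple[int, int]


class ChunkerLike(Protocol):
    """Anything with a Docling-style chunk(dl_doc=...) method."""

    def chunk(self, dl_doc: Any) -> Iterator[Any]: ...


def get_chunks_sketch(
    chunker: ChunkerLike,
    document: Iterable[tuple[PageRange, Any]],
) -> list[tuple[PageRange, Any]]:
    """Chunk each converted part and label every chunk with its page range."""
    labeled = []
    for page_range, converted in document:
        for chunk in chunker.chunk(dl_doc=converted):
            labeled.append((page_range, chunk))
    return labeled


class WordPairChunker:
    """Toy stand-in for HybridChunker: emits chunks of two words."""

    def chunk(self, dl_doc: str) -> Iterator[str]:
        words = dl_doc.split()
        for i in range(0, len(words), 2):
            yield " ".join(words[i:i + 2])


pages = iter([
    ((1, 1), "trench one pottery sherds"),
    ((2, 3), "stratigraphy notes"),
])
result = get_chunks_sketch(WordPairChunker(), pages)
```

Each chunk keeps the page range of the part it came from, so a later retrieval step can point back to the right pages of the scanned PDF.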
chunk_to_ds(pairs: Iterator[tuple[tuple[InterventionId, Path], list[tuple[PageRange, BaseChunk]]]], chunker: HybridChunker) → PDFChunkDataset
Gather the lists of labeled chunks into one dataframe for the whole document batch.
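The gathering step amounts to flattening nested pairs into one row per chunk. The sketch below is an assumption-laden illustration: the column names are hypothetical, chunks are plain strings, page ranges are (first, last) tuples, and the result is a list of dicts rather than a PDFChunkDataset. It also leaves out the chunker argument, whose role is not described here.

```python
from pathlib import Path
from typing import Any, Iterable

# Assumption: a page range is a (first_page, last_page) pair.
PageRange = tuple[int, int]


def chunks_to_rows_sketch(
    pairs: Iterable[tuple[tuple[int, Path], list[tuple[PageRange, Any]]]],
) -> list[dict[str, Any]]:
    """Flatten ((id, path), labeled chunks) pairs into one row per chunk."""
    rows = []
    for (intervention_id, pdf_path), labeled_chunks in pairs:
        for index, (page_range, chunk) in enumerate(labeled_chunks):
            rows.append({
                # Hypothetical column names, chosen for readability.
                "id": intervention_id,
                "filename": Path(pdf_path).name,
                "chunk_index": index,
                "first_page": page_range[0],
                "last_page": page_range[1],
                "text": str(chunk),
            })
    return rows


batch = [
    (
        (101, Path("reports/site_a.pdf")),
        [((1, 1), "trench one"), ((2, 3), "stratigraphy notes")],
    ),
    (
        (102, Path("reports/site_b.pdf")),
        [((1, 2), "survey summary")],
    ),
]
rows = chunks_to_rows_sketch(batch)
```

A list of dicts in this shape can be passed directly to pandas.DataFrame(rows) when a dataframe is needed.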