archaeo_super_prompt.modeling.pdf_to_text.chunking

source module archaeo_super_prompt.modeling.pdf_to_text.chunking

Splitting of scanned documents into text chunks with layout metadata.

Functions

  • get_chunker Return a Docling chunker configured with the tokenizer of a given embedding model.

  • get_chunks Extract the chunks from all the pages of the document, each labeled with its page range.

  • chunk_to_ds Gather the labeled chunks of the whole document batch into a dataframe.

source get_chunker(embed_model_id: str, max_chunk_size: int)

Return a Docling chunker configured with the tokenizer of a given embedding model.

This tokenizer is fast even on the CPU, but it must be fetched from the Hugging Face repositories.
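As an illustration only, a chunker of this kind might be built as follows. This is a sketch, not the module's actual code: the name `get_chunker_sketch` is hypothetical, `max_chunk_size` is assumed to be a token budget, and the `HuggingFaceTokenizer` wrapper is assumed to be available (older Docling releases accept the model id directly in `HybridChunker`).

```python
def get_chunker_sketch(embed_model_id: str, max_chunk_size: int):
    """Build a Docling HybridChunker bound to an embedding model's tokenizer.

    Hypothetical helper: the real get_chunker may be implemented differently.
    """
    # Imports are deferred so the heavy dependencies load only when needed.
    from docling.chunking import HybridChunker
    from docling_core.transforms.chunker.tokenizer.huggingface import (
        HuggingFaceTokenizer,
    )
    from transformers import AutoTokenizer

    # Downloads the tokenizer files from Hugging Face, or reads the local cache.
    hf_tokenizer = AutoTokenizer.from_pretrained(embed_model_id)

    # Assumption: max_chunk_size is the maximum number of tokens per chunk.
    tokenizer = HuggingFaceTokenizer(
        tokenizer=hf_tokenizer,
        max_tokens=max_chunk_size,
    )
    return HybridChunker(tokenizer=tokenizer)
```

Sizing the chunks with the embedding model's own tokenizer keeps each chunk within what that model can embed without truncation.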

source get_chunks(chunker: HybridChunker, document: Iterator[tuple[PageRange, CorrectlyConvertedDocument]]) -> list[tuple[PageRange, BaseChunk]]

Extract the chunks from all the pages of the document, each labeled with its page range.

Parameters

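  • chunker : HybridChunker the chunker model, which splits the text according to both the layout and the tokenization

  • document : Iterator[tuple[PageRange, CorrectlyConvertedDocument]] an iterator of (page range, converted document) pairs, covering either the whole document at once or one converted document per page range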
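The labeling described above can be sketched without Docling at all. In this minimal example every name is an assumption made for illustration: `PageRange` is modeled as a `(first_page, last_page)` tuple, `ChunkerLike` stands in for `HybridChunker` (only a `chunk(dl_doc=...)` method is required), and `SentenceChunker` is a toy replacement that splits plain strings.

```python
from collections.abc import Iterable, Iterator
from typing import Any, Protocol

# Assumption: a page range is a (first_page, last_page) pair.
PageRange = tuple[int, int]


class ChunkerLike(Protocol):
    """Stand-in for HybridChunker: anything exposing chunk(dl_doc=...)."""

    def chunk(self, dl_doc: Any) -> Iterator[Any]: ...


def get_chunks_sketch(
    chunker: ChunkerLike,
    document: Iterable[tuple[PageRange, Any]],
) -> list[tuple[PageRange, Any]]:
    """Chunk each sub-document and tag every chunk with its page range."""
    labeled = []
    for page_range, sub_document in document:
        for chunk in chunker.chunk(dl_doc=sub_document):
            labeled.append((page_range, chunk))
    return labeled


class SentenceChunker:
    """Toy chunker for plain strings, used only to exercise the sketch."""

    def chunk(self, dl_doc: str) -> Iterator[str]:
        for sentence in dl_doc.split(". "):
            if sentence:
                yield sentence.strip()


pages = iter([((1, 1), "Trench A. Pottery found"), ((2, 3), "Stratigraphy notes")])
chunks = get_chunks_sketch(SentenceChunker(), pages)
```

Keeping the page range next to each chunk lets later stages point back to the pages a piece of text came from.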

source chunk_to_ds(pairs: Iterator[tuple[tuple[InterventionId, Path], list[tuple[PageRange, BaseChunk]]]], chunker: HybridChunker) -> PDFChunkDataset

Gather the labeled chunks of the whole document batch into a dataframe.
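The gathering step amounts to flattening a nested structure into one row per chunk. The sketch below is only a guess at that shape: the column names are hypothetical, `InterventionId` is assumed to be an integer, chunks are plain strings, and the `chunker` argument of the real function is left out because its role is not documented here.

```python
from collections.abc import Iterable
from pathlib import Path
from typing import Any

# Assumptions: integer intervention ids and (first_page, last_page) ranges.
InterventionId = int
PageRange = tuple[int, int]


def chunk_rows_sketch(
    pairs: Iterable[tuple[tuple[InterventionId, Path], list[tuple[PageRange, Any]]]],
) -> list[dict[str, Any]]:
    """Flatten the per-document chunk lists into one row per chunk."""
    rows = []
    for (intervention_id, pdf_path), labeled_chunks in pairs:
        for chunk_index, (page_range, chunk) in enumerate(labeled_chunks):
            first_page, last_page = page_range
            rows.append(
                {
                    "id": intervention_id,  # hypothetical column names
                    "filename": pdf_path.name,
                    "chunk_index": chunk_index,
                    "first_page": first_page,
                    "last_page": last_page,
                    "chunk": chunk,
                }
            )
    return rows


batch = [
    ((12, Path("reports/site_12.pdf")), [((1, 1), "Trench A"), ((2, 3), "Finds")]),
    ((40, Path("reports/site_40.pdf")), [((1, 2), "Survey")]),
]
rows = chunk_rows_sketch(batch)
```

A list of uniform dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)` to obtain a table with one column per key.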