archaeo_super_prompt.modeling.pdf_to_text

source package archaeo_super_prompt.modeling.pdf_to_text

PDF ingestion layer with a vision LLM and a chunking model.

Classes

  • VLLM_Preprocessing First PDF ingestion layer for the pipeline. Includes vision-LLM scanning and text chunking.

Modules

source module vllm_doc_chunk_mod

Splits scanned documents into text chunks with layout metadata.

source module vllm_scan_mod

Improved OCR using a vision LLM (VLLM).

source class VLLM_Preprocessing(vlm_provider: Literal['ollama', 'vllm', 'openai'], vlm_model_id: str, prompt: str, embedding_model_hf_id: str, incipit_only: bool, max_chunk_size: int = 512, allowed_timeout: int = 60 * 5)

Bases: BaseTransformer

First PDF ingestion layer for the pipeline. Includes vision-LLM scanning and text chunking.

This pipeline FunctionTransformer takes as input a batch of paths to the PDF files to be ingested. It reads the text with a vision LLM and outputs text chunks that respect the document layout, using a tokenization method that must be provided.

Provide the VLM model credentials and the other parameters.

Parameters

  • vlm_provider : Literal['ollama', 'vllm', 'openai'] the remote service to connect to

  • vlm_model_id : str the identifier of the vision LLM to be called on the remote server (for example, an Ollama model tag)

  • prompt : str a prompt string that contextualizes the OCR operation for the vision LLM

  • embedding_model_hf_id : str the Hugging Face Hub identifier of the embedding model, so that its tokenizer can be fetched

  • incipit_only : bool whether to scan only the first pages (the incipit) or the whole document

  • max_chunk_size : int the maximum size of each text chunk

  • allowed_timeout : int the maximum duration, in seconds, allowed for scanning the text of one PDF page

Environment variable

The VLM_HOST_URL environment variable must be set to the base URL of the VLM server, for example http://localhost:8005.
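
A minimal construction sketch. The model IDs, prompt, and paths below are illustrative placeholders, not defaults shipped with this package:

    import os

    # The VLM server must be reachable before the transformer runs
    # (any endpoint matching the chosen provider is assumed to work).
    os.environ["VLM_HOST_URL"] = "http://localhost:8005"

    from archaeo_super_prompt.modeling.pdf_to_text import VLLM_Preprocessing

    preprocessing = VLLM_Preprocessing(
        vlm_provider="ollama",
        vlm_model_id="llama3.2-vision",  # hypothetical model tag, for illustration
        prompt="Transcribe this scanned archival document faithfully.",
        embedding_model_hf_id="sentence-transformers/all-MiniLM-L6-v2",  # example HF ID
        incipit_only=True,               # scan only the first pages
        max_chunk_size=512,
        allowed_timeout=60 * 5,          # five minutes per PDF page
    )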

Methods

source method VLLM_Preprocessing.transform(X: PDFPathDataset) → PDFChunkDataset
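
A hedged usage sketch: the PDFPathDataset constructor is not documented on this page, so wrapping a list of file paths below is an assumption.

    from pathlib import Path

    # Hypothetical construction: we assume PDFPathDataset can wrap an
    # iterable of PDF file paths.
    dataset = PDFPathDataset([Path("data/raw/register_1764.pdf")])

    # `preprocessing` is the VLLM_Preprocessing instance built in the
    # sketch above; transform returns a PDFChunkDataset of layout-aware
    # text chunks.
    chunks = preprocessing.transform(dataset)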