archaeo_super_prompt.modeling.pdf_to_text
source package archaeo_super_prompt.modeling.pdf_to_text
PDF ingestion layer with a vision LLM and a chunking model.
Classes
- VLLM_Preprocessing — First PDF ingestion layer for the pipeline. Includes a vision-LLM scan and text chunking.
Modules
- vllm_doc_chunk_mod — Scanned document splitting into text chunks with layout metadata.
- vllm_scan_mod — Improved OCR model based on a vision LLM.
source module vllm_doc_chunk_mod
Scanned document splitting into text chunks with layout metadata.
source module vllm_scan_mod
Improved OCR model based on a vision LLM.
source class VLLM_Preprocessing(vlm_provider: Literal['ollama', 'vllm', 'openai'], vlm_model_id: str, prompt: str, embedding_model_hf_id: str, incipit_only: bool, max_chunk_size: int = 512, allowed_timeout: int = 60 * 5)
Bases: BaseTransformer
First PDF ingestion layer for the pipeline. Includes a vision-LLM scan and text chunking.
This pipeline FunctionTransformer takes as direct input a batch of paths to PDF files to be ingested. It reads the text with a vision LLM and outputs layout-aware text chunks, using the tokenization method of the provided embedding model.
Provide the VLM model credentials and the other parameters.
Parameters
- vlm_provider : Literal['ollama', 'vllm', 'openai'] — the remote service to connect to
- vlm_model_id : str — the identifier of the vision LLM to call on the remote server (e.g. an Ollama server)
- prompt : str — a string to contextualize the OCR operation of the vision LLM
- embedding_model_hf_id : str — the identifier of the embedding model on the HuggingFace API, so that its tokenizer can be fetched
- incipit_only : bool — whether only the first pages are scanned instead of the whole document
- max_chunk_size : int — the maximum size of each text chunk
- allowed_timeout : int — the maximum duration allowed for scanning the text of one PDF page
Environment variable
The VLM_HOST_URL environment variable must be set, e.g. VLM_HOST_URL=http://localhost:8005
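Below is a minimal setup sketch, assuming the class is importable from this package; the model identifiers, prompt, and host URL are illustrative placeholders to adapt to your own deployment.

import os

# Hypothetical import path (assumed): the class documented on this page
from archaeo_super_prompt.modeling.pdf_to_text import VLLM_Preprocessing

# The transformer reads the VLM endpoint from this environment variable
os.environ["VLM_HOST_URL"] = "http://localhost:8005"

# Illustrative parameter values, not prescribed by the package
preprocessing = VLLM_Preprocessing(
    vlm_provider="ollama",
    vlm_model_id="llama3.2-vision",  # any vision LLM served by the chosen provider
    prompt="Transcribe this scanned archival record as faithfully as possible.",
    embedding_model_hf_id="sentence-transformers/all-MiniLM-L6-v2",
    incipit_only=False,      # scan the whole document, not only the first pages
    max_chunk_size=512,      # maximum size of each text chunk
    allowed_timeout=60 * 5,  # maximum time allowed to scan a single PDF page
)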
Methods
source method VLLM_Preprocessing.transform(X: PDFPathDataset) → PDFChunkDataset
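Given this signature, transform maps a batch of PDF paths to a dataset of text chunks. The sketch below is a usage illustration only: the import location of PDFPathDataset and the way it is constructed from file paths are assumptions.

from pathlib import Path

# Assumed import location for the dataset type (hypothetical)
from archaeo_super_prompt.modeling.pdf_to_text import PDFPathDataset

# Hypothetical constructor: a dataset wrapping a batch of PDF file paths
pdf_batch = PDFPathDataset([Path("data/raw/act_001.pdf"), Path("data/raw/act_002.pdf")])

# Returns a PDFChunkDataset of layout-aware text chunks ready for embedding
chunks = preprocessing.transform(pdf_batch)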