archaeo_super_prompt.modeling.pdf_to_text
source package archaeo_super_prompt.modeling.pdf_to_text
PDF ingestion layer with a vision LLM and a chunking model.
Classes
- VLLM_Preprocessing — First PDF ingestion layer for the pipeline. Includes a vision-LLM scan and text chunking.
Modules
- vllm_doc_chunk_mod — Scanned document splitting into text chunks with layout metadata.
- vllm_scan_mod — Improved OCR model based on a vision LLM.
source module vllm_doc_chunk_mod
Scanned document splitting into text chunks with layout metadata.
source module vllm_scan_mod
Improved OCR model based on a vision LLM.
source class VLLM_Preprocessing(vlm_provider: Literal['ollama', 'vllm', 'openai'], vlm_model_id: str, prompt: str, embedding_model_hf_id: str, incipit_only: bool, max_chunk_size: int = 512, allowed_timeout: int = 60 * 5)
Bases: BaseTransformer
First PDF ingestion layer for the pipeline. Includes a vision-LLM scan and text chunking.
This pipeline FunctionTransformer takes as direct input a batch of paths to PDF files to be ingested. It reads the text with a vision LLM and outputs layout-aware text chunks, using the tokenization method of the provided embedding model.
Provide the VLM model credentials and the other parameters.
Parameters
- vlm_provider : Literal['ollama', 'vllm', 'openai'] — the remote service to connect to
- vlm_model_id : str — the identifier of the vision LLM to call on the remote server (e.g. an Ollama server)
- prompt : str — a string to contextualize the OCR operation of the vision LLM
- embedding_model_hf_id : str — the identifier of the embedding model on the HuggingFace API, so that its tokenizer can be fetched
- incipit_only : bool — whether only the first pages are scanned instead of the whole document
- max_chunk_size : int — the maximum size of each text chunk
- allowed_timeout : int — the maximum duration allowed for scanning the text of one PDF page
Environment variable
The VLM_HOST_URL environment variable must be set, e.g. VLM_HOST_URL=http://localhost:8005
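Below is a minimal setup sketch, assuming the class is importable from this package; the model identifiers, prompt, and host URL are illustrative placeholders to adapt to your own deployment.

import os

# Hypothetical import path (assumed): the class documented on this page
from archaeo_super_prompt.modeling.pdf_to_text import VLLM_Preprocessing

# The transformer reads the VLM endpoint from this environment variable
os.environ["VLM_HOST_URL"] = "http://localhost:8005"

# Illustrative parameter values, not prescribed by the package
preprocessing = VLLM_Preprocessing(
    vlm_provider="ollama",
    vlm_model_id="llama3.2-vision",  # any vision LLM served by the chosen provider
    prompt="Transcribe this scanned archival record as faithfully as possible.",
    embedding_model_hf_id="sentence-transformers/all-MiniLM-L6-v2",
    incipit_only=False,      # scan the whole document, not only the first pages
    max_chunk_size=512,      # maximum size of each text chunk
    allowed_timeout=60 * 5,  # maximum time allowed to scan a single PDF page
)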
Methods
source method VLLM_Preprocessing.transform(X: PDFPathDataset) → PDFChunkDataset
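Given this signature, transform maps a batch of PDF paths to a dataset of text chunks. The sketch below is a usage illustration only: the import location of PDFPathDataset and the way it is constructed from file paths are assumptions.

from pathlib import Path

# Assumed import location for the dataset type (hypothetical)
from archaeo_super_prompt.modeling.pdf_to_text import PDFPathDataset

# Hypothetical constructor: a dataset wrapping a batch of PDF file paths
pdf_batch = PDFPathDataset([Path("data/raw/act_001.pdf"), Path("data/raw/act_002.pdf")])

# Returns a PDFChunkDataset of layout-aware text chunks ready for embedding
chunks = preprocessing.transform(pdf_batch)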