archaeo_super_prompt.modeling.pdf_to_text.stream_ocr_manual

source module archaeo_super_prompt.modeling.pdf_to_text.stream_ocr_manual

OCR of PDF documents with a vision-language model (VLM), served either by Ollama or by a vLLM server.

Functions

ollama_vlm_options — Return a configuration for a VLM served by Ollama.
vllm_vlm_options — Return a configuration for a VLM served by a vLLM server (that is, through an OpenAI-compatible API).
converter — Return a Docling PDF converter object built from an Ollama VLM configuration.
process_documents — Convert the documents into text with Docling, using the given converter.

source ollama_vlm_options(model: str, prompt: str, response_format: Literal[ResponseFormat.HTML, ResponseFormat.MARKDOWN] = ResponseFormat.MARKDOWN, allowed_timeout: int = 60 * 3)

Return a configuration for a VLM served by Ollama.

Parameters

model : str — the string identifier of the VLM in Ollama
prompt : str — the prompt sent to the VLM to give context for its OCR task
response_format : Literal[ResponseFormat.HTML, ResponseFormat.MARKDOWN] — a response format supported by the VLM
allowed_timeout : int — the time allowed for processing one page of a document, in seconds (defaults to 180, i.e. 3 minutes)

source vllm_vlm_options(model: str, prompt: str, response_format: Literal[ResponseFormat.HTML, ResponseFormat.MARKDOWN] = ResponseFormat.MARKDOWN, allowed_timeout: int = 60 * 3)

Return a configuration for a VLM served by a vLLM server (that is, through an OpenAI-compatible API).

Parameters

model : str — the string identifier of the VLM on the vLLM server
prompt : str — the prompt sent to the VLM to give context for its OCR task
response_format : Literal[ResponseFormat.HTML, ResponseFormat.MARKDOWN] — a response format supported by the VLM
allowed_timeout : int — the time allowed for processing one page of a document, in seconds (defaults to 180, i.e. 3 minutes)
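The two option builders share one signature, so the resulting configurations plausibly differ mainly in the endpoint they target. The sketch below illustrates that idea with stand-in types only: `VlmApiConfig`, `sketch_vlm_options`, the model names, and the two URLs (the usual local defaults for Ollama and vLLM) are assumptions made for illustration, not the module's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class ResponseFormat(str, Enum):
    """Stand-in for the ResponseFormat enum (only the two accepted values)."""
    MARKDOWN = "markdown"
    HTML = "html"


@dataclass(frozen=True)
class VlmApiConfig:
    """Hypothetical stand-in for the ApiVlmOptions object the builders return."""
    url: str
    params: dict
    prompt: str
    timeout: int  # seconds allowed per page
    response_format: ResponseFormat


# Assumed default local endpoints; a real deployment may use other hosts or ports.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def sketch_vlm_options(
    url: str,
    model: str,
    prompt: str,
    response_format: ResponseFormat = ResponseFormat.MARKDOWN,
    allowed_timeout: int = 60 * 3,
) -> VlmApiConfig:
    """Bundle the documented parameters into one configuration object."""
    return VlmApiConfig(
        url=url,
        params={"model": model},
        prompt=prompt,
        timeout=allowed_timeout,
        response_format=response_format,
    )


# Same parameters, two different backends (model names are placeholders).
ollama_cfg = sketch_vlm_options(
    OLLAMA_URL, "example-vlm:latest", "Transcribe this page to Markdown."
)
vllm_cfg = sketch_vlm_options(
    VLLM_URL,
    "example-org/example-vlm",
    "Transcribe this page to HTML.",
    response_format=ResponseFormat.HTML,
    allowed_timeout=300,
)
```

Keeping the timeout per page rather than per document means a long PDF is not penalized for its length; only an individual slow page can exceed the limit.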

source converter(ollama_vlm_options: ApiVlmOptions)

Return a Docling PDF converter object built from an Ollama VLM configuration.

source process_documents(file_inputs: list[tuple[InterventionId, Path]], documentConvertor: DocumentConverter, incipit_only=True) → Iterator[tuple[tuple[InterventionId, Path], Iterator[tuple[PageRange, CorrectlyConvertedDocument]]]]

Convert the documents into text with Docling, using the given converter.

Returns

Iterator[tuple[tuple[InterventionId, Path], Iterator[tuple[PageRange, CorrectlyConvertedDocument]]]] — For each file, either a sequence containing a single Docling document, if the whole document could be processed at once, or a sequence of nullable Docling documents, one per page. A null value marks a page that could not be read.
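The nested iterator returned above can be awkward to read from the type alone, so here is a minimal sketch of how a caller might consume it. Everything in it is a stand-in: `fake_process_documents` only mimics the documented shape, `InterventionId` is assumed to be an integer, `PageRange` is assumed to be a `(first_page, last_page)` tuple, and plain strings take the place of Docling documents.

```python
from pathlib import Path
from typing import Iterator, Optional

InterventionId = int          # stand-in for the project's identifier type
PageRange = tuple[int, int]   # assumed shape: (first_page, last_page)
ConvertedDoc = Optional[str]  # stand-in for a Docling document; None = failed page

FileInput = tuple[InterventionId, Path]
Chunks = Iterator[tuple[PageRange, ConvertedDoc]]


def fake_process_documents(
    file_inputs: list[FileInput],
) -> Iterator[tuple[FileInput, Chunks]]:
    """Hypothetical stand-in that mimics the documented nested-iterator shape."""

    def chunks(path: Path) -> Chunks:
        if path.name == "clean.pdf":
            # Whole document converted at once: a single chunk.
            yield (1, 3), "full text of pages 1-3"
        else:
            # Page-by-page fallback, with one unreadable page.
            yield (1, 1), "page 1 text"
            yield (2, 2), None
            yield (3, 3), "page 3 text"

    for item in file_inputs:
        yield item, chunks(item[1])


def collect_text(results: Iterator[tuple[FileInput, Chunks]]):
    """Join the readable chunks per intervention and record failed page ranges."""
    texts: dict[InterventionId, str] = {}
    failed: dict[InterventionId, list[PageRange]] = {}
    for (intervention_id, _path), chunks in results:
        parts = []
        for page_range, doc in chunks:
            if doc is None:
                failed.setdefault(intervention_id, []).append(page_range)
            else:
                parts.append(doc)
        texts[intervention_id] = "\n".join(parts)
    return texts, failed


inputs = [(101, Path("clean.pdf")), (102, Path("noisy.pdf"))]
texts, failed = collect_text(fake_process_documents(inputs))
```

Because both levels are iterators, each file's chunks should be consumed before moving on to the next file; materializing them into a list first is only worthwhile when the same results are needed more than once.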