archaeo_super_prompt.modeling.pdf_to_text.stream_ocr_manual

source module archaeo_super_prompt.modeling.pdf_to_text.stream_ocr_manual

Better OCR model with VLLM.

Functions

  • ollama_vlm_options Return a configuration for a VLM served through Ollama.

  • vllm_vlm_options Return a configuration for a VLM served by a vLLM server (that is, through an OpenAI-compatible API).

  • converter Return a Docling PDF converter object built from an Ollama VLM configuration.

  • process_documents Convert the documents into text with Docling, using the given converter.

source ollama_vlm_options(model: str, prompt: str, response_format: Literal[ResponseFormat.HTML, ResponseFormat.MARKDOWN] = ResponseFormat.MARKDOWN, allowed_timeout: int = 60 * 3)

Return a configuration for a VLM served through Ollama.

Parameters

  • model : str the string identifier of the VLM in Ollama

  • prompt : str a prompt string sent to the VLM to give context for its OCR task

  • response_format : Literal[ResponseFormat.HTML, ResponseFormat.MARKDOWN] a supported response format for the VLM

  • allowed_timeout : int the time allowed for processing one page of a document, in seconds (defaults to 3 minutes)

source vllm_vlm_options(model: str, prompt: str, response_format: Literal[ResponseFormat.HTML, ResponseFormat.MARKDOWN] = ResponseFormat.MARKDOWN, allowed_timeout: int = 60 * 3)

Return a configuration for a VLM served by a vLLM server (that is, through an OpenAI-compatible API).

Parameters

  • model : str the string identifier of the VLM on the vLLM server

  • prompt : str a prompt string sent to the VLM to give context for its OCR task

  • response_format : Literal[ResponseFormat.HTML, ResponseFormat.MARKDOWN] a supported response format for the VLM

  • allowed_timeout : int the time allowed for processing one page of a document, in seconds (defaults to 3 minutes)
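
As a rough illustration of how the two option builders above might differ, the sketch below uses a stand-in dataclass (`VlmEndpointOptions`, hypothetical, not Docling's `ApiVlmOptions`) and a hypothetical helper `sketch_options`. It assumes local servers on their usual default ports (11434 for Ollama, 8000 for vLLM), both reached through the OpenAI-compatible `/v1/chat/completions` route. The URLs and model names are placeholders, not values taken from this module.

```python
from dataclasses import dataclass
from enum import Enum


class ResponseFormat(str, Enum):
    # Stand-in for Docling's ResponseFormat, limited to the two accepted values.
    MARKDOWN = "markdown"
    HTML = "html"


@dataclass(frozen=True)
class VlmEndpointOptions:
    # Hypothetical stand-in for the configuration object the real functions return.
    url: str
    params: dict
    prompt: str
    response_format: ResponseFormat
    timeout: int  # seconds allowed for processing one page


# Assumed default local endpoints (not read from the module's source).
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def sketch_options(
    url: str,
    model: str,
    prompt: str,
    response_format: ResponseFormat = ResponseFormat.MARKDOWN,
    allowed_timeout: int = 60 * 3,
) -> VlmEndpointOptions:
    """Bundle the documented parameters with the endpoint they target."""
    return VlmEndpointOptions(
        url=url,
        params={"model": model},
        prompt=prompt,
        response_format=response_format,
        timeout=allowed_timeout,
    )


# Placeholder model names, for illustration only.
ollama_opts = sketch_options(OLLAMA_URL, "my-ollama-vlm", "Transcribe this page.")
vllm_opts = sketch_options(
    VLLM_URL, "my-org/my-vlm", "Transcribe this page.", ResponseFormat.HTML, 60
)
```

In this sketch only the endpoint URL separates the two configurations, which is consistent with both functions sharing the same signature; whether the real implementations differ further is not stated here.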

source converter(ollama_vlm_options: ApiVlmOptions)

Return a Docling PDF converter object built from an Ollama VLM configuration.

source process_documents(file_inputs: list[tuple[InterventionId, Path]], documentConvertor: DocumentConverter, incipit_only=True) -> Iterator[tuple[tuple[InterventionId, Path], Iterator[tuple[PageRange, CorrectlyConvertedDocument]]]]

Convert the documents into text with Docling, using the given converter.

Returns

  • Iterator[tuple[tuple[InterventionId, Path], Iterator[tuple[PageRange, CorrectlyConvertedDocument]]]] For each file, either a list containing a single Docling document, if the whole document could be processed at once, or a list of nullable Docling documents, one per page. A null value is returned for each page whose reading failed.
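
The nested return shape can be easier to follow with a small consumer. The sketch below does not call the real `process_documents`; it uses a fake generator (`fake_process_documents`, hypothetical) that yields data in the documented shape, with stand-in types: `InterventionId` as an `int`, `PageRange` as an assumed `(first_page, last_page)` tuple, and plain strings in place of Docling documents.

```python
from pathlib import Path

# Stand-ins for the project's types (assumptions for illustration only).
InterventionId = int
PageRange = tuple[int, int]  # assumed to mean (first_page, last_page)


def fake_process_documents(file_inputs):
    """Yield results in the same nested shape as process_documents."""

    def chunks(path):
        if path.name == "whole.pdf":
            # The whole document was converted at once.
            yield (1, 3), "text of pages 1-3"
        else:
            # Page-by-page conversion; None marks a page that failed.
            yield (1, 1), "text of page 1"
            yield (2, 2), None

    for key in file_inputs:
        yield key, chunks(key[1])


def collect_text(results):
    """Join the converted chunks per intervention and record failed ranges."""
    summary = {}
    for (intervention_id, _path), chunks in results:
        texts, failed = [], []
        for page_range, doc in chunks:
            if doc is None:
                failed.append(page_range)
            else:
                texts.append(doc)
        summary[intervention_id] = ("\n".join(texts), failed)
    return summary


inputs = [(1, Path("whole.pdf")), (2, Path("paged.pdf"))]
summary = collect_text(fake_process_documents(inputs))
```

Because both levels are iterators, the inner chunks are only produced when they are iterated, so a consumer written this way handles one file at a time instead of holding every document in memory.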