archaeo_super_prompt.modeling.pdf_to_text.cache_docling_documents

source module archaeo_super_prompt.modeling.pdf_to_text.cache_docling_documents

Manage the manual cache of the Docling documents extracted from PDF files.

A Docling document carries all the information a VLLM can extract from a PDF document. This module defines how to cache that output so that it does not have to be recomputed with another VLLM call.

Classes

ArtificialPDFData — Data describing a buffered PDF document.

Functions

get_yaml_file_for_pdf — Return a yaml file in which the extracted docling document can be cached.
cache_docling_doc_on_disk — Save the docling document in the given yaml file.
load_docling_doc_from_cache — Reload a cached docling document from its cached yaml file.

source class ArtificialPDFData()

Bases: NamedTuple

Data describing a buffered PDF document.

source get_yaml_file_for_pdf(source_pdf_path: Path | ArtificialPDFData) → Path

Return a yaml file in which the extracted docling document can be cached.

source cache_docling_doc_on_disk(docling_document: CorrectlyConvertedDocument | None, file_path: Path)

Save the docling document in the given yaml file.

If the scan failed (most often because of a timeout), None is passed in and an empty file is saved. The failure itself is thereby cached: the scan is not run again on later executions, on the assumption that it would fail again.
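
The empty-file convention described above can be illustrated with a minimal, standard-library-only sketch. It is not this module's implementation: `save_result` and `load_result` are hypothetical names, a plain `dict` stands in for `CorrectlyConvertedDocument`, and `json` is used as a stand-in for the YAML serialization.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_result(result: Optional[dict], file_path: Path) -> None:
    """Write the result, or an empty file when the conversion failed."""
    if result is None:
        # An empty file records "already tried, and it failed".
        file_path.write_text("", encoding="utf-8")
    else:
        file_path.write_text(json.dumps(result), encoding="utf-8")


def load_result(file_path: Path) -> Optional[dict]:
    """Read back what save_result wrote; an empty file maps to None."""
    text = file_path.read_text(encoding="utf-8")
    return json.loads(text) if text else None


with tempfile.TemporaryDirectory() as tmp:
    ok_file = Path(tmp) / "ok.yaml"
    failed_file = Path(tmp) / "failed.yaml"

    save_result({"text": "page 1"}, ok_file)  # successful conversion
    save_result(None, failed_file)            # failed conversion

    loaded_ok = load_result(ok_file)
    loaded_failed = load_result(failed_file)
    # The failure leaves a file behind, so it counts as a cache entry.
    failure_is_cached = failed_file.exists() and failed_file.stat().st_size == 0
```

With this convention, a missing file means "never attempted", while an empty file means "attempted and failed", so the two cases stay distinguishable on disk.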

source load_docling_doc_from_cache(file_path_in_cache: Path) → CorrectlyConvertedDocument | None

Reload a cached docling document from its cached yaml file.
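
How a caller might combine saving and reloading is sketched below, as an assumption about usage rather than a description of this package's code. `get_or_convert` and `failing_convert` are hypothetical names, and a raw string stands in for the serialized document. The point is that the expensive conversion runs only when no cache file exists, even if the cached outcome is a failure.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def get_or_convert(
    cache_file: Path, convert: Callable[[], Optional[str]]
) -> Optional[str]:
    """Return the cached result, calling `convert` only on a true cache miss."""
    if cache_file.exists():
        # Hit: either a stored result or an empty file recording a past failure.
        text = cache_file.read_text(encoding="utf-8")
        return text or None
    result = convert()
    # Store the outcome, using an empty file for a failed conversion.
    cache_file.write_text(result or "", encoding="utf-8")
    return result


calls: List[str] = []


def failing_convert() -> Optional[str]:
    """Stand-in for an expensive conversion that times out."""
    calls.append("called")
    return None


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "report.yaml"
    first = get_or_convert(cache_file, failing_convert)
    second = get_or_convert(cache_file, failing_convert)  # served from cache
```

In this sketch, deleting the cache file is the only way to force a new attempt after a cached failure.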