archaeo_super_prompt.types.pdfchunks

source module archaeo_super_prompt.types.pdfchunks

Abstract data type for handling a dataset of read pdfs.

Attributes

  • PDFChunk : NB: this row type is not normalized, which is not memory-efficient. This should not be an issue in our pipeline, as the datasets are not huge and the processing time will be negligible next to the LLM and embedding model inferences.

Classes

Functions

source class PDFChunkSetPerInterventionSchema()

Bases : DataFrameModel

source class PDFChunkDatasetSchema()

source class PDFChunkPerInterventionDataset(data: DataFrame[PDFChunkSetPerInterventionSchema])

DataFrame wrapper class used to customize how the dataset is displayed automatically by tracing tools such as MLflow.

Methods

source method PDFChunkPerInterventionDataset.getExtractedPdfContent() → PDFSources

Let dataset be a set of chunks from several pdf files related to a single intervention. Computes the batch of chunk sources from this dataset.

The dataset can be partial if a selection of chunks in each file has already been carried out.
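The computation described above can be sketched with the standard library only. This is a minimal illustration, not the module's implementation: the `ChunkRow` fields, the `extracted_sources` helper, and the shape of the result (file name mapped to sorted page numbers) are all assumptions, since the actual `PDFChunk` columns and the `PDFSources` type are not shown here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRow:
    """Hypothetical stand-in for one PDFChunk row (field names are assumptions)."""

    intervention_id: int
    filename: str
    page: int
    text: str


def extracted_sources(rows: list[ChunkRow]) -> dict[str, list[int]]:
    """Map each PDF file name to the sorted pages its chunks come from."""
    pages_by_file: dict[str, set[int]] = defaultdict(set)
    for row in rows:
        pages_by_file[row.filename].add(row.page)
    # Sort files and pages so the result does not depend on the input order.
    return {name: sorted(pages) for name, pages in sorted(pages_by_file.items())}


# Chunks from two files attached to the same (hypothetical) intervention.
rows = [
    ChunkRow(1, "report.pdf", 3, "Stratigraphy of trench A"),
    ChunkRow(1, "report.pdf", 1, "Summary"),
    ChunkRow(1, "annex.pdf", 2, "Finds list"),
]
sources = extracted_sources(rows)

# A partial dataset (chunks already filtered) only yields the surviving sources.
partial_sources = extracted_sources(rows[:1])
```

Because the sources are derived from whatever rows remain, a dataset that has already been filtered simply reports fewer files and pages.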

source method PDFChunkPerInterventionDataset.to_readable_context_string() → PDFChunkEnumeration

source composePdfChunkDataset(datasets: Generator[PDFChunkDataset] | Iterable[PDFChunkDataset]) → PDFChunkDataset

source buildPdfChunkDataset(chunks: list[PDFChunk]) → PDFChunkDataset
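The two signatures above suggest a build-then-merge workflow. The sketch below shows one plausible reading of it, assuming that composing datasets means concatenating their rows in order. The `Chunk` and `ChunkDataset` classes and both function names are hypothetical stand-ins for `PDFChunk`, `PDFChunkDataset`, `buildPdfChunkDataset`, and `composePdfChunkDataset`; the real types wrap DataFrames and may behave differently.

```python
from collections.abc import Iterable
from dataclasses import dataclass
from itertools import chain


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for PDFChunk (fields are assumptions)."""

    filename: str
    text: str


@dataclass(frozen=True)
class ChunkDataset:
    """Hypothetical stand-in for PDFChunkDataset: an immutable sequence of rows."""

    rows: tuple[Chunk, ...]


def build_dataset(chunks: list[Chunk]) -> ChunkDataset:
    """Wrap a list of chunks into a dataset."""
    return ChunkDataset(tuple(chunks))


def compose_datasets(datasets: Iterable[ChunkDataset]) -> ChunkDataset:
    """Concatenate datasets in order; a generator is accepted like any iterable."""
    return ChunkDataset(tuple(chain.from_iterable(d.rows for d in datasets)))


first = build_dataset([Chunk("report.pdf", "Summary")])
second = build_dataset([Chunk("annex.pdf", "Finds list")])

# Passing a generator expression exercises the lazy-input case.
merged = compose_datasets(d for d in (first, second))
```

Accepting any iterable keeps the composing function usable both with lists built up front and with datasets produced lazily, one file at a time.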