archaeo_super_prompt.types.pdfchunks

Abstract data type for handling a dataset of PDF files that have already been read.

Attributes

PDFChunk : Note that this row type is not normalized for memory-efficient processing. This is unlikely to be an issue in our pipeline, as the datasets are not huge and the processing time will be negligible next to the LLM and embedding model inferences.
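To make the note on unnormalized rows concrete, here is a minimal sketch with a hypothetical row layout. The class `ChunkRow` and its field names (`intervention_id`, `filename`, `chunk_index`, `text`) are assumptions for illustration only, not the actual fields of `PDFChunk`. It shows file-level values repeated on every chunk row, which costs some memory but keeps each row self-describing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRow:
    """Hypothetical stand-in for a PDFChunk row (field names are assumed)."""

    intervention_id: int  # repeated on every row of the same intervention
    filename: str         # repeated on every chunk of the same file
    chunk_index: int      # position of the chunk inside its file
    text: str             # chunk content


rows = [
    ChunkRow(1, "report.pdf", 0, "Introduction"),
    ChunkRow(1, "report.pdf", 1, "Stratigraphy"),
    ChunkRow(1, "annex.pdf", 0, "Finds list"),
]

# A normalized layout would store each filename once; here it is duplicated
# on every chunk row that belongs to the same file.
distinct_files = {row.filename for row in rows}
repeated_values = len(rows) - len(distinct_files)
```

With only a handful of files per intervention, the duplicated values stay small compared with the chunk text itself.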

Classes

PDFChunkSetPerInterventionSchema
PDFChunkDatasetSchema
PDFChunkPerInterventionDataset — DataFrame class wrapper to customize the auto-displaying from tracing tools such as mlflow.

Functions

composePdfChunkDataset
buildPdfChunkDataset

Bases : DataFrameModel

DataFrame class wrapper to customize the auto-displaying from tracing tools such as mlflow.

Methods

getExtractedPdfContent — Let dataset be a set of chunks from several pdf files related to a single intervention. Computes the batch of chunk sources from this dataset.
to_readable_context_string

Let dataset be a set of chunks from several pdf files related to a single intervention. Computes the batch of chunk sources from this dataset.

The dataset can be partial if a selection of chunks in each file has already been carried out.

composePdfChunkDataset(datasets: Generator[PDFChunkDataset] | Iterable[PDFChunkDataset]) → PDFChunkDataset

buildPdfChunkDataset(chunks: list[PDFChunk]) → PDFChunkDataset
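The two signatures suggest a build step that turns a list of chunks into a dataset and a compose step that merges several datasets into one. The following is a minimal sketch of that pattern, using plain Python lists as a stand-in for `PDFChunkDataset`. The names `Chunk`, `Dataset`, `build_dataset` and `compose_datasets` are hypothetical, and this is not the library's implementation.

```python
from itertools import chain
from typing import Iterable, TypedDict


class Chunk(TypedDict):
    """Hypothetical chunk record; the real PDFChunk fields may differ."""

    filename: str
    text: str


# Stand-in for PDFChunkDataset: an ordered list of chunk records.
Dataset = list[Chunk]


def build_dataset(chunks: list[Chunk]) -> Dataset:
    """Build a dataset from a list of chunks (copy, so the input is not aliased)."""
    return list(chunks)


def compose_datasets(datasets: Iterable[Dataset]) -> Dataset:
    """Merge several datasets into one, preserving order.

    Accepts any iterable, including a generator, and consumes it once.
    """
    return list(chain.from_iterable(datasets))


first = build_dataset([{"filename": "report.pdf", "text": "Introduction"}])
second = build_dataset(
    [
        {"filename": "annex.pdf", "text": "Finds list"},
        {"filename": "annex.pdf", "text": "Plates"},
    ]
)

# A generator works as well as a list of datasets.
merged = compose_datasets(d for d in (first, second))
```

Accepting any iterable lets the caller stream datasets one at a time instead of materializing them all first.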