archaeo_super_prompt.types.pdfchunks
source module archaeo_super_prompt.types.pdfchunks
Abstract data type for handling a dataset of read pdfs.
Attributes
-
PDFChunk : Note that this row type is not normalized for memory-efficient processing. This should not be an issue in our pipeline, as the datasets are not huge and the processing time will be negligible next to the LLM and embedding model inferences.
Classes
-
PDFChunkPerInterventionDataset — DataFrame class wrapper to customize the auto-displaying from tracing tools such as mlflow.
Functions
source class PDFChunkSetPerInterventionSchema()
Bases : DataFrameModel
source class PDFChunkDatasetSchema()
Bases : PDFChunkSetPerInterventionSchema
source class PDFChunkPerInterventionDataset(data: DataFrame[PDFChunkSetPerInterventionSchema])
DataFrame class wrapper to customize the auto-displaying from tracing tools such as mlflow.
Methods
-
getExtractedPdfContent — Let dataset be a set of chunks from several pdf files related to a single intervention. Computes the batch of chunk sources from this dataset.
source method PDFChunkPerInterventionDataset.getExtractedPdfContent() → PDFSources
Let dataset be a set of chunks from several pdf files related to a single intervention. Computes the batch of chunk sources from this dataset.
The dataset can be partial if a selection of chunks in each file has already been carried out.
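The grouping described above can be sketched in plain Python. This is only an illustration under stated assumptions: the `Chunk` fields (`filename`, `chunk_index`, `text`) and the `extract_sources` helper are hypothetical, and the actual `PDFSources` type and dataset columns are not documented on this page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk row; the real PDFChunk fields may differ."""
    filename: str
    chunk_index: int
    text: str


def extract_sources(chunks: list[Chunk]) -> dict[str, list[str]]:
    """Group chunk texts by source file, ordered by chunk index."""
    by_file: dict[str, list[Chunk]] = defaultdict(list)
    for chunk in chunks:
        by_file[chunk.filename].append(chunk)
    return {
        name: [c.text for c in sorted(group, key=lambda c: c.chunk_index)]
        for name, group in by_file.items()
    }


# A partial selection: only some chunks of each file are present.
selection = [
    Chunk("report_a.pdf", 3, "Stratigraphy notes"),
    Chunk("report_b.pdf", 0, "Site overview"),
    Chunk("report_a.pdf", 1, "Excavation dates"),
]
sources = extract_sources(selection)
```

Because the grouping only looks at the rows that are present, it behaves the same on a full dataset and on a partial selection.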
source method PDFChunkPerInterventionDataset.to_readable_context_string() → PDFChunkEnumeration
source composePdfChunkDataset(datasets: Generator[PDFChunkDataset] | Iterable[PDFChunkDataset]) → PDFChunkDataset
source buildPdfChunkDataset(chunks: list[PDFChunk]) → PDFChunkDataset
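As a rough sketch of how the two functions above might fit together, the example below models a dataset as a plain list of row dictionaries. The names `build_dataset` and `compose_datasets`, the row fields, and the choice of concatenation as the composition rule are all assumptions made for illustration; the real functions operate on the DataFrame-based `PDFChunkDataset` type.

```python
from collections.abc import Iterable

# Hypothetical stand-in: a dataset is modelled as a list of row dicts.
Row = dict[str, object]


def build_dataset(chunks: list[Row]) -> list[Row]:
    """Copy chunk rows into a new dataset (assumed behaviour)."""
    return [dict(chunk) for chunk in chunks]


def compose_datasets(datasets: Iterable[list[Row]]) -> list[Row]:
    """Concatenate datasets in order; accepts lists and generators alike."""
    combined: list[Row] = []
    for dataset in datasets:
        combined.extend(dataset)
    return combined


first = build_dataset([{"filename": "a.pdf", "text": "intro"}])
second = build_dataset([{"filename": "b.pdf", "text": "summary"}])

# A generator expression works as well as a list of datasets.
merged = compose_datasets(ds for ds in (first, second))
```

Accepting any iterable, as the documented signature does, lets callers stream per-intervention datasets without materializing them all in a list first.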