archaeo_super_prompt.modeling.struct_extract.field_extractor

source module archaeo_super_prompt.modeling.struct_extract.field_extractor

Generic pipeline Transformer for extracting one field from featured chunks.

This transformer is a classifier that is both scorable and trainable.

Classes

  • FieldExtractor Abstract class for extracting one field from featured chunks.

Functions

source to_prediction(output: BaseModel)dspy.Prediction

Call this function on the pydantic-typed output to build the value returned by forward.

source prediction_to_output[DSPyOutput](output_constructor: type[DSPyOutput], pred: dspy.Prediction)DSPyOutput

Inverse of to_prediction.

Expects the prediction to have been built by to_prediction.
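
The round trip described by these two functions can be sketched as follows. This is a minimal illustration and not the module's implementation: FakePrediction stands in for dspy.Prediction, the dataclass DateOutput stands in for a pydantic output model, and both helper names are hypothetical. The only grounded idea is that the output is rebuilt from its constructor by dictionary expansion.

```python
from dataclasses import asdict, dataclass


@dataclass
class FakePrediction:
    """Hypothetical stand-in for dspy.Prediction: just holds named fields."""

    fields: dict


@dataclass
class DateOutput:
    """Hypothetical stand-in for a pydantic output model."""

    year: int
    precision: str


def to_prediction_sketch(output) -> FakePrediction:
    # Flatten the typed output into plain named fields.
    return FakePrediction(fields=asdict(output))


def prediction_to_output_sketch(output_constructor, pred: FakePrediction):
    # Rebuild the typed output by expanding the fields into the constructor.
    return output_constructor(**pred.fields)


original = DateOutput(year=1998, precision="year")
restored = prediction_to_output_sketch(DateOutput, to_prediction_sketch(original))
```

The rebuild only works if the prediction carries exactly the fields the constructor expects, which is why the prediction should come from the matching conversion function.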

source class FieldExtractor[DSPyInput: BaseModel, DSPyOutput: BaseModel, InputDataFrameWithKnowledge: extract_input_type.BaseInputForExtraction, InputDataFrameWithKnowledgeRowSchema: extract_input_type.BaseInputForExtractionRowSchema, DFOutput: BasePerInterventionFeatureSchema](llm_model_provider: LLMProvider, llm_model_id: str, llm_temperature: float, model: dspy.Module, example: tuple[DSPyInput, DSPyOutput], output_constructor: type[DSPyOutput])

Bases : DetailedEvaluatorMixin[DataFrame[InputDataFrameWithKnowledge], MagohDataset, DataFrame[ResultSchema]], ABC

Abstract class for extracting one field from featured chunks.

Initialize the abstract class with the custom dspy module.

Genericity

Because Python's type system cannot express all of these constraints, the genericity constraints are stated explicitly here:

  • DSPyInput is a subtype of BaseModel whose fields carry the semantics used by the DSPy model as input in its forward method.

  • DSPyOutput is a subtype of BaseModel.

  • DFOutput is a subtype of pandera.pandas.DataFrameModel.

Parameters

  • llm_model_provider : LLMProvider the service from which the llm must be fetched

  • llm_model_id : str the dspy chat lm to be used for the extraction

  • llm_temperature : float the temperature of the llm during the prompts of this model

  • model : dspy.Module the dspy module which will be used for the training and the inference

  • example : tuple[DSPyInput, DSPyOutput] a dspy input-output pair used to type-check the generic parameters at runtime and to log the model in mlflow

  • output_constructor : type[DSPyOutput] the type of the output model, used to build it generically from dictionary expansion

Environment variables

Depending on the LLM provider, one of the following environment variables is required:

  • OPENAI_API_KEY

  • OLLAMA_SERVER_BASE_URL (defaults to http://localhost:11434)

  • VLLM_SERVER_BASE_URL (defaults to http://localhost:8006/v1)
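
A minimal sketch of how these variables and their defaults might be resolved is shown below. The provider names ("openai", "ollama", "vllm") and the function resolve_provider_setting are assumptions made for illustration; the actual LLMProvider values and lookup logic are not shown in this page.

```python
import os

# Defaults quoted from the documentation above.
DEFAULT_BASE_URLS = {
    "OLLAMA_SERVER_BASE_URL": "http://localhost:11434",
    "VLLM_SERVER_BASE_URL": "http://localhost:8006/v1",
}

# Hypothetical provider names; the real LLMProvider values may differ.
PROVIDER_ENV_VAR = {
    "openai": "OPENAI_API_KEY",
    "ollama": "OLLAMA_SERVER_BASE_URL",
    "vllm": "VLLM_SERVER_BASE_URL",
}


def resolve_provider_setting(provider, env=None):
    """Return (variable name, value) for the given provider."""
    env = os.environ if env is None else env
    var = PROVIDER_ENV_VAR[provider]
    # Fall back to the documented default when one exists.
    value = env.get(var, DEFAULT_BASE_URLS.get(var))
    if value is None:
        # OPENAI_API_KEY has no default, so it must be set explicitly.
        raise RuntimeError(f"{var} must be set for provider {provider!r}")
    return var, value
```

Passing the environment as a plain mapping keeps the lookup easy to test without modifying the real process environment.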

Attributes

  • signature_example Return an example of input/output dict pair for the dspy model.

  • lm Return the llm model.

Methods

  • fit Optimize the dspy model according to the given dataset.

  • predict Generic transform operation.

  • filter_training_dataset Among the given set of intervention records, select only those with suitable answers for a training or an evaluation.

  • score Run a local evaluation of the dspy model over the given X dataset.

  • score_and_transform

  • field_to_be_extracted A human label/description of the field related to the Extractor.

source method FieldExtractor.fit(X: DataFrame[InputDataFrameWithKnowledge], y: MagohDataset, *, compiled_dspy_model_path: Path | None = None, skip_optimization=False, **kwargs)

Optimize the dspy model according to the given dataset.

Parameters

  • X : DataFrame[InputDataFrameWithKnowledge] the input dataframe with the required fields for the FieldExtractor

  • y : MagohDataset the Magoh training dataset

  • compiled_dspy_model_path : Path | None if given, a path to an already optimized dspy model; this prompt model is then used directly, without re-optimizing the program

  • skip_optimization if set to True, the model is fitted with the unoptimized dspy program

  • kwargs unused (accepted only to match the signature of the overridden method)
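
The interaction between compiled_dspy_model_path and skip_optimization can be summarized with a small decision sketch. The function choose_fit_strategy and its return labels are hypothetical. The precedence shown (a given path wins over skip_optimization) is an assumption, since the page does not say what happens when both are set.

```python
from pathlib import Path


def choose_fit_strategy(compiled_dspy_model_path=None, skip_optimization=False):
    """Return a label for the branch fit would take (illustrative only)."""
    if compiled_dspy_model_path is not None:
        # Assumption: a precompiled program takes precedence over the flag.
        return "load_compiled"
    if skip_optimization:
        # Keep the dspy program as provided, without running an optimizer.
        return "use_unoptimized"
    # Default: optimize the program against the training dataset.
    return "optimize"


strategy = choose_fit_strategy(compiled_dspy_model_path=Path("compiled_model.json"))
```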

source method FieldExtractor.predict(X: DataFrame[InputDataFrameWithKnowledge])DataFrame[DFOutput]

Generic transform operation.

source classmethod FieldExtractor.filter_training_dataset(y: MagohDataset, ids: set[InterventionId])set[InterventionId]

Among the given set of intervention records, select only those with suitable answers for a training or an evaluation.

Raises

  • NotImplementedError
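
Because the base method raises NotImplementedError, each concrete extractor is expected to provide its own filter. The sketch below shows one plausible override. A plain dict mapping intervention ids to answers stands in for MagohDataset, and both class names are hypothetical.

```python
class BaseExtractorSketch:
    """Hypothetical stand-in for the abstract FieldExtractor."""

    @classmethod
    def filter_training_dataset(cls, y, ids):
        raise NotImplementedError


class StartDateExtractorSketch(BaseExtractorSketch):
    """Hypothetical concrete extractor for a single field."""

    @classmethod
    def filter_training_dataset(cls, y, ids):
        # Keep only the records whose expected answer is present.
        return {i for i in ids if y.get(i) is not None}


# Stand-in dataset: intervention id -> expected answer (None when missing).
answers = {1: "1998-05-02", 2: None, 3: "2001-11-30"}
usable_ids = StartDateExtractorSketch.filter_training_dataset(answers, {1, 2, 3, 4})
```

What counts as a "suitable answer" depends on the field, so the non-None test here is only one possible criterion.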

source method FieldExtractor.score(X: DataFrame[InputDataFrameWithKnowledge], y: MagohDataset, sample_weight=None)

Run a local evaluation of the dspy model over the given X dataset.

Also save the per-field results for each test record in a cached dataframe, accessible after the function call through the score_results property (it is no longer None after a successful run of this method).

To fit the sklearn Classifier interface, this method returns a single floating-point metric value for the model.
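
The pattern of caching detailed results while returning a single float can be sketched as follows. This is an illustration under stated assumptions: ScoringSketch is hypothetical, a list of dicts stands in for the cached dataframe, and plain accuracy is used as the reduced metric, which may differ from the metric the real method computes.

```python
from statistics import mean


class ScoringSketch:
    """Hypothetical scorer mirroring the score / score_results pairing."""

    def __init__(self):
        # Stays None until score() completes successfully.
        self.score_results = None

    def score(self, predictions, expected):
        if not expected:
            raise ValueError("cannot score an empty evaluation set")
        # Per-record details, kept for later inspection.
        rows = [
            {"id": key, "correct": predictions.get(key) == answer}
            for key, answer in expected.items()
        ]
        self.score_results = rows
        # Reduce to one float (accuracy here, as an assumption).
        return mean(1.0 if row["correct"] else 0.0 for row in rows)


scorer = ScoringSketch()
accuracy = scorer.score({"a": 1, "b": 2}, {"a": 1, "b": 3})
```

Returning a float keeps the object usable wherever a sklearn-style score is expected, while the cached rows retain the per-record detail.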

source method FieldExtractor.score_and_transform(X, y)

source staticmethod FieldExtractor.field_to_be_extracted()str

A human label/description of the field related to the Extractor.

Raises

  • NotImplementedError

source property FieldExtractor.signature_example

Return an example of input/output dict pair for the dspy model.

This property is useful for logging the model with mlflow.

source property FieldExtractor.lm

Return the llm model.