archaeo_super_prompt.modeling.struct_extract.field_extractor

source module archaeo_super_prompt.modeling.struct_extract.field_extractor

Generic pipeline Transformer for extracting one field from featured chunks.

This transformer is a classifier which is scorable and trainable.

Classes

FieldExtractor — Abstract class for extracting one field from featured chunks.

Functions

to_prediction — Call this function with the pydantic-typed output to build the value returned from forward.
prediction_to_output — Inverse of to_prediction.

source to_prediction(output: BaseModel) → dspy.Prediction

Call this function with the pydantic-typed output to build the value returned from forward.

source prediction_to_output[DSPyOutput](output_constructor: type[DSPyOutput], pred: dspy.Prediction) → DSPyOutput

Inverse of to_prediction.

Expects the prediction to have been built by the to_prediction function above.
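
The round trip between a typed output and a prediction can be pictured with the minimal sketch below. It does not use the real implementation: FakePrediction is a hypothetical stand-in for dspy.Prediction, DatingOutput is an invented dataclass standing in for a pydantic BaseModel, and the two `_sketch` functions only mimic the documented signatures.

```python
from dataclasses import asdict, dataclass
from typing import Any, TypeVar

T = TypeVar("T")


class FakePrediction:
    """Hypothetical stand-in for dspy.Prediction: a bag of named fields."""

    def __init__(self, **fields: Any) -> None:
        self._fields = dict(fields)

    def items(self):
        return self._fields.items()


@dataclass
class DatingOutput:
    """Hypothetical typed output; the documented code expects a pydantic BaseModel."""

    period: str
    confidence: float


def to_prediction_sketch(output: Any) -> FakePrediction:
    # Flatten the typed output into keyword fields of the prediction.
    return FakePrediction(**asdict(output))


def prediction_to_output_sketch(output_constructor: type[T], pred: FakePrediction) -> T:
    # Rebuild the typed output by expanding the prediction fields as keyword arguments.
    return output_constructor(**dict(pred.items()))


original = DatingOutput(period="Roman", confidence=0.8)
restored = prediction_to_output_sketch(DatingOutput, to_prediction_sketch(original))
```

Under these assumptions, `restored` compares equal to `original`, which is the property the "inverse" wording describes.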

source class FieldExtractor[DSPyInput: BaseModel, DSPyOutput: BaseModel, InputDataFrameWithKnowledge: extract_input_type.BaseInputForExtraction, InputDataFrameWithKnowledgeRowSchema: extract_input_type.BaseInputForExtractionRowSchema, DFOutput: BasePerInterventionFeatureSchema](llm_model_provider: LLMProvider, llm_model_id: str, llm_temperature: float, model: dspy.Module, example: tuple[DSPyInput, DSPyOutput], output_constructor: type[DSPyOutput])

Bases : DetailedEvaluatorMixin[DataFrame[InputDataFrameWithKnowledge], MagohDataset, DataFrame[ResultSchema]], ABC

Abstract class for extracting one field from featured chunks.

Initialize the abstract class with the custom dspy module.

Genericity

Because Python's type checking support is limited, the genericity constraints are made explicit here:

- DInput is a subtype of TypedDict, whose keys carry the semantics used by the DSPy model as input in its forward method.
- DOutput is a subtype of TypedDict.
- DFOutputType is a subtype of pandera.pandas.DataFrameModel.

Parameters

llm_model_provider : LLMProvider — the service from which the llm must be fetched
llm_model_id : str — the identifier of the dspy chat lm to be used for the extraction
llm_temperature : float — the temperature of the llm for the prompts of this model
model : dspy.Module — the dspy module which will be used for the training and the inference
example : tuple[DSPyInput, DSPyOutput] — a dspy input-output pair used to type check the generic parameters at runtime and to log the model in mlflow
output_constructor : type[DSPyOutput] — the type of the output model, used to build it generically from dictionary expansion

Environment variables

Depending on the llm provider, one of the following environment variables is expected:

- OPENAI_API_KEY
- OLLAMA_SERVER_BASE_URL (defaults to http://localhost:11434)
- VLLM_SERVER_BASE_URL (defaults to http://localhost:8006/v1)
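
As an illustration of how these variables and their defaults could be resolved, here is a minimal sketch. The function name `resolve_llm_env` and the provider labels "openai", "ollama" and "vllm" are assumptions made for the example. Only the variable names and default URLs come from the text above.

```python
import os

# Defaults quoted from the documentation above.
DEFAULT_BASE_URLS = {
    "OLLAMA_SERVER_BASE_URL": "http://localhost:11434",
    "VLLM_SERVER_BASE_URL": "http://localhost:8006/v1",
}


def resolve_llm_env(provider: str, env=None) -> dict:
    """Return the settings read from the environment for one provider.

    The provider labels are hypothetical; the real LLMProvider type may differ.
    """
    env = os.environ if env is None else env
    if provider == "openai":
        key = env.get("OPENAI_API_KEY")
        if not key:
            # No default exists for an API key, so fail early.
            raise RuntimeError("OPENAI_API_KEY must be set for the OpenAI provider")
        return {"api_key": key}
    if provider == "ollama":
        name = "OLLAMA_SERVER_BASE_URL"
    elif provider == "vllm":
        name = "VLLM_SERVER_BASE_URL"
    else:
        raise ValueError(f"unknown provider: {provider}")
    # Fall back to the documented default when the variable is unset.
    return {"base_url": env.get(name, DEFAULT_BASE_URLS[name])}
```

Passing an explicit mapping as `env` keeps the sketch testable without touching the process environment.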

Attributes

signature_example — Return an example of input/output dict pair for the dspy model.
lm — Return the llm model.

Methods

fit — Optimize the dspy model according to the given dataset.
predict — Generic transform operation.
filter_training_dataset — Among the given set of intervention records, select only those with suitable answers for a training or an evaluation.
score — Run a local evaluation of the dspy model over the given X dataset.
score_and_transform
field_to_be_extracted — A human label/description of the field related to the Extractor.

source method FieldExtractor.fit(X: DataFrame[InputDataFrameWithKnowledge], y: MagohDataset, *, compiled_dspy_model_path: Path | None = None, skip_optimization=False, **kwargs)

Optimize the dspy model according to the given dataset.

Parameters

X : DataFrame[InputDataFrameWithKnowledge] — the input dataframe with the required fields for the FieldExtractor
y : MagohDataset — the Magoh training dataset
compiled_dspy_model_path : Path | None — if given, a path to an already optimized dspy model, so that this prompt model is used directly without reoptimizing the program
skip_optimization — if set to True, the model is fitted with the unoptimized dspy program
kwargs — unused (accepted only to match the signature of the overridden method)
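
The way the two keyword-only options interact can be sketched as a small decision function. This is not the library's code: `choose_fit_strategy` and its return labels are hypothetical, and the precedence of the compiled path over `skip_optimization` is an assumption, since the documentation does not state what happens when both are given.

```python
from pathlib import Path
from typing import Optional


def choose_fit_strategy(
    compiled_dspy_model_path: Optional[Path] = None,
    skip_optimization: bool = False,
) -> str:
    """Return an illustrative label for the program that fit would end up with."""
    if compiled_dspy_model_path is not None:
        # Assumed precedence: an already optimized program is loaded as-is.
        return "load_compiled"
    if skip_optimization:
        # Keep the dspy program as constructed, without optimization.
        return "use_unoptimized"
    # Default: optimize the program against the training dataset.
    return "optimize"
```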

source method FieldExtractor.predict(X: DataFrame[InputDataFrameWithKnowledge]) → DataFrame[DFOutput]

Generic transform operation.

source classmethod FieldExtractor.filter_training_dataset(y: MagohDataset, ids: set[InterventionId]) → set[InterventionId]

Among the given set of intervention records, select only those with suitable answers for a training or an evaluation.

Raises

NotImplementedError

source method FieldExtractor.score(X: DataFrame[InputDataFrameWithKnowledge], y: MagohDataset, sample_weight=None)

Run a local evaluation of the dspy model over the given X dataset.

Also saves the per-field results for each test record in a cached dataframe, accessible after the call through the score_results property (which is no longer None after a successful run of this method).

To fit the sklearn Classifier interface, this method returns a single floating-point metric value for the model.
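
The pattern described here, a reduced float as return value plus detailed results cached on the instance, can be sketched as follows. ScoringSketch is a hypothetical stand-in: it uses plain dictionaries and exact-match accuracy instead of the real dataframes and metric, which this page does not specify.

```python
from statistics import mean
from typing import Optional


class ScoringSketch:
    """Stand-in showing the caching pattern only, not the real evaluator."""

    def __init__(self) -> None:
        self._score_results: Optional[list] = None

    @property
    def score_results(self) -> Optional[list]:
        # None until score() has completed successfully.
        return self._score_results

    def score(self, predictions: dict, expected: dict) -> float:
        # One detailed row per test record (assumed metric: exact match).
        rows = [
            {
                "id": record_id,
                "predicted": predictions.get(record_id),
                "expected": answer,
                "metric": 1.0 if predictions.get(record_id) == answer else 0.0,
            }
            for record_id, answer in expected.items()
        ]
        reduced = mean(row["metric"] for row in rows)
        # Cache the per-record details only once the metric has been computed.
        self._score_results = rows
        return reduced
```

Returning a single float keeps the object usable wherever an sklearn-style `score` is expected, while the cached rows remain available for a finer analysis.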

source method FieldExtractor.score_and_transform(X, y)

source staticmethod FieldExtractor.field_to_be_extracted() → str

A human label/description of the field related to the Extractor.

Raises

NotImplementedError
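
Since filter_training_dataset and field_to_be_extracted both raise NotImplementedError in the abstract class, a concrete extractor is expected to override them. The sketch below shows the shape of such overrides on a reduced stand-in base class. ExtractorBaseSketch, PeriodExtractorSketch, the dictionary used in place of MagohDataset, and the rule "an answer is suitable when it is present" are all assumptions for illustration.

```python
from abc import ABC
from typing import Optional


class ExtractorBaseSketch(ABC):
    """Hypothetical stand-in for FieldExtractor, reduced to the two hooks."""

    @classmethod
    def filter_training_dataset(cls, y: dict, ids: set) -> set:
        raise NotImplementedError

    @staticmethod
    def field_to_be_extracted() -> str:
        raise NotImplementedError


class PeriodExtractorSketch(ExtractorBaseSketch):
    """Hypothetical extractor for one field."""

    @classmethod
    def filter_training_dataset(cls, y: dict, ids: set) -> set:
        # Keep only the records whose reference answer is present (assumed rule).
        return {i for i in ids if y.get(i) is not None}

    @staticmethod
    def field_to_be_extracted() -> str:
        return "Historical period of the intervention"


answers: dict = {1: "Roman", 2: None}
usable = PeriodExtractorSketch.filter_training_dataset(answers, {1, 2, 3})
```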

source property FieldExtractor.signature_example

Return an example of input/output dict pair for the dspy model.

This property is useful for logging the model with mlflow.

source property FieldExtractor.lm

Return the llm model.