archaeo_super_prompt.modeling.struct_extract.field_extractor
source module archaeo_super_prompt.modeling.struct_extract.field_extractor
Generic pipeline Transformer for extracting one field from featured chunks.
This transformer is a classifier that is both scorable and trainable.
Classes
-
FieldExtractor — Abstract class for extracting one field from featured chunks.
Functions
-
to_prediction — Convert the pydantic-typed output into the dspy.Prediction to return from forward.
-
prediction_to_output — Inverse of to_prediction.
source to_prediction(output: BaseModel) → dspy.Prediction
Convert the pydantic-typed output into the dspy.Prediction to return from forward.
source prediction_to_output[DSPyOutput](output_constructor: type[DSPyOutput], pred: dspy.Prediction) → DSPyOutput
Inverse of to_prediction.
Expects the prediction to have been built by the to_prediction function above.
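The round trip between the two helpers can be pictured with the minimal sketch below. It uses standard-library stand-ins only: FakePrediction and DateOutput are hypothetical substitutes for dspy.Prediction and a pydantic output model, and rebuilding the output by keyword expansion is an assumption drawn from the output_constructor description, so the real implementation may differ.

```python
from dataclasses import asdict, dataclass


class FakePrediction:
    """Hypothetical stand-in for dspy.Prediction: a bag of named fields."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def to_dict(self):
        return dict(self._fields)


@dataclass
class DateOutput:
    """Hypothetical stand-in for a pydantic-typed output model."""

    year: int
    confidence: float


def to_prediction(output):
    # Flatten the typed output into the fields of a prediction.
    return FakePrediction(**asdict(output))


def prediction_to_output(output_constructor, pred):
    # Assumption: the typed output is rebuilt by expanding the fields
    # of the prediction as keyword arguments of the constructor.
    return output_constructor(**pred.to_dict())


original = DateOutput(year=1987, confidence=0.8)
restored = prediction_to_output(DateOutput, to_prediction(original))
print(restored == original)
```

Passing the constructor explicitly keeps the inverse function generic over the output type, which appears to be why output_constructor is also a parameter of FieldExtractor.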
source class FieldExtractor[DSPyInput: BaseModel, DSPyOutput: BaseModel, InputDataFrameWithKnowledge: extract_input_type.BaseInputForExtraction, InputDataFrameWithKnowledgeRowSchema: extract_input_type.BaseInputForExtractionRowSchema, DFOutput: BasePerInterventionFeatureSchema](llm_model_provider: LLMProvider, llm_model_id: str, llm_temperature: float, model: dspy.Module, example: tuple[DSPyInput, DSPyOutput], output_constructor: type[DSPyOutput])
Bases : DetailedEvaluatorMixin[DataFrame[InputDataFrameWithKnowledge], MagohDataset, DataFrame[ResultSchema]], ABC
Abstract class for extracting one field from featured chunks.
Initialize the abstract class with the custom dspy module.
Genericity
Python's type system cannot express every constraint on these type parameters, so the constraints are stated explicitly here:
- DInput is a subtype of TypedDict whose keys carry the semantics used by the DSPy model as input in its forward method.
- DOutput is a subtype of TypedDict.
- DFOutputType is a subtype of pandera.pandas.DataFrameModel.
Parameters
-
llm_model_provider : LLMProvider — the service from which the llm must be fetched
-
llm_model_id : str — the dspy chat lm to be used for the extraction
-
llm_temperature : float — the temperature of the llm when it is prompted by this model
-
model : dspy.Module — the dspy module used for training and inference
-
example : tuple[DSPyInput, DSPyOutput] — a dspy input-output pair used to type check the generic parameters at runtime and to log the model in mlflow
-
output_constructor : type[DSPyOutput] — the type of the output model, used to build it generically by dictionary expansion
Environment variables
Depending on the llm provider, one of the following environment variables is required:
- OPENAI_API_KEY
- OLLAMA_SERVER_BASE_URL (defaults to http://localhost:11434)
- VLLM_SERVER_BASE_URL (defaults to http://localhost:8006/v1)
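To make the lookup concrete, here is a minimal sketch of how these variables might be resolved. The helper resolve_provider_setting and the provider names "openai", "ollama" and "vllm" are hypothetical (the actual values of LLMProvider are not documented here). The fallback to the default URLs is inferred from the defaults listed above.

```python
import os

# Defaults quoted from the environment variable list above.
DEFAULT_BASE_URLS = {
    "OLLAMA_SERVER_BASE_URL": "http://localhost:11434",
    "VLLM_SERVER_BASE_URL": "http://localhost:8006/v1",
}

# Hypothetical provider names mapped to their base URL variable.
BASE_URL_VARIABLES = {
    "ollama": "OLLAMA_SERVER_BASE_URL",
    "vllm": "VLLM_SERVER_BASE_URL",
}


def resolve_provider_setting(provider, environ=None):
    """Hypothetical helper: return (variable name, value) for a provider."""
    environ = os.environ if environ is None else environ
    if provider == "openai":
        # No default is documented for the API key, so it must be set.
        key = environ.get("OPENAI_API_KEY")
        if not key:
            raise RuntimeError("OPENAI_API_KEY must be set")
        return "OPENAI_API_KEY", key
    variable = BASE_URL_VARIABLES[provider]
    # Assumption: an unset base URL falls back to the documented default.
    return variable, environ.get(variable, DEFAULT_BASE_URLS[variable])


print(resolve_provider_setting("ollama", environ={}))
```

Accepting an optional environ mapping makes the lookup easy to exercise without touching the real process environment.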
Attributes
-
signature_example — Return an example of input/output dict pair for the dspy model.
-
lm — Return the llm model.
Methods
-
fit — Optimize the dspy model according to the given dataset.
-
predict — Generic transform operation.
-
filter_training_dataset — Among the given set of intervention records, select only those with answers suitable for training or evaluation.
-
score — Run a local evaluation of the dspy model over the given X dataset.
-
field_to_be_extracted — A human label/description of the field related to the Extractor.
source method FieldExtractor.fit(X: DataFrame[InputDataFrameWithKnowledge], y: MagohDataset, *, compiled_dspy_model_path: Path | None = None, skip_optimization=False, **kwargs)
Optimize the dspy model according to the given dataset.
Parameters
-
X : DataFrame[InputDataFrameWithKnowledge] — the input dataframe with the required fields for the FieldExtractor
-
y : MagohDataset — the Magoh training dataset
-
compiled_dspy_model_path : Path | None — if given, a path to an already optimized dspy model; this prompt model is then used directly, without re-optimizing the program
-
skip_optimization — if set to True, the model is fitted with the unoptimized dspy program
-
kwargs — unused (accepted only to match the signature of the overridden method)
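The three ways fit can behave, as described by the parameters above, can be summarized with the following sketch. The helper choose_fit_strategy is hypothetical and only names the branch taken. In particular, giving a compiled model path priority over skip_optimization is an assumption, since the precedence is not documented.

```python
from pathlib import Path


def choose_fit_strategy(compiled_dspy_model_path=None, skip_optimization=False):
    """Hypothetical helper naming the branch that fit would take."""
    if compiled_dspy_model_path is not None:
        # Assumption: an already optimized program wins over the flag.
        return ("load", Path(compiled_dspy_model_path))
    if skip_optimization:
        # The unoptimized dspy program is used as-is.
        return ("use-unoptimized", None)
    # Default: optimize the program on the training dataset.
    return ("optimize", None)


print(choose_fit_strategy())
print(choose_fit_strategy(skip_optimization=True))
print(choose_fit_strategy(compiled_dspy_model_path="compiled.json"))
```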
source method FieldExtractor.predict(X: DataFrame[InputDataFrameWithKnowledge]) → DataFrame[DFOutput]
Generic transform operation.
source classmethod FieldExtractor.filter_training_dataset(y: MagohDataset, ids: set[InterventionId]) → set[InterventionId]
Among the given set of intervention records, select only those with answers suitable for training or evaluation.
Raises
-
NotImplementedError
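Because the base implementation raises NotImplementedError, a concrete extractor presumably overrides this classmethod. The sketch below shows one way such an override could look. BaseExtractorSketch and YearExtractorSketch are hypothetical names, a plain dict stands in for MagohDataset, and "suitable" is assumed to mean that a non-empty answer exists for the record.

```python
class BaseExtractorSketch:
    """Hypothetical stand-in for the abstract FieldExtractor."""

    @classmethod
    def filter_training_dataset(cls, y, ids):
        raise NotImplementedError


class YearExtractorSketch(BaseExtractorSketch):
    """Hypothetical extractor for a year field."""

    @classmethod
    def filter_training_dataset(cls, y, ids):
        # Assumption: y maps intervention ids to the expected answer,
        # and a record is suitable when that answer is present.
        return {record_id for record_id in ids if y.get(record_id) is not None}


answers = {1: 1990, 2: None, 4: 2003}
print(YearExtractorSketch.filter_training_dataset(answers, {1, 2, 3}))
```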
source method FieldExtractor.score(X: DataFrame[InputDataFrameWithKnowledge], y: MagohDataset, sample_weight=None)
Run a local evaluation of the dspy model over the given X dataset.
Also saves the per-field results for each test record in a cached dataframe, accessible after the call through the score_results property (which is no longer None after a successful run of this method).
To fit the sklearn Classifier interface, this method returns a single reduced floating-point metric value for the model.
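The pattern of caching detailed results while returning one scalar can be sketched as follows. ScoringSketch is a hypothetical, heavily simplified class: it takes plain dicts instead of dataframes, compares answers by equality, and reduces with a mean, none of which is confirmed by the documentation above.

```python
from statistics import mean


class ScoringSketch:
    """Hypothetical illustration of the score / score_results contract."""

    def __init__(self):
        # No detailed results exist before the first successful score call.
        self.score_results = None

    def score(self, predictions, expected):
        # One metric value per test record (1.0 for a match, else 0.0).
        per_record = {
            record_id: float(predictions.get(record_id) == answer)
            for record_id, answer in expected.items()
        }
        # Cache the detailed results, then return one reduced value.
        self.score_results = per_record
        return mean(list(per_record.values()))


scorer = ScoringSketch()
value = scorer.score({1: "a", 2: "b"}, {1: "a", 2: "c"})
print(value, scorer.score_results)
```

Returning a single float keeps the object usable wherever an sklearn-style score is expected, while the cached results remain available for a per-record inspection.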
source method FieldExtractor.score_and_transform(X, y)
source staticmethod FieldExtractor.field_to_be_extracted() → str
A human label/description of the field related to the Extractor.
Raises
-
NotImplementedError
source property FieldExtractor.signature_example
Return an example of input/output dict pair for the dspy model.
This property is useful for logging with mlflow.
source property FieldExtractor.lm
Return the llm model.