
How-to guides

Get records and PDF files from the Magoh Dataset

To train or score any extraction model, you will need to instantiate a MagohDataset object. You can select the records by providing either:

  1. A specific set of intervention identifiers:
    from archaeo_super_prompt.dataset import MagohDataset
    dataset = MagohDataset({36187, 31977, 34787})
    
  2. A number of records, together with a seed value, for random sampling of the records:
    from archaeo_super_prompt.dataset import MagohDataset, SamplingParams
    # select 100 records, regardless of when they were written
    dataset = MagohDataset(SamplingParams(100, 0.2, False))
    

Build the old extraction model

The first extraction model attempted to extract all of the Magoh fields from the PDF document. See the archaeo_super_prompt.modeling.legacy_predict module for an example of its initialization. You may want to edit the parameters of the vision LM and the LLM of the two submodels to match your remote LLM models.
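As a minimal sketch, and assuming the submodels read their language models through DSPy's standard LM configuration (the model names and endpoints below are placeholders for your own deployment):

import dspy

# Placeholder model names and endpoints: adapt them to your own deployment.
vision_lm = dspy.LM("openai/gpt-4o", api_base="http://localhost:8000/v1", api_key="...")
text_lm = dspy.LM("openai/gpt-4o-mini", api_base="http://localhost:8000/v1", api_key="...")

# Make the text LM the default for the DSPy modules.
dspy.configure(lm=text_lm)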

Build the latest extraction model

The latest general extraction model can predict the start date of an archaeological intervention and the municipality where it took place. This model can be fitted and scored.

The whole model is a directed acyclic graph (DAG) built with the skdag library, and can be obtained from the get_training_dag function. See the archaeo_super_prompt.modeling.train and archaeo_super_prompt.modeling.predict modules for detailed examples. As with the old extraction model, you may want to edit the parameters of the submodels according to your available LLMs.

Fitting and scoring can be programmed as in the standard scikit-learn Pipeline framework.

from archaeo_super_prompt.modeling.train import get_training_dag

# Build the DAG components: a builder for the shared preprocessing steps, the
# per-field extractor components, and the component that merges their outputs.
(
  preprocessing_dag_builder,
  extractor_components,
  final_union_component
) = get_training_dag()

_, (comune_extractor, _) = extractor_components
preprocessing_dag = preprocessing_dag_builder.make_dag()

# train_pdf_paths, eval_pdf_paths and magoh_train_dataset are assumed to be
# defined beforehand (see the section above about building a MagohDataset).
preprocessed_train_inputs = preprocessing_dag.fit_transform(
  train_pdf_paths,
  magoh_train_dataset
)
preprocessed_eval_inputs = preprocessing_dag.transform(
  eval_pdf_paths,
  magoh_train_dataset
)

# Fit the municipality ("comune") extractor on its dedicated preprocessed input,
# then score it on the held-out inputs.
municipalities_in_train_records = comune_extractor.fit_predict(
  preprocessed_train_inputs["comune-CM"],
  magoh_train_dataset
)
score = comune_extractor.score(
  preprocessed_eval_inputs["comune-CM"],
  magoh_train_dataset
)

See notebooks/exploratory/3.0-Convolutio-complete_pipeline.ipynb for a complete overview of the training and evaluation of the model.

To help you choose useful Magoh records for training and evaluating a FieldExtractor, use the filter_training_dataset method, as sketched below.
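Here is a sketch of its usage; the exact signature of filter_training_dataset is an assumption (we suppose it takes a MagohDataset and returns the filtered subset):

from archaeo_super_prompt.dataset import MagohDataset, SamplingParams

# Sample candidate records, then keep only the ones that are relevant for this
# field. The signature of filter_training_dataset is assumed here: a
# MagohDataset in, a filtered MagohDataset out.
candidate_dataset = MagohDataset(SamplingParams(100, 0.2, False))
train_dataset = comune_extractor.filter_training_dataset(candidate_dataset)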

Inspect the prompts

The previous notebook also shows how to set up MLflow to auto-trace the DSPy operations during the fitting and inference of the field extractors. For this notebook to run correctly, you need to start the MLflow server by running this command inside your environment:

just serve-tracer

With MLflow tracing enabled, you can inspect the prompt exchanges between DSPy and your remote LLM. Links to the experiments and runs are printed at runtime in the output cells of the notebook.
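As a minimal sketch of such a setup, assuming a recent MLflow version with DSPy autologging support and a tracker served locally by just serve-tracer (the tracking URI and experiment name below are placeholders):

import mlflow

# Placeholder URI: adapt it to the server started by `just serve-tracer`.
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("magoh-field-extraction")

# Automatically trace the DSPy calls (prompts, completions, module invocations).
mlflow.dspy.autolog()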

Visualize the evaluation results

The field extractors provide a score_and_transform method which returns a dataframe containing the metric values along with the predicted and expected results; this dataframe conforms to the ResultSchema schema.

You can concatenate this dataframe with the ResultSchema dataframes output by the other field extractors, so that you can compare the metric values across the predicted fields. The archaeo_super_prompt.visualization module can plot the metric for all the fields in such a dataframe. See the score_dag function in the archaeo_super_prompt.modeling.predict module for more details.

It is also possible to use the mlflow.log_metric method to save and visualize all the per-field metrics in MLflow.
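For instance, here is a sketch of concatenating two result dataframes and logging one aggregate metric per field; the "field" and "metric" column names are assumptions about ResultSchema, as are the result variables:

import mlflow
import pandas as pd

# comune_results and data_results are assumed to be ResultSchema dataframes
# returned by score_and_transform on two different field extractors.
all_results = pd.concat([comune_results, data_results])

# "field" and "metric" are assumed ResultSchema column names.
for field, mean_metric in all_results.groupby("field")["metric"].mean().items():
    mlflow.log_metric(f"{field}_mean_metric", mean_metric)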

Implement another FieldExtractor

To extend the DAG with the extraction of another field, implement the FieldExtractor abstract class with a new child class inside the archaeo_super_prompt.modeling.struct_extract.extractors submodule.

You will have to implement a DSPy module that fits the input and output pydantic BaseModels you define for your extraction task. Its forward method must be written consistently with these models, so that your DSPy module binds type-safely to your FieldExtractor child class:

  1. The input arguments of the method are the fields of the input pydantic model.
  2. The return value of the method is a dspy.Prediction built from an instance of the output pydantic model.

See the ComuneExtractor implementation for an example.
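Below is a minimal, hypothetical sketch of such a module: the pydantic models, the signature string and the field names are placeholders, not the actual ComuneExtractor code:

import dspy
from pydantic import BaseModel

# Hypothetical input/output models; define yours according to the field to extract.
class MunicipalityInput(BaseModel):
    report_excerpt: str

class MunicipalityOutput(BaseModel):
    municipality: str

class MunicipalityModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predictor = dspy.Predict("report_excerpt -> municipality")

    def forward(self, report_excerpt: str) -> dspy.Prediction:
        # 1. The input arguments mirror the fields of the input pydantic model.
        result = self.predictor(report_excerpt=report_excerpt)
        output = MunicipalityOutput(municipality=result.municipality)
        # 2. The return value is a dspy.Prediction built from the output model.
        return dspy.Prediction(**output.model_dump())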

Pay attention to the implementation of filter_training_dataset, which should rely on domain knowledge about the field to be predicted. For some fields, the stored values can in fact be problematic, either because of insertion errors by the contributors or because some interventions are not useful for training or evaluation. An example is the duration of an intervention (see the page in the references/ directory for more details).
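As an illustration, a hypothetical filter_training_dataset for a duration extractor could discard such records; the accessors used below (records, duration, subset) are assumptions, not the actual MagohDataset API:

# Hypothetical: drop records whose recorded duration is missing or non-positive,
# since such values usually come from insertion errors by the contributors.
def filter_training_dataset(self, dataset):
    valid_ids = {
        record_id
        for record_id, record in dataset.records.items()  # assumed accessor
        if record.duration is not None and record.duration > 0
    }
    return dataset.subset(valid_ids)  # assumed helper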