How-to guides
Get records and PDF files from the Magoh Dataset
To train or score any extraction model, you will need to instantiate a
MagohDataset object. You can select the records by providing either:
- A specific set of intervention identifiers:
from archaeo_super_prompt.dataset import MagohDataset

dataset = MagohDataset({36187, 31977, 34787})

- A number of records and a seed value for a random sampling of the records:
from archaeo_super_prompt.dataset import MagohDataset, SamplingParams

# select 100 records, regardless of when they were written
dataset = MagohDataset(SamplingParams(100, 0.2, False))
Build the old extraction model
The first extraction model tried to extract all of the Magoh fields from the PDF
document. See the archaeo_super_prompt.modeling.legacy_predict module for an
example of initialization. You may want to edit the parameters of the vision LM
and the LLM of the two submodels according to the remote LLM models you have
access to.
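For illustration, if the submodels are configured through DSPy, pointing them at your own models might look like the following sketch; the model names, endpoint, and variable names are placeholders rather than the project's actual parameters:

import dspy

# placeholders: use the vision LM and the LLM you actually have access to
vision_lm = dspy.LM("openai/gpt-4o", api_base="https://your-endpoint/v1", api_key="...")
text_lm = dspy.LM("openai/gpt-4o-mini", api_base="https://your-endpoint/v1", api_key="...")
dspy.configure(lm=text_lm)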
Build the latest extraction model
The latest general extraction model can predict the start date of an archaeological intervention and the municipality where it took place. This model can be fitted and scored.
The whole model is a Directed Acyclic Graph built with the skdag library, and
can be handled through the get_training_dag function. See the
archaeo_super_prompt.modeling.train and
archaeo_super_prompt.modeling.predict modules for detailed examples. As with
the old extraction model, you may want to edit the parameters of the submodels
according to your available LLMs.
The fitting and the scoring can be programmed as with the standard scikit-learn Pipeline framework:
from archaeo_super_prompt.modeling.train import get_training_dag
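
# train_pdf_paths, eval_pdf_paths and magoh_train_dataset are assumed to be
# defined beforehand (a MagohDataset as built in the first section)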
(
preprocessing_dag_builder,
extractor_components,
final_union_component
) = get_training_dag()
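# pick the municipality ("comune") extractor out of the extractor components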
_, (comune_extractor, _) = extractor_components
preprocessing_dag = preprocessing_dag_builder.make_dag()
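# fit the preprocessing steps on the training PDFs and transform them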
preprocessed_train_inputs = preprocessing_dag.fit_transform(
train_pdf_paths,
magoh_train_dataset
)
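# reuse the fitted preprocessing DAG on the evaluation PDFs (no refitting)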
preprocessed_eval_inputs = preprocessing_dag.transform(
eval_pdf_paths,
magoh_train_dataset
)
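# fit the municipality extractor on its dedicated preprocessed inputs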
municipalities_in_train_records = comune_extractor.fit_predict(
preprocessed_train_inputs["comune-CM"],
magoh_train_dataset
)
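# score the fitted extractor on the evaluation inputs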
score = comune_extractor.score(
preprocessed_eval_inputs["comune-CM"],
magoh_train_dataset
)
See the notebooks/exploratory/3.0-Convolutio-complete_pipeline.ipynb notebook
for a complete overview of the training and evaluation of the model.
The filter_training_dataset method helps the developer choose useful Magoh
records for the training and evaluation of a FieldExtractor.
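A usage sketch, assuming the method takes a candidate dataset and returns the records worth keeping; check the FieldExtractor base class for the actual signature:

# hypothetical call and variable names: the real signature may differ
magoh_train_dataset = comune_extractor.filter_training_dataset(magoh_full_dataset)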
Inspect the prompts
The previous notebook also sets up MLflow to auto-trace the DSPy operations during the fitting and inference of the field extractors. For the notebook to run correctly, you will need to start the MLflow server by running this command inside your environment:
just serve-tracer
With MLflow tracing, you can inspect the prompt exchanges between DSPy and your remote LLM. The links to the experiments and runs are printed at runtime in the output cells of the notebook.
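As a minimal sketch, the setup in the notebook might resemble the following; the tracking URI and experiment name are assumptions to adapt to your environment:

import mlflow

# assumption: the server started by `just serve-tracer` listens on the default local port
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("field-extraction")

# auto-trace every DSPy call (prompts, completions, intermediate steps)
mlflow.dspy.autolog()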
Visualize the evaluation results
The field extractors have a score_and_transform method, which returns a
dataframe with the metric values and the predicted and expected results; this
dataframe conforms to the ResultSchema schema.
You can concatenate this dataframe with the ResultSchema dataframes output by
the other field extractors, so you can compare the metric values across the
fields to be predicted. The archaeo_super_prompt.visualization
module can plot the metric for all the fields in such a dataframe. See the
score_dag function in the archaeo_super_prompt.modeling.predict module for
more details.
It is also possible to use the mlflow.log_metric method to save and visualize
all the per-field metrics in MLflow.
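For instance, assuming the concatenated ResultSchema dataframe exposes a field identifier column and a metric column (the column names below are hypothetical), the per-field averages could be logged as follows:

import mlflow
import pandas as pd

# comune_results and start_date_results would come from score_and_transform;
# "field" and "metric" are assumed column names, to adapt to ResultSchema
all_results = pd.concat([comune_results, start_date_results])
with mlflow.start_run(run_name="field-extraction-eval"):
    for field_name, field_results in all_results.groupby("field"):
        mlflow.log_metric(f"{field_name}_mean_metric", field_results["metric"].mean())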
Implement another FieldExtractor
To extend the DAG with the extraction of another field, implement the
FieldExtractor abstract class in a new child class inside the
archaeo_super_prompt.modeling.struct_extract.extractors submodule.
You will have to implement a DSPy module that fits the input and output
pydantic BaseModels you define according to your extraction model. The
forward method must be defined consistently with these models so that your
DSPy module is type-safely bound to your FieldExtractor child
class:
- The input arguments of the method will be the fields of the input pydantic model
- The return value of the method will be a dspy.Prediction built from an instance of the output pydantic model.
See the ComuneExtractor implementation for an example.
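For illustration, here is a minimal sketch of such a module for a hypothetical field; the pydantic models, the signature string, and the attribute names are assumptions to map onto your own extraction model:

import dspy
from pydantic import BaseModel

# hypothetical input/output models for a new field (here, the intervention duration)
class DurationInput(BaseModel):
    report_excerpt: str

class DurationOutput(BaseModel):
    duration_in_days: int

class DurationModule(dspy.Module):
    def __init__(self):
        super().__init__()
        # hypothetical signature string; align it with your input/output models
        self.predict = dspy.Predict("report_excerpt -> duration_in_days: int")

    def forward(self, report_excerpt: str) -> dspy.Prediction:
        # the method's arguments are the fields of the input pydantic model
        result = self.predict(report_excerpt=report_excerpt)
        output = DurationOutput(duration_in_days=result.duration_in_days)
        # the return value is a dspy.Prediction built from the output model
        return dspy.Prediction(**output.model_dump())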
Please pay attention to the implementation of
filter_training_dataset, which can be written according to knowledge about the field to be predicted. Indeed, some fields have values which might cause problems, either because of insertion errors from the contributors or because some intervention cases are not useful for training or evaluation. An example is the duration of an intervention (see the page in the references/ directory for more details).
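As a final sketch, a hypothetical filter_training_dataset for such a field could exclude implausible records. Every accessor below (records(), record.id, the date attributes) is an assumption about the dataset API, so adapt it to the real MagohDataset interface:

from archaeo_super_prompt.dataset import MagohDataset

def filter_training_dataset(self, dataset: MagohDataset) -> MagohDataset:
    # hypothetical accessors: records(), record.id and the date attributes
    # are assumptions, not the documented MagohDataset API
    valid_ids = {
        record.id
        for record in dataset.records()
        if record.end_date is not None
        # drop durations that are negative or implausibly long, which
        # usually indicate insertion errors from the contributors
        and 0 < (record.end_date - record.start_date).days < 365 * 10
    }
    # the constructor accepting a set of intervention identifiers is the
    # documented way to build a MagohDataset
    return MagohDataset(valid_ids)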