archaeo_super_prompt.modeling.entity_extractor

source package archaeo_super_prompt.modeling.entity_extractor

Root of the module for running inference with the NER model.

The model extracts hints about each chunk to help the final LLM extract named values for specific fields.

Classes

NerModel — Transformer adding identified named-entity recognition features for each chunk.
NeSelector — Filter of chunks according to wanted strings among the entities.
ChunksWithThesaurus — For each filtered chunk, a list of the identified thesaurus terms.
NamedEntityField — Data for a structured data field with terms identifiable by NER.

source class NerModel(allowed_ner_confidence=0.7)

Bases : BaseTransformer

Transformer adding identified named-entity recognition features for each chunk.

Instantiate the Named Entity Recognition model.

Environment variables

The NER_MODEL_HOST_URL environment variable must be set to the base URL of the remote named entity recognition model (e.g. 'http://localhost:8004').
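
As a minimal sketch of the requirement above, the variable can be checked before the model is instantiated. The helper name `ner_base_url` is hypothetical and not part of the package; how NerModel itself reads the variable is not shown in this page.

```python
import os


def ner_base_url(env=os.environ) -> str:
    """Return the NER service base URL, failing early if it is not set.

    Hypothetical helper: it only illustrates validating the variable,
    not how the package reads it internally.
    """
    url = env.get("NER_MODEL_HOST_URL", "").strip()
    if not url:
        raise RuntimeError(
            "NER_MODEL_HOST_URL must be set, e.g. 'http://localhost:8004'"
        )
    # Drop a trailing slash so that paths can be appended consistently.
    return url.rstrip("/")
```

Calling such a check at start-up surfaces a missing variable before any chunk is sent to the remote model.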

Methods

transform

source method NerModel.transform(X: PDFChunkDataset) → DataFrame[EntitiesPerChunkSchema]

source class NeSelector(field_name: str, compatible_entities: set[NerXXLEntities], wanted_matches: ThesaurusProvider, keep_chunks_without_identified_values=False)

Bases : BaseTransformer

Filter of chunks according to wanted strings among the entities.

Initialize the Named Entity Selector from the data about the field.

Parameters

field_name : str — a label describing the entities to be extracted
compatible_entities : set[NerXXLEntities] — a set of entity types to consider for selecting the chunks
wanted_matches : ThesaurusProvider — a frozen function giving at runtime the list of matches (can be huge)
keep_chunks_without_identified_values — if True, chunks containing entities of the desired entity types are always kept, even if no thesaurus term has been identified among these entities. If False, these chunks are kept only when no chunk contains an identified thesaurus term.

Returns

A Transformer that selects only the chunks in which named thesaurus terms occur.
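
The effect of keep_chunks_without_identified_values can be sketched in plain Python. This is an illustration of the rule as described in the parameter list, not the package's implementation; the function `select_chunks` and the `(chunk_id, matched_ids)` pair layout are assumptions made for the example.

```python
def select_chunks(chunks, keep_chunks_without_identified_values=False):
    """Apply the keep/fallback rule to chunks that already contain
    entities of the desired types.

    `chunks` is a list of (chunk_id, matched_ids) pairs, where
    matched_ids lists the thesaurus identifiers found in the chunk
    (hypothetical layout, possibly empty).
    """
    if keep_chunks_without_identified_values:
        # Every chunk with compatible entities is kept.
        return list(chunks)
    with_matches = [chunk for chunk in chunks if chunk[1]]
    # Chunks without matches survive only if no chunk has a match.
    return with_matches if with_matches else list(chunks)
```

With the flag left at False, chunks without a match act as a fallback: they reach the LLM only when nothing better is available.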

Methods

transform — Filter the identified named entities and filter the chunks.

source method NeSelector.transform(X: DataFrame[ChunksWithEntities]) → DataFrame[ChunksWithThesaurus]

Filter the identified named entities and filter the chunks.

According to the information about the field to be extracted, filter the named entities for each chunk and keep only chunks with a non-empty filtered named-entities list.
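
The per-chunk filtering described above can be sketched as follows. The dictionary layout and the entity type labels ("LOCATION", "PERSON") are placeholders chosen for the example; the actual members of NerXXLEntities and the ChunksWithEntities columns are not listed in this page.

```python
def filter_entities(chunks, compatible_entities):
    """Keep, for each chunk, only entities of a compatible type,
    and drop chunks whose filtered list is empty.

    `chunks` maps a chunk id to a list of (entity_type, text) pairs
    (hypothetical layout).
    """
    kept = {}
    for chunk_id, entities in chunks.items():
        selected = [
            (entity_type, text)
            for entity_type, text in entities
            if entity_type in compatible_entities
        ]
        if selected:  # only chunks with a non-empty filtered list remain
            kept[chunk_id] = selected
    return kept
```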

source class ChunksWithThesaurus()

Bases : PDFChunkDatasetSchema

For each filtered chunk, a list of the identified thesaurus terms.

The list can be empty if no thesaurus term has been identified in the chunk but named entities of the entity types of interest have been. This allows chunks to be kept for the LLM to read when no fuzzy-matched thesaurus term has been identified.

The list represents a set and contains the identifiers of the thesaurus terms.
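
A list that represents a set can be built by removing duplicate identifiers while keeping the first occurrence of each. The sketch below shows one common way to do this in Python; the helper name `as_id_list` is hypothetical, and whether the package preserves order is not stated here.

```python
def as_id_list(matched_ids):
    """Return the identifiers without duplicates, in first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(matched_ids))
```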

source class NamedEntityField()

Bases : NamedTuple

Data for a structured data field with terms identifiable by NER.

Thesaurus values is a frozen function that returns the list of thesaurus terms with their related identifiers.
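
The idea of storing a function rather than the list itself can be sketched with a NamedTuple. The class `ExampleField`, its field names, and the sample terms are hypothetical; the actual fields of NamedEntityField are not listed in this page, and "frozen function" is assumed here to mean a zero-argument callable that is evaluated only when needed.

```python
from typing import Callable, NamedTuple

# A thesaurus is represented here as (identifier, term) pairs (assumed layout).
ThesaurusList = list[tuple[int, str]]


class ExampleField(NamedTuple):
    """Hypothetical stand-in for a field whose terms NER can identify."""

    name: str
    compatible_entities: frozenset[str]
    thesaurus_values: Callable[[], ThesaurusList]


def load_materials() -> ThesaurusList:
    """Placeholder loader; a real one might read a large file or database."""
    return [(1, "ceramica"), (2, "bronzo")]


# The function is stored, not called: the list is built only on demand.
field = ExampleField("material", frozenset({"MATERIAL"}), load_materials)
```

Passing the function instead of its result keeps the tuple cheap to create and copy even when the thesaurus is large.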