archaeo_super_prompt.modeling.entity_extractor

source package archaeo_super_prompt.modeling.entity_extractor

Root of the module for running inference with the NER model.

The purpose of this model is to extract hints about chunks, which help the final LLM extract named values for some fields.

Classes

  • NerModel Transformer adding identified named-entity recognition features for each chunk.

  • NeSelector Filter of chunks according to wanted strings among the entities.

  • ChunksWithThesaurus For each filtered chunk, a list of the identified thesaurus terms.

  • NamedEntityField Data for a structured data field with terms identifiable by NER.

source class NerModel(allowed_ner_confidence=0.7)

Bases : BaseTransformer

Transformer adding identified named-entity recognition features for each chunk.

Instantiate the Named Entity Recognition model.

Environment variables

The NER_MODEL_HOST_URL env var must be set with the base url of the remote model for the named entity recognition (e.g. 'http://localhost:8004')
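A minimal sketch of checking this variable before the model is instantiated, using only the standard library. The helper name `ner_model_base_url` is hypothetical and is not part of the package; how NerModel itself reads the variable is not shown on this page.

```python
import os


def ner_model_base_url() -> str:
    """Return the NER service base URL from the environment (hypothetical helper)."""
    url = os.environ.get("NER_MODEL_HOST_URL")
    if not url:
        # Fail early with an explicit message rather than at the first remote call.
        raise RuntimeError(
            "NER_MODEL_HOST_URL must be set, e.g. 'http://localhost:8004'"
        )
    # Strip a trailing slash so that request paths can be appended consistently.
    return url.rstrip("/")
```

Calling such a check at start-up surfaces a missing configuration before any document is processed.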

Methods

source method NerModel.transform(X: PDFChunkDataset) -> DataFrame[EntitiesPerChunkSchema]

source class NeSelector(field_name: str, compatible_entities: set[NerXXLEntities], wanted_matches: ThesaurusProvider, keep_chunks_without_identified_values=False)

Bases : BaseTransformer

Filter of chunks according to wanted strings among the entities.

Initialize the Named Entity Selector from the data about the field.

Parameters

  • field_name : str a label describing the entities to be extracted

  • compatible_entities : set[NerXXLEntities] a set of entity types to consider for selecting the chunks

  • wanted_matches : ThesaurusProvider a frozen function giving at runtime the list of matches (can be huge)

  • keep_chunks_without_identified_values : bool if True, chunks containing entities of the desired entity types are always kept, even if no thesaurus term has been identified among these entities. If False, these chunks are kept only when no chunk contains an identified thesaurus term.

Returns

  • A Transformer that selects only the chunks in which thesaurus terms occur.

Methods

  • transform Filter the identified named entities and filter the chunks.

source method NeSelector.transform(X: DataFrame[ChunksWithEntities]) -> DataFrame[ChunksWithThesaurus]

Filter the identified named entities and filter the chunks.

According to the information about the field to be extracted, filter the named entities for each chunk and keep only chunks with a non-empty filtered named-entities list.
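The selection rule, including the effect of keep_chunks_without_identified_values, can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: `Chunk` and `select_chunks` are hypothetical names, entity types are plain strings instead of NerXXLEntities members, and matching is exact and case-insensitive where the real selector may use fuzzy matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for one row of the chunk dataset."""
    text: str
    entities: tuple  # (entity_type, surface_text) pairs


def select_chunks(chunks, compatible_entities, thesaurus,
                  keep_chunks_without_identified_values=False):
    """Return (chunk, matched_identifiers) pairs.

    thesaurus is a list of (identifier, term) pairs (assumed shape).
    """
    term_to_id = {term.lower(): ident for ident, term in thesaurus}
    candidates = []
    for chunk in chunks:
        # Keep only the entities whose type is in the group of interest.
        kept = [text for etype, text in chunk.entities
                if etype in compatible_entities]
        if not kept:
            continue  # no compatible entity at all: the chunk is dropped
        ids = sorted({term_to_id[t.lower()] for t in kept
                      if t.lower() in term_to_id})
        candidates.append((chunk, ids))
    matched = [(chunk, ids) for chunk, ids in candidates if ids]
    # Chunks without a match survive if the flag is set, or as a fallback
    # when no chunk at all contains an identified thesaurus term.
    if keep_chunks_without_identified_values or not matched:
        return candidates
    return matched
```

With the flag left at False, chunks with an empty match list appear in the output only when nothing matched anywhere, so the LLM still receives some candidate chunks to read.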

source class ChunksWithThesaurus()

Bases : PDFChunkDatasetSchema

For each filtered chunk, a list of the identified thesaurus terms.

The list can be empty if no thesaurus term has been identified in the chunk but named entities of the entity types of interest have been identified. This enables such chunks to be kept for the LLM to read when no fuzzy-matched thesaurus term has been identified.

The list represents a set and contains the identifiers of the thesaurus terms.

source class NamedEntityField()

Bases : NamedTuple

Data for a structured data field with terms identifiable by NER.

Thesaurus values is a frozen function which gives the list of thesaurus terms with their related identifiers.
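One way to picture a "frozen function" is a zero-argument callable that defers building a potentially large list until it is first needed. The sketch below is an assumption-laden illustration: the class name, its field names, the provider signature, and the sample values are all hypothetical, since the real fields of NamedEntityField and the definition of ThesaurusProvider are not shown on this page.

```python
from functools import lru_cache
from typing import Callable, NamedTuple

# Assumed shape: a zero-argument callable returning (identifier, term) pairs.
ThesaurusProviderSketch = Callable[[], list]


class NamedEntityFieldSketch(NamedTuple):
    """Hypothetical stand-in for NamedEntityField."""
    name: str
    compatible_entities: frozenset
    thesaurus_values: ThesaurusProviderSketch


@lru_cache(maxsize=1)
def load_municipalities():
    """Build the (possibly large) list once, on the first call only."""
    return [(1, "Pisa"), (2, "Lucca")]  # illustrative values


field = NamedEntityFieldSketch(
    name="municipality",
    compatible_entities=frozenset({"LOCATION"}),
    thesaurus_values=load_municipalities,  # passed uncalled, evaluated later
)
```

Passing the callable rather than the list keeps the field description cheap to construct; the list is materialized only when `field.thesaurus_values()` is called.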