archaeo_super_prompt.modeling.entity_extractor
source package archaeo_super_prompt.modeling.entity_extractor
Root of the module for running inference with the NER model.
The purpose of this model is to extract hints about chunks that help the final LLM extract named values for certain fields.
Classes
- NerModel — Transformer adding identified named-entity recognition features for each chunk.
- NeSelector — Filter of chunks according to wanted strings among the entities.
- ChunksWithThesaurus — For each filtered chunk, a list of the identified thesaurus terms.
- NamedEntityField — Data for a structured data field with terms identifiable by NER.
source class NerModel(allowed_ner_confidence=0.7)
Bases : BaseTransformer
Transformer adding identified named-entity recognition features for each chunk.
Instantiate the Named Entity Recognition model.
Environment variables
The NER_MODEL_HOST_URL environment variable must be set to the base URL of the remote named entity recognition model (e.g. 'http://localhost:8004').
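The requirement above can be checked early, before any chunk is processed. The following is a minimal sketch, not the package's own code: the helper name `ner_model_base_url` is hypothetical, and only the variable name `NER_MODEL_HOST_URL` comes from this documentation.

```python
import os


def ner_model_base_url() -> str:
    """Return the NER model base URL, or fail with an explicit message.

    Hypothetical helper: it only reads the NER_MODEL_HOST_URL variable
    described above and strips a trailing slash, if any.
    """
    url = os.environ.get("NER_MODEL_HOST_URL", "").strip()
    if not url:
        raise RuntimeError(
            "NER_MODEL_HOST_URL must be set to the base URL of the remote "
            "NER model, e.g. 'http://localhost:8004'"
        )
    return url.rstrip("/")
```

Calling such a helper at start-up makes a missing variable fail fast instead of surfacing as a connection error in the middle of a run.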
Methods
source method NerModel.transform(X: PDFChunkDataset) → DataFrame[EntitiesPerChunkSchema]
source class NeSelector(field_name: str, compatible_entities: set[NerXXLEntities], wanted_matches: ThesaurusProvider, keep_chunks_without_identified_values=False)
Bases : BaseTransformer
Filter of chunks according to wanted strings among the entities.
Initialize the Named Entity Selector from the data about the field.
Parameters
- field_name : str — a label describing the entities to be extracted
- compatible_entities : set[NerXXLEntities] — the set of entity types to consider when selecting the chunks
- wanted_matches : ThesaurusProvider — a frozen function returning, at runtime, the list of wanted matches (which can be huge)
- keep_chunks_without_identified_values — if True, chunks with entities in the desired group of entity types are always kept, even if no thesaurus term has been identified among these entities. If False, these chunks are kept only if there is no chunk in which a thesaurus term has been identified.
Returns
- A Transformer that selects only the chunks in which named thesaurus terms occur.
Methods
- transform — Filter the identified named entities and filter the chunks.
source method NeSelector.transform(X: DataFrame[ChunksWithEntities]) → DataFrame[ChunksWithThesaurus]
Filter the identified named entities and filter the chunks.
According to the information about the field to be extracted, filter the named entities for each chunk and keep only chunks with a non-empty filtered named-entities list.
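The selection rule described above, together with the `keep_chunks_without_identified_values` flag, can be sketched in plain Python. This is an illustration under stated assumptions, not the actual implementation: chunks are modelled as a dict rather than a DataFrame, entity types are plain strings instead of `NerXXLEntities`, the thesaurus is assumed to be a list of `(identifier, label)` pairs, and exact case-insensitive comparison stands in for the fuzzy matching mentioned in this documentation. All names (`Entity`, `select_chunks`) are hypothetical.

```python
from typing import Callable, NamedTuple


class Entity(NamedTuple):
    """Hypothetical stand-in for one named entity found in a chunk."""
    entity_type: str
    text: str


def select_chunks(
    chunks: dict[int, list[Entity]],
    compatible_types: set[str],
    wanted_matches: Callable[[], list[tuple[int, str]]],
    keep_without_matches: bool = False,
) -> dict[int, list[int]]:
    """Map each kept chunk id to the sorted ids of its thesaurus terms."""
    # Assumption: the provider yields (identifier, label) pairs.
    terms = {label.lower(): term_id for term_id, label in wanted_matches()}

    candidates: dict[int, list[int]] = {}
    for chunk_id, entities in chunks.items():
        # Keep only entities whose type belongs to the group of interest.
        relevant = [e for e in entities if e.entity_type in compatible_types]
        if not relevant:
            continue  # empty filtered entity list: the chunk is dropped
        found = {terms[e.text.lower()] for e in relevant if e.text.lower() in terms}
        candidates[chunk_id] = sorted(found)

    if keep_without_matches:
        return candidates
    # Otherwise, chunks without matches survive only when no chunk matched.
    matched = {cid: ids for cid, ids in candidates.items() if ids}
    return matched if matched else candidates
```

With this sketch, a chunk whose only entities are of an incompatible type is always dropped, while a chunk with a compatible entity but no thesaurus match is returned with an empty list either when the flag is set or when no other chunk produced a match.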
source class ChunksWithThesaurus()
Bases : PDFChunkDatasetSchema
For each filtered chunk, a list of the identified thesaurus terms.
The list can be empty if no thesaurus term has been identified in the chunk but named entities in the type group of interest have been identified. This allows such chunks to be kept and read by the LLM when no fuzzy-matched thesaurus term has been identified.
The list represents a set and contains the identifiers of the thesaurus terms.
source class NamedEntityField()
Bases : NamedTuple
Data for a structured data field with terms identifiable by NER.
Thesaurus values is a frozen function that returns the list of thesaurus terms with their related identifiers.