archaeo_super_prompt.modeling.entity_extractor.fuzzy_match

Identification of thesaurus values in text chunks using fuzzy matching.

Functions

extended_expression(content: str, match: Match) -> str

Return the extended expression around a given match.

Examples

"WE ARE IN PONTEDERA", "PONTE" -> "PONTEDERA" "WE ARE IN AN APPARTEMENT", "PART" -> "APPARTEMENT" "WE ARE IN AN APPARTEMENT", "APPARTEMENT" -> "APPARTEMENT" "I am working for the Soprintendenza Archeologica della Toscana", "Soprintendenza Archeologica della Toscana" -> "Soprintendenza Archeologica della Toscana" "I am working for the Soprintendenza Archeologica della Toscana", "intendenza Archeologica della Toscana" -> "Soprintendenza Archeologica della Toscana"

filter_occurences(content: str, thesaurus_value: str, matches: list[Match]) -> list[Match]

Keep only the matches whose extended expression still matches the thesaurus value.

For example, if "PART" is detected in the content "WE ARE IN AN APPARTEMENT", this match is excluded, because its extended expression "APPARTEMENT" no longer matches "PART".
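One way to picture this filter is to widen each match to whole words and compare the result with the thesaurus value again. The sketch below is an assumption-laden illustration: matches are represented as plain `(start, end)` tuples rather than `Match` objects, similarity is measured with the standard library's `difflib.SequenceMatcher`, and the `min_ratio` threshold of 0.8 is invented for the example. The module may use a different metric and threshold.

```python
from difflib import SequenceMatcher


def expand(content: str, start: int, end: int) -> str:
    """Extend [start, end) to the surrounding whitespace-delimited words."""
    while start > 0 and not content[start - 1].isspace():
        start -= 1
    while end < len(content) and not content[end].isspace():
        end += 1
    return content[start:end]


def keep_consistent_spans(
    content: str,
    thesaurus_value: str,
    spans: list[tuple[int, int]],
    min_ratio: float = 0.8,  # assumed threshold, not taken from the module
) -> list[tuple[int, int]]:
    """Keep spans whose expanded text is still similar to the thesaurus value."""
    kept = []
    for start, end in spans:
        expanded = expand(content, start, end)
        ratio = SequenceMatcher(None, expanded, thesaurus_value).ratio()
        if ratio >= min_ratio:
            kept.append((start, end))
    return kept


content = "WE ARE IN AN APPARTEMENT"
start = content.index("PART")
# "PART" expands to "APPARTEMENT", which is too different from "PART".
print(keep_consistent_spans(content, "PART", [(start, start + 4)]))
```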

extract_from_content(content: str, entity_set: list[CompleteEntity], wanted_entities: list[tuple[int, str]]) -> set[int] | None

We expect the wanted entities and the content to be normalized.

normalize_text(txt: str) -> str

Apply simple normalization to make the comparison easier.
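The documentation does not say which normalization steps are applied. As a purely illustrative sketch, a "simple normalization" for fuzzy comparison often strips accents, folds case, and collapses whitespace. The function name `normalize_sketch` and every step in it are assumptions, not the behavior of `normalize_text`.

```python
import re
import unicodedata


def normalize_sketch(txt: str) -> str:
    """Illustrative normalization: strip accents, collapse spaces, fold case."""
    # Decompose accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", txt)
    # Drop the combining marks, keeping only the base letters.
    without_accents = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    # Collapse runs of whitespace and make the comparison case-insensitive.
    return re.sub(r"\s+", " ", without_accents).strip().casefold()


print(normalize_sketch("  Località   San  Genesio "))
```

Whatever steps are chosen, the same normalization has to be applied to both the chunk content and the thesaurus values before they are compared.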

extract_wanted_entities(chunk_contents: Iterator[str], complete_entity_sets: Iterator[list[CompleteEntity]], thesauri_factory: ThesaurusProvider) -> Iterator[set[int] | None]

Keep only the entities that fuzzy-match the wanted thesaurus values.

Parameters

  • chunk_contents : Iterator[str] for each chunk, its text content

  • complete_entity_sets : Iterator[list[CompleteEntity]] for each text chunk, the set of entities occurring in it, restricted to a group of entity types

  • thesauri_factory : ThesaurusProvider a set of wanted string values to be extracted in the same group of entity types

ReturnType

For each text chunk, the set of matched thesaurus entries above the given distance threshold. If a chunk contains no entity from the group of entity types of interest, then None is returned for this chunk instead of the empty set. The empty set means that the chunk contains entities from the group of interest, but none of these entities matches the thesaurus.
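Because None and the empty set carry different meanings, callers should test for None explicitly rather than rely on truthiness. A minimal sketch of how a consumer might tell the three cases apart follows; `describe_results` is a hypothetical helper and the sample values are made up for illustration.

```python
from typing import Iterable, Iterator, Optional


def describe_results(results: Iterable[Optional[set[int]]]) -> Iterator[str]:
    """Label each per-chunk result according to the three documented cases."""
    for matched in results:
        if matched is None:
            # No entity of the wanted group was present in the chunk.
            yield "no entity of interest"
        elif not matched:
            # Entities of the group were present, but none matched the thesaurus.
            yield "entities found, none in thesaurus"
        else:
            yield f"matched: {sorted(matched)}"


for label in describe_results([None, set(), {3, 1}]):
    print(label)
```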