archaeo_super_prompt.modeling.entity_extractor.fuzzy_match
Identification of thesaurus values in text chunks using fuzzy matching.
Functions
- extended_expression: Return the extended expression around a given match.
- filter_occurences: Keep the matches whose extended expression still matches the thesaurus value.
- extract_from_content: Expects the wanted entities and the content to be normalized.
- normalize_text: Apply simple normalization to make the comparison easier.
- extract_wanted_entities: Keep only the entities that fuzzy-match the wanted thesaurus.
extended_expression(content: str, match: Match) → str
Return the extended expression around a given match.
Examples
"WE ARE IN PONTEDERA", "PONTE" -> "PONTEDERA" "WE ARE IN AN APPARTEMENT", "PART" -> "APPARTEMENT" "WE ARE IN AN APPARTEMENT", "APPARTEMENT" -> "APPARTEMENT" "I am working for the Soprintendenza Archeologica della Toscana", "Soprintendenza Archeologica della Toscana" -> "Soprintendenza Archeologica della Toscana" "I am working for the Soprintendenza Archeologica della Toscana", "intendenza Archeologica della Toscana" -> "Soprintendenza Archeologica della Toscana"
filter_occurences(content: str, thesaurus_value: str, matches: list[Match]) → list[Match]
Keep the matches whose extended expression still matches the thesaurus value.
For example, if "PART" is detected in the content "WE ARE IN AN APPARTEMENT", the match is excluded, because its extended expression "APPARTEMENT" no longer matches "PART".
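The filtering rule can be sketched as follows. This is an illustration under assumptions, not the module's implementation: similarity is measured with the standard library's `difflib.SequenceMatcher` instead of whatever fuzzy matcher the module uses, the `min_ratio` threshold of 0.8 is arbitrary, and matches are plain `(start, end)` tuples, not `Match` objects.

```python
from difflib import SequenceMatcher


def extend(content: str, start: int, end: int) -> str:
    """Grow [start, end) to whole words, with whitespace as the only delimiter."""
    while start > 0 and not content[start - 1].isspace():
        start -= 1
    while end < len(content) and not content[end].isspace():
        end += 1
    return content[start:end]


def keep_whole_word_matches(content, thesaurus_value, spans, min_ratio=0.8):
    """Keep the spans whose extended expression is still close to the thesaurus value.

    min_ratio is a hypothetical threshold chosen for this sketch.
    """
    kept = []
    for start, end in spans:
        extended = extend(content, start, end)
        ratio = SequenceMatcher(None, thesaurus_value, extended).ratio()
        if ratio >= min_ratio:
            kept.append((start, end))
    return kept


content = "WE ARE IN AN APPARTEMENT"
start = content.index("PART")
# "PART" extends to "APPARTEMENT", which is too far from "PART", so it is dropped.
print(keep_whole_word_matches(content, "PART", [(start, start + 4)]))
```

Comparing against the extended expression, not the raw match, is what prevents a short thesaurus value from being reported inside a longer, unrelated word.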
extract_from_content(content: str, entity_set: list[CompleteEntity], wanted_entities: list[tuple[int, str]]) → set[int] | None
We expect the wanted entities and the content to be normalized.
normalize_text(txt: str) → str
Apply simple normalization to make the comparison easier.
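The exact normalization steps are not listed in this page. As an illustration only, a typical "simple normalization" for fuzzy comparison removes accents, folds case, and collapses whitespace. The function name `normalize_sketch` and the choice of steps are assumptions and may differ from what `normalize_text` does.

```python
import unicodedata


def normalize_sketch(txt: str) -> str:
    """Hypothetical normalization: strip accents, fold case, collapse whitespace."""
    # Decompose accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", txt)
    # Drop the combining marks so that "à" compares equal to "a".
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Fold case, then collapse runs of whitespace into single spaces.
    return " ".join(without_accents.casefold().split())


print(normalize_sketch("  Località   di Pontedera "))
```

Applying the same function to both the chunk content and the thesaurus values keeps the two sides comparable, which is the precondition stated for `extract_from_content`.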
extract_wanted_entities(chunk_contents: Iterator[str], complete_entity_sets: Iterator[list[CompleteEntity]], thesauri_factory: ThesaurusProvider) → Iterator[set[int] | None]
Keep only the entities that fuzzy-match the wanted thesaurus.
Parameters
- chunk_contents (Iterator[str]): for each chunk, its text content
- complete_entity_sets (Iterator[list[CompleteEntity]]): for each text chunk, the entities occurring in it, restricted to one group of entity types
- thesauri_factory (ThesaurusProvider): provides the wanted string values to extract for the same group of entity types
ReturnType
For each text chunk, the set of matched thesaurus values above the given distance threshold. If a chunk has no filtered entity at all, None is returned for that chunk instead of the empty set. The empty set means that the chunk contains entities belonging to the group of entity types of interest, but none of these entities matches the thesaurus.
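Because None and the empty set carry different meanings, callers should test for None explicitly and not rely on truthiness. A minimal sketch of consuming the results, where the function name and the labels are hypothetical:

```python
from typing import Iterable, Optional


def describe_chunk_results(results: Iterable[Optional[set[int]]]) -> list[str]:
    """Label each chunk result, keeping None and the empty set distinct."""
    labels = []
    for matched in results:
        if matched is None:
            # No entity of the relevant entity types was found in the chunk.
            labels.append("no candidate entity")
        elif not matched:
            # Candidate entities exist, but none matches the thesaurus.
            labels.append("candidates without thesaurus match")
        else:
            labels.append("thesaurus match")
    return labels


print(describe_chunk_results([None, set(), {3, 7}]))
```

A plain `if not matched:` check would merge the first two cases and lose the distinction described above.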