Create Regex-based Annotations
Prerequisites:
Data Layer concepts of Konfuzio: Project, Document, Annotation, Label, Annotation Set, Label Set
Regular expressions
Difficulty: Easy
Goal: Learn how to create Annotations based on simple regular expression-based logic.
Environment
You need to install the Konfuzio SDK before diving into the tutorial.
To get up and running quickly, you can use our Colab Quick Start notebook.
As an alternative, you can follow the installation section to install and initialize the Konfuzio SDK locally or in an environment of your choice.
Introduction
In this guide, we’ll show you how to use Python and regular expressions (regex) to automatically identify and annotate specific text patterns within a Document.
Initialize a Project, define a search term and a Label
Let’s say we have a Document, and we want to highlight every instance of the term “Musterstraße”, which might represent a specific street name or location. Our task is to find this term, label it as “Lohnart”, and associate it with the “Brutto-Bezug” Label Set.
import re
from konfuzio_sdk.data import Project, Annotation, Span, AnnotationSet
my_project = Project(id_=YOUR_PROJECT_ID, update=True)
input_expression = "Musterstraße"
label_name = "Lohnart"
my_label = my_project.get_label_by_name(label_name)
label_set = my_label.label_sets[0]
print(my_label)
print(label_set)
Label: Lohnart
LabelSet: Brutto-Bezug (94032)
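Note that the search term is used as a regular expression pattern, not as a plain string. For "Musterstraße" this makes no difference, but a term containing regex metacharacters would behave differently. The following is a minimal sketch using only the standard library; the term "Brutto (EUR)" and the sample text are hypothetical and serve only to illustrate escaping with re.escape().

```python
import re

# Hypothetical search terms and text, chosen only to illustrate escaping.
plain_term = "Musterstraße"
term_with_metachars = "Brutto (EUR)"
sample_text = "Brutto (EUR) 3.500,00 Musterstraße 1"

# Without escaping, "(" and ")" form a regex group, so the literal text is not found.
unescaped_hits = [m.group(0) for m in re.finditer(term_with_metachars, sample_text)]

# re.escape() makes every metacharacter match itself.
escaped_hits = [m.group(0) for m in re.finditer(re.escape(term_with_metachars), sample_text)]

# A term without metacharacters is found either way.
plain_hits = [m.group(0) for m in re.finditer(re.escape(plain_term), sample_text)]

print(unescaped_hits)
print(escaped_hits)
print(plain_hits)
```

If the term is meant to be matched literally, wrapping it in re.escape() is a safe default; leave it unescaped only when you intend to write a real pattern.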
Get a Document and find matches of a string in it
We fetch the first Document in the Project and collect the start and end character offsets of every match of the expression in the Document's text.
document = my_project.documents[0]
matches_locations = [(m.start(0), m.end(0)) for m in re.finditer(input_expression, document.text)]
print(matches_locations)
[(1590, 1602)]
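The offsets returned by re.finditer are half-open: the start offset is inclusive and the end offset is exclusive, so slicing the text with them returns exactly the matched string. The sketch below shows this on a short sample string; sample_text is a hypothetical stand-in for document.text and is not taken from a real Document.

```python
import re

# Hypothetical stand-in for document.text, used only for illustration.
sample_text = "Anna Beispiel\nMusterstraße 12\n12345 Musterstadt\nMusterstraße 14"
input_expression = "Musterstraße"

matches_locations = [(m.start(0), m.end(0)) for m in re.finditer(input_expression, sample_text)]
print(matches_locations)

# The end offset is exclusive, so slicing recovers exactly the matched text.
for start, end in matches_locations:
    assert sample_text[start:end] == "Musterstraße"
    assert end - start == len("Musterstraße")
```

The same check, sample_text[start:end], is a quick way to confirm that a pair of offsets points at the intended text before it is turned into a Span.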
Create the Annotations
For each match, we create an Annotation. Note that an Annotation cannot exist outside an Annotation Set, and every Annotation Set has to contain at least one Annotation.
Calling Annotation.save() ensures that each Annotation is saved online.
new_annotations_links = []
for offsets in matches_locations:
    # One Span and one new Annotation Set per match.
    span = Span(start_offset=offsets[0], end_offset=offsets[1])
    annotation_set = AnnotationSet(document=document, label_set=label_set)
    annotation_obj = Annotation(
        document=document,
        annotation_set=annotation_set,
        label_set_id=label_set.id_,
        label=my_label,
        confidence=1.0,
        spans=[span],
        is_correct=True,
    )
    # save() returns a truthy value if the Annotation was newly added online.
    new_annotation_added = annotation_obj.save(label_set_id=label_set.id_)
    if new_annotation_added:
        new_annotations_links.append(annotation_obj.get_link())
    # If you want to remove the Annotation and ensure it's deleted online, uncomment the following:
    # annotation_obj.delete(delete_online=True)
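Because saved Annotations are written online, it can be useful to inspect the matches first. The following is a minimal dry-run sketch that uses only the standard library. preview_matches is a hypothetical helper, not part of the Konfuzio SDK, and sample_text again stands in for document.text.

```python
import re


def preview_matches(text, pattern, context=10):
    """Return (start, end, snippet) for each regex match, with surrounding context."""
    previews = []
    for m in re.finditer(pattern, text):
        left = max(0, m.start() - context)
        right = min(len(text), m.end() + context)
        # Mark the matched region with brackets and flatten line breaks for display.
        snippet = text[left:m.start()] + "[" + m.group(0) + "]" + text[m.end():right]
        previews.append((m.start(), m.end(), snippet.replace("\n", " ")))
    return previews


# Hypothetical stand-in for document.text.
sample_text = "Anna Beispiel\nMusterstraße 12\n12345 Musterstadt"
for start, end, snippet in preview_matches(sample_text, "Musterstraße", context=5):
    print(start, end, snippet)
```

Once the previewed snippets look right, the same offsets can be passed to the loop that creates the Spans and Annotations.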
Conclusion
In this tutorial, we have walked through the essential steps for creating regex-based Annotations. Below is the full code to accomplish this task:
import re
from konfuzio_sdk.data import Project, Annotation, Span, AnnotationSet

my_project = Project(id_=YOUR_PROJECT_ID)
input_expression = "Musterstraße"
label_name = "Lohnart"
my_label = my_project.get_label_by_name(label_name)
label_set = my_label.label_sets[0]

document = my_project.documents[0]
matches_locations = [(m.start(0), m.end(0)) for m in re.finditer(input_expression, document.text)]

new_annotations_links = []
for offsets in matches_locations:
    # One Span and one new Annotation Set per match.
    span = Span(start_offset=offsets[0], end_offset=offsets[1])
    annotation_set = AnnotationSet(document=document, label_set=label_set)
    annotation_obj = Annotation(
        document=document,
        annotation_set=annotation_set,
        label_set_id=label_set.id_,
        label=my_label,
        confidence=1.0,
        spans=[span],
        is_correct=True,
    )
    # save() returns a truthy value if the Annotation was newly added online.
    new_annotation_added = annotation_obj.save(label_set_id=label_set.id_)
    if new_annotation_added:
        new_annotations_links.append(annotation_obj.get_link())
    # If you want to remove the Annotation and ensure it's deleted online, uncomment the following:
    # annotation_obj.delete(delete_online=True)