{ "cells": [ { "cell_type": "markdown", "id": "f882a68f", "metadata": {}, "source": [ "## Create Regex-based Annotations\n", "\n", "---\n", "\n", "**Prerequisites:**\n", "\n", "- Data Layer concepts of Konfuzio: Project, Document, Annotation, Label, Annotation Set, Label Set\n", "- Regular expressions\n", "\n", "**Difficulty:** Easy\n", "\n", "**Goal:** Learn how to create Annotations based on simple regular expression-based logic.\n", "\n", "---\n", "\n", "### Environment\n", "You need to install the Konfuzio SDK before diving into the tutorial. \\\n", "To get up and running quickly, you can use our Colab Quick Start notebook.\n", "\n", "Alternatively, follow the [installation section](../get_started.html#install-sdk) to install and initialize the Konfuzio SDK locally or in an environment of your choice.\n", "\n", "### Introduction\n", "\n", "In this guide, we'll show you how to use Python and regular expressions (regex) to automatically identify and annotate specific text patterns within a Document.\n", "\n", "### Initialize a Project and define the search term and Label\n", "\n", "Let's say we have a Document, and we want to highlight every instance of the term \"Musterstraße\", which might represent a specific street name or location. Our task is to find this term, label it as \"Lohnart\", and associate it with the \"Brutto-Bezug\" Label Set. The code below takes the first Label Set that the Label belongs to."
] }, { "cell_type": "code", "execution_count": null, "id": "8db0a22f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-cell" ], "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "import logging\n", "from konfuzio_sdk.api import get_project_list\n", "from tests.variables import TEST_SNAPSHOT_ID\n", "\n", "logging.getLogger(\"konfuzio_sdk\").setLevel(logging.ERROR)\n", "projects = get_project_list()\n", "# Reuse the most recent Project restored from the snapshot; creating a new one each time takes longer.\n", "YOUR_PROJECT_ID = next(project['id'] for project in reversed(projects) if TEST_SNAPSHOT_ID in project['name'])" ] }, { "cell_type": "code", "execution_count": null, "id": "9a9042c3", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "import re\n", "from konfuzio_sdk.data import Project, Annotation, Span, AnnotationSet\n", "\n", "my_project = Project(id_=YOUR_PROJECT_ID, update=True)\n", "# The search term; re.finditer interprets it as a regular expression.\n", "input_expression = \"Musterstraße\"\n", "label_name = \"Lohnart\"\n", "\n", "my_label = my_project.get_label_by_name(label_name)\n", "# Use the first Label Set that this Label belongs to.\n", "label_set = my_label.label_sets[0]\n", "print(my_label)\n", "print(label_set)" ] }, { "cell_type": "markdown", "id": "6d9ff21c", "metadata": {}, "source": [ "### Get a Document and find matches of a string in it\n", "\n", "We fetch the first Document in the Project and use `re.finditer` to collect the start and end offsets of every match of the expression in the Document's text."
] }, { "cell_type": "code", "execution_count": null, "id": "1c2efd5b", "metadata": { "editable": true, "slideshow": { "slide_type": "" } }, "outputs": [], "source": [ "document = my_project.documents[0]\n", "\n", "matches_locations = [(m.start(0), m.end(0)) for m in re.finditer(input_expression, document.text)]\n", "print(matches_locations)" ] }, { "cell_type": "markdown", "id": "3402c297", "metadata": {}, "source": [ "### Create the Annotations\n", "\n", "For each match we create an Annotation. Note that an Annotation cannot exist outside an Annotation Set, and every Annotation Set must contain at least one Annotation.\n", "Calling `Annotation.save()` saves each Annotation online. The final `delete(delete_online=True)` call in the loop removes the Annotation again, so omit it if you want to keep the Annotations you create." ] }, { "cell_type": "code", "execution_count": null, "id": "4dd09509", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-output" ] }, "outputs": [], "source": [ "new_annotations_links = []\n", "\n", "for offsets in matches_locations:\n", "    span = Span(start_offset=offsets[0], end_offset=offsets[1])\n", "    annotation_set = AnnotationSet(document=document, label_set=label_set)\n", "    annotation_obj = Annotation(\n", "        document=document,\n", "        annotation_set=annotation_set,\n", "        label_set_id=label_set.id_,\n", "        label=my_label,\n", "        confidence=1.0,\n", "        spans=[span],\n", "        is_correct=True,\n", "    )\n", "\n", "    new_annotation_added = annotation_obj.save(label_set_id=label_set.id_)\n", "    if new_annotation_added:\n", "        new_annotations_links.append(annotation_obj.get_link())\n", "    # if you want to remove the Annotation and ensure it's deleted online, you can use the following:\n", "    annotation_obj.delete(delete_online=True)" ] }, { "cell_type": "markdown", "id": "005ac907", "metadata": {}, "source": [ "### Conclusion\n", "In this tutorial, we have walked through the essential steps for creating regex-based Annotations. Below is the full code to accomplish this task:" ] }, { "cell_type": "code", "execution_count": null, "id": "9c763a16", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "skip-execution", "skip-nbeval" ], "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "import re\n", "from konfuzio_sdk.data import Project, Annotation, Span, AnnotationSet\n", "\n", "my_project = Project(id_=YOUR_PROJECT_ID)\n", "\n", "input_expression = \"Musterstraße\"\n", "label_name = \"Lohnart\"\n", "\n", "my_label = my_project.get_label_by_name(label_name)\n", "\n", "label_set = my_label.label_sets[0]\n", "\n", "document = my_project.documents[0]\n", "\n", "matches_locations = [(m.start(0), m.end(0)) for m in re.finditer(input_expression, document.text)]\n", "\n", "new_annotations_links = []\n", "\n", "for offsets in matches_locations:\n", "    span = Span(start_offset=offsets[0], end_offset=offsets[1])\n", "    annotation_set = AnnotationSet(document=document, label_set=label_set)\n", "    annotation_obj = Annotation(\n", "        document=document,\n", "        annotation_set=annotation_set,\n", "        label_set_id=label_set.id_,\n", "        label=my_label,\n", "        confidence=1.0,\n", "        spans=[span],\n", "        is_correct=True,\n", "    )\n", "\n", "    new_annotation_added = annotation_obj.save(label_set_id=label_set.id_)\n", "    if new_annotation_added:\n", "        new_annotations_links.append(annotation_obj.get_link())\n", "    # if you want to remove the Annotation and ensure it's deleted online, you can use the following:\n", "    annotation_obj.delete(delete_online=True)" ] }, { "cell_type": "markdown", "id": "99d43381", "metadata": {}, "source": [ "### What's next?\n", "\n", "- [Learn how to create Annotations automatically using Extraction AI](https://dev.konfuzio.com/sdk/tutorials/information_extraction/index.html)\n", "- [Get to know how to visualize created Annotations](https://dev.konfuzio.com/sdk/explanations.html#coordinates-system)" ] } ], "metadata": { "kernelspec": { "display_name": 
"Python 3 (ipykernel)", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }