{ "cells": [ { "cell_type": "markdown", "id": "9a1170b1", "metadata": {}, "source": [ "(info-extraction)=\n", "## Information Extraction\n", "\n", "---\n", "\n", "**Prerequisites:**\n", "- Familiarity with OOP principles\n", "- Understanding of regular expressions.\n", "- Understanding of evaluation measures for machine learning models.\n", "- Data Layer of Konfuzio: Label, Annotation, Label Set, Span, Project, Document, Category\n", "- AI Layer of Konfuzio: Information Extraction\n", "\n", "**Difficulty:** Medium\n", "\n", "**Goal:** Be able to build and deploy custom models for data extraction.\n", "\n", "---\n", "\n", "### Environment\n", "You need to install the Konfuzio SDK before diving into the tutorial. \\\n", "To get up and running quickly, you can use our Colab Quick Start notebook. \\\n", "\"Open\n", "\n", "As an alternative you can follow the [installation section](../get_started.html#install-sdk) to install and initialize the Konfuzio SDK locally or on an environment of your choice.\n", "\n", "### Introduction\n", "Information Extraction is the process of obtaining information from the Document's unstructured text and assigning Labels to it. For example, Labels could be the Name, the Date, the Recipient, or any other field of interest in the Document.\n", "\n", "Within Konfuzio, Documents are assigned a Category, which in turn can be associated to one or more Label Set(s) and therefore to a Label. 
To be precise, it is Label Sets that are associated with Categories, and not the other way around.\n", "\n", "In this tutorial, we will cover the following topics:\n", "- How to train a custom Extraction AI that can be used with Konfuzio\n", "- How to evaluate the performance of a trained Extraction AI model" ] }, { "cell_type": "markdown", "id": "2c93cb95", "metadata": {}, "source": [ "### Train a custom date Extraction AI\n", "This section explains how to create a custom Extraction AI locally, save it, and upload it to the Konfuzio Server.\n", "\n", "Any Extraction AI class should derive from the `AbstractExtractionAI` class and implement the `extract()` method. In this tutorial, we demonstrate how to create a simple custom Extraction AI that extracts dates provided in \n", "a certain format. Note that to enable dynamic creation of Labels and Label Sets during extraction, you need to have Superuser rights and enable _dynamic creation_ in a [Superuser Project](https://help.konfuzio.com/modules/administration/superuserprojects/index.html#create-labels-and-label-sets).\n", "\n", "We start by defining a custom class `CustomExtractionAI` that inherits from `AbstractExtractionAI` and contains a single method, `extract`, which takes a Document object as input and returns a modified Document.\n", "\n", "Inside the `extract` method, the code first calls the parent method `super().extract()`. This call retrieves a virtual Document with no Annotations and changes its Category to the one saved within the Extraction AI.\n", "\n", "The code then checks whether a Label named 'Date' already exists among the Labels associated with the Category, and either reuses the existing Label or creates a new one.\n", "\n", "We use a regular expression (`r'(\\d+/\\d+/\\d+)'`) to find matches for dates within the text of the Document. 
This regular expression looks for patterns of digits separated by forward slashes (e.g., 12/31/2023).\n", "\n", "For each match found, we create a Span object representing the start and end offsets of the matched text.\n", "\n", "We then create an Annotation object that associates the matched Span with the Document and the Label; this is repeated for every match found in the Document. Note that by default, only Annotations with a confidence higher than 10% will be shown in the extracted Document. This threshold can be changed in the Label settings UI." ] }, { "cell_type": "code", "execution_count": null, "id": "41676ab6", "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "from konfuzio_sdk.data import Document, Span, Annotation, Label\n", "from konfuzio_sdk.trainer.information_extraction import AbstractExtractionAI\n", "\n", "class CustomExtractionAI(AbstractExtractionAI):\n", "    def extract(self, document: Document) -> Document:\n", "        # Get a virtual Document with no Annotations, categorized as the AI's Category\n", "        document = super().extract(document)\n", "\n", "        label_set = document.category.default_label_set\n", "\n", "        # Reuse the 'Date' Label if the Category already has it, otherwise create it\n", "        label_name = 'Date'\n", "        if label_name in [label.name for label in document.category.labels]:\n", "            label = document.project.get_label_by_name(label_name)\n", "        else:\n", "            label = Label(text=label_name, project=document.project, label_sets=[label_set])\n", "        annotation_set = document.default_annotation_set\n", "        # Annotate every date-like match of the form digits/digits/digits\n", "        for re_match in re.finditer(r'(\\d+/\\d+/\\d+)', document.text, flags=re.MULTILINE):\n", "            span = Span(start_offset=re_match.span(1)[0], end_offset=re_match.span(1)[1])\n", "\n", "            _ = Annotation(\n", "                document=document,\n", "                label=label,\n", "                annotation_set=annotation_set,\n", "                confidence=1.0,\n", "                spans=[span],\n", "            )\n", "        return document" ] }, { "cell_type": "markdown", "id": "7e612096", "metadata": {}, "source": [ "We can now use this custom Extraction AI class. 
Let's start by making the necessary imports and initializing the Project, the Category, and the AI:" ] }, { "cell_type": "code", "execution_count": null, "id": "c80ecf9d", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# This is necessary to make sure we can import from 'tests'\n", "import sys\n", "sys.path.insert(0, '../../../../')\n", "\n", "from tests.variables import TEST_PROJECT_ID, TEST_PAYSLIPS_CATEGORY_ID, TEST_DOCUMENT_ID" ] }, { "cell_type": "code", "execution_count": null, "id": "06fb7fb6", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project\n", "\n", "project = Project(id_=TEST_PROJECT_ID)\n", "category = project.get_category_by_id(TEST_PAYSLIPS_CATEGORY_ID)\n", "extraction_pipeline = CustomExtractionAI(category)" ] }, { "cell_type": "markdown", "id": "3e475dce", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Then, create a sample test Document to run the extraction on." ] }, { "cell_type": "code", "execution_count": null, "id": "5c4e6b87", "metadata": {}, "outputs": [], "source": [ "example_text = \"\"\"\n", "    19/05/1996 is my birthday.\n", "    04/07/1776 is the Independence Day.\n", "    \"\"\"\n", "sample_document = Document(project=project, text=example_text, category=category)\n", "print(sample_document.text)" ] }, { "cell_type": "markdown", "id": "1e55fcf1", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Run the extraction on the sample Document and print the extracted Annotations." ] }, { "cell_type": "code", "execution_count": null, "id": "b60254f7", "metadata": {}, "outputs": [], "source": [ "extracted = extraction_pipeline.extract(sample_document)\n", "for annotation in extracted.annotations(use_correct=False):\n", "    print(annotation.offset_string)" ] }, { "cell_type": "markdown", "id": "8b417035", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Now we can save the AI and check that it is possible to load it afterwards." ] }, { "cell_type": "code", "execution_count": null, "id": "78df5c26", "metadata": {}, "outputs": [], "source": [ "pickle_model_path = extraction_pipeline.save()\n", "extraction_pipeline_loaded = CustomExtractionAI.load_model(pickle_model_path)" ] }, { "cell_type": "markdown", "id": "dddde5cc", "metadata": {}, "source": [ "The custom Extraction AI we just prepared inherits from `AbstractExtractionAI`, which in turn inherits from [BaseModel](sourcecode.html#base-model). `BaseModel` provides a `save` method that saves a model into a compressed pickle file that can be directly uploaded to the Konfuzio Server (see [Upload Extraction or Category AI to target instance](https://help.konfuzio.com/tutorials/migrate-trained-ai-to-an-new-project-to-annotate-documents-faster/index.html#upload-extraction-or-category-ai-to-target-instance)).\n", "\n", "Activating the uploaded AI on the web interface will enable the custom pipeline on your self-hosted installation.\n", "\n", "Note that if you want to create Labels and Label Sets dynamically (when running the AI, instead of adding them manually\n", "in the app), you need to enable their creation in the Superuser Project settings, provided you have the corresponding rights.\n", "\n", "If you have Superuser rights, it is also possible to upload the AI from your local machine using \n", "`upload_ai_model()`, as described in [Upload your AI](https://dev.konfuzio.com/sdk/tutorials/upload-your-ai/index.html)."
] }, { "cell_type": "markdown", "id": "bdfdbe18", "metadata": {}, "source": [ "### The Paragraph Custom Extraction AI\n", "In [the Paragraph Tokenizer tutorial](https://dev.konfuzio.com/sdk/tutorials/tokenizers/index.html#paragraph-tokenization), we saw how we can use the Paragraph Tokenizer in `detectron` mode and with the `create_detectron_labels` option to segment a Document and create `figure`, `table`, `list`, `text` and `title` Annotations.\n", "\n", "Here, we will see how we can use the Paragraph Tokenizer to create a Custom Extraction AI. We will create a simple wrapper around the Paragraph Tokenizer. This shows how you can create your own Custom Extraction AI which \n", "can be used in Konfuzio on-prem installations or in the [Konfuzio Marketplace](https://help.konfuzio.com/modules/marketplace/index.html)." ] }, { "cell_type": "markdown", "id": "bc727a52", "metadata": {}, "source": [ "We define a class that inherits from the Konfuzio `AbstractExtractionAI` class. This class provides the interface that we need to implement for our Custom Extraction AI. All Extraction AI models must inherit from this class.\n", "\n", "We need to define what the model needs to be able to run. This will inform the Konfuzio Server what information needs to be made available to the model before running an extraction. If the model only needs text, you can add `requires_text = True` to make it explicit, but this is the default behavior. If the model requires Page images, you will need to add `requires_images = True`. Finally, in our case we also need to add `requires_segmentation = True` to inform the Server that the model needs the visual segmentation information created by the Paragraph Tokenizer in `detectron` mode.\n", "\n", "We initialize the model by calling the `__init__` method of the parent class. The only required argument is the Category the Extraction AI will be used with. 
The Category is the Konfuzio object that contains all the Labels \n", "and Label Sets that the model will use to create Annotations, so you need to make sure that the Category object contains every Label and Label Set your model requires. In our case, we need the `figure`, `table`, `list`, `text` and `title` Labels.\n", "\n", "The `extract` method is the core of the Extraction AI. It takes a Document as input and returns a Document with Annotations. Make sure to work on a `deepcopy` of the Document that is passed in, so that the new Annotations are added to a \n", "virtual Document with no Annotations; calling `super().extract()` takes care of this. In our case, we then simply call the Paragraph Tokenizer in `detectron` mode with the `create_detectron_labels` option.\n", "\n", "The `check_is_ready` method checks whether the model is ready for inference. It should return `True` if the model is ready to extract, and `False` otherwise. Implementing this method is optional, but it is good practice to verify that all required resources are available before running an extraction."
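, "\n",
"\n",
"To make the `requires_*` contract more concrete, here is a small, purely hypothetical sketch (the names `DummyExtractionAI` and `inputs_needed` are invented for illustration and are not part of the SDK or the Server) of how a runner could inspect these class attributes to decide which inputs to prepare:\n",
"\n",
"```python\n",
"class DummyExtractionAI:\n",
"    # Flags mirroring the class attributes discussed above\n",
"    requires_text = True\n",
"    requires_images = True\n",
"    requires_segmentation = True\n",
"\n",
"def inputs_needed(model_cls):\n",
"    # Collect the inputs a model declares via its requires_* flags\n",
"    needed = []\n",
"    if getattr(model_cls, 'requires_text', True):  # text is the default requirement\n",
"        needed.append('text')\n",
"    if getattr(model_cls, 'requires_images', False):\n",
"        needed.append('page_images')\n",
"    if getattr(model_cls, 'requires_segmentation', False):\n",
"        needed.append('segmentation')\n",
"    return needed\n",
"\n",
"print(inputs_needed(DummyExtractionAI))  # ['text', 'page_images', 'segmentation']\n",
"```\n"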
] }, { "cell_type": "code", "execution_count": null, "id": "3de39cf8", "metadata": {}, "outputs": [], "source": [ "from konfuzio_sdk.trainer.information_extraction import AbstractExtractionAI\n", "from konfuzio_sdk.tokenizer.paragraph_and_sentence import ParagraphTokenizer\n", "from konfuzio_sdk.data import Category, Document, Project, Label\n", "\n", "class ParagraphExtractionAI(AbstractExtractionAI):\n", " requires_images = True\n", " requires_text = True\n", " requires_segmentation = True\n", "\n", " def __init__(self, category: Category = None, *args, **kwargs,):\n", " \"\"\"Initialize ParagraphExtractionAI.\"\"\"\n", " super().__init__(category=category, *args, **kwargs)\n", " self.tokenizer = ParagraphTokenizer(mode='detectron', create_detectron_labels=True) \n", "\n", "\n", " def extract(self, document: Document) -> Document:\n", " \"\"\"\n", " Infer information from a given Document.\n", " \"\"\"\n", " inference_document = super().extract(document)\n", " inference_document = self.tokenizer.tokenize(inference_document)\n", "\n", " return inference_document\n", "\n", " def check_is_ready(self):\n", " \"\"\"\n", " Check if the ExtractionAI is ready for the inference.\n", " \"\"\"\n", " super().check_is_ready()\n", "\n", " self.project.get_label_by_name('figure')\n", " self.project.get_label_by_name('table')\n", " self.project.get_label_by_name('list')\n", " self.project.get_label_by_name('text')\n", " self.project.get_label_by_name('title')\n", "\n", " return True " ] }, { "cell_type": "markdown", "id": "53958654", "metadata": {}, "source": [ "Now that our custom Extraction AI is ready we can test it. First, we check that the category of interest indeed contains all Labels and create those that do not exist." 
] }, { "cell_type": "code", "execution_count": null, "id": "5d23212e", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "project = Project(id_=TEST_PROJECT_ID)\n", "category = project.get_category_by_id(TEST_PAYSLIPS_CATEGORY_ID)\n", "\n", "labels = ['figure', 'table', 'list', 'text', 'title']\n", "label_set = project.get_label_set_by_name(category.name) \n", "\n", "for label_name in labels:\n", " try:\n", " project.get_label_by_name(label_name)\n", " except IndexError:\n", " Label(project=project, text=label_name, label_sets=[label_set])" ] }, { "cell_type": "markdown", "id": "b3b90478", "metadata": {}, "source": [ "We can now use our custom extraction model to extract data from a Document. " ] }, { "cell_type": "code", "execution_count": null, "id": "f3272260", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "document = project.get_document_by_id(TEST_DOCUMENT_ID)\n", "paragraph_extraction_ai = ParagraphExtractionAI(category=category)\n", "\n", "assert paragraph_extraction_ai.check_is_ready() is True\n", "\n", "extracted_document = paragraph_extraction_ai.extract(document)" ] }, { "cell_type": "markdown", "id": "5466e914", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let's see all the created Annotations." ] }, { "cell_type": "code", "execution_count": null, "id": "a1814641", "metadata": {}, "outputs": [], "source": [ "print(extracted_document.annotations(use_correct=False)) " ] }, { "cell_type": "markdown", "id": "14b1cd9a", "metadata": {}, "source": [ "We then save the model as a pickle file, so that we can upload it to the Konfuzio Server:" ] }, { "cell_type": "code", "execution_count": null, "id": "82bf0872", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "model_path = paragraph_extraction_ai.save()" ] }, { "cell_type": "markdown", "id": "15963f67", "metadata": {}, "source": [ "You can also upload the model to the Konfuzio app or an on-prem setup." 
] }, { "cell_type": "code", "execution_count": null, "id": "7506d6c3", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "from konfuzio_sdk.api import upload_ai_model\n", "\n", "upload_ai_model(model_path=model_path, ai_type='extraction', category_id=category.id_)" ] }, { "cell_type": "markdown", "id": "833abd20", "metadata": {}, "source": [ "### Extraction AI Evaluation\n", "\n", "This section assumes you have already trained an Extraction AI model and have the pickle file available. If you have not done so, please first complete [this](/sdk/tutorials/rf-extraction-ai/) tutorial.\n", "\n", "In this example we will see how we can evaluate a trained `RFExtractionAI` model. The model in the example is trained to extract data from payslip Documents. \n", "\n", "Start by loading the model:" ] }, { "cell_type": "code", "execution_count": null, "id": "78543cb8", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "from konfuzio_sdk.trainer.information_extraction import RFExtractionAI\n", "\n", "pipeline = RFExtractionAI.load_model(MODEL_PATH)" ] }, { "cell_type": "markdown", "id": "9af3e4e6", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Run the evaluation of the Extraction AI and check the metrics:" ] }, { "cell_type": "code", "execution_count": null, "id": "736eed6f", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "evaluation = pipeline.evaluate_full()\n", "print(f\"Full evaluation F1 score: {evaluation.f1()}\")\n", "print(f\"Full evaluation recall: {evaluation.recall()}\")\n", "print(f\"Full evaluation precision: {evaluation.precision()}\")" ] }, { "cell_type": "markdown", "id": "45391815", "metadata": { "lines_to_next_cell": 0 }, "source": [ "You can also get the evaluation of the Tokenizer alone:" ] }, { "cell_type": "code", "execution_count": null, "id": "e16d6985", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], 
"source": [ "evaluation = pipeline.evaluate_tokenizer()\n", "print(f\"Tokenizer evaluation F1 score: {evaluation.tokenizer_f1()}\")" ] }, { "cell_type": "markdown", "id": "9e529a3a", "metadata": { "lines_to_next_cell": 0 }, "source": [ "It is also possible to get the evaluation of the Label classifier (given perfect tokenization)." ] }, { "cell_type": "code", "execution_count": null, "id": "8b182c3f", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "evaluation = pipeline.evaluate_clf()\n", "print(f\"Label classifier evaluation F1 score: {evaluation.clf_f1()}\")" ] }, { "cell_type": "markdown", "id": "2b305d4e", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Lastly, you can get the evaluation of the LabelSet (given perfect Label classification)." ] }, { "cell_type": "code", "execution_count": null, "id": "d6a4b124", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "evaluation = pipeline.evaluate_clf()\n", "print(f\"Label Set evaluation F1 score: {evaluation.f1()}\")" ] }, { "cell_type": "markdown", "id": "e0e3f574", "metadata": {}, "source": [ "### Conclusion\n", "This tutorial provided a comprehensive guide to building and deploying custom models for data extraction using Konfuzio. We covered various topics, including training a custom Extraction AI, evaluating model performance, and creating a practical example for extracting dates from Documents.\n", "\n", "By following the steps outlined in this tutorial, you should now have the knowledge and tools to develop your own custom Extraction AIs tailored to your specific use cases. Additionally, we explored how to save and upload models to the Konfuzio Server for deployment in a real-world setting.\n", "\n", "With this newfound understanding, you can continue to explore and enhance your skills in information extraction, enabling you to extract valuable insights from unstructured text data efficiently and effectively. 
" ] }, { "cell_type": "markdown", "id": "6d73c53c", "metadata": {}, "source": [ "### What's next?\n", "- Learn how to upload your custom extraction model\n", "- Pull Documents Uploaded Asynchronously with a Webhook" ] } ], "metadata": { "kernelspec": { "display_name": "konfuzio", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }