Information Extraction


Prerequisites:

  • Familiarity with OOP principles

  • Understanding of regular expressions

  • Understanding of evaluation measures for machine learning models

  • Data Layer of Konfuzio: Label, Annotation, Label Set, Span, Project, Document, Category

  • AI Layer of Konfuzio: Information Extraction

Difficulty: Medium

Goal: Be able to build and deploy custom models for data extraction.


You need to install the Konfuzio SDK before diving into the tutorial.
To get up and running quickly, you can use our Colab Quick Start notebook.

As an alternative you can follow the installation section to install and initialize the Konfuzio SDK locally or on an environment of your choice.


Information Extraction is the process of obtaining information from the Document’s unstructured text and assigning Labels to it. For example, Labels could be the Name, the Date, the Recipient, or any other field of interest in the Document.

Within Konfuzio, Documents are assigned a Category, which in turn can be associated with one or more Label Sets and therefore with Labels. To be precise, it is Label Sets that are associated with Categories, not the other way around.
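To make these relationships concrete, here is a minimal sketch using plain Python dataclasses. This is illustrative only: the class names mirror the SDK's data layer, but the real SDK objects carry far more state and behavior.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Label:
    name: str

@dataclass
class LabelSet:
    name: str
    labels: List[Label] = field(default_factory=list)

@dataclass
class Category:
    name: str
    label_sets: List[LabelSet] = field(default_factory=list)

    @property
    def labels(self) -> List[Label]:
        # A Category reaches its Labels indirectly, through its Label Sets.
        return [label for ls in self.label_sets for label in ls.labels]

date_label = Label('Date')
payslip_fields = LabelSet('Payslip fields', labels=[date_label])
payslip = Category('Payslip', label_sets=[payslip_fields])
print([label.name for label in payslip.labels])  # ['Date']
```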

In this tutorial we will cover the following topics:

  • How to train a custom Extraction AI that can be used with Konfuzio

  • How to evaluate the performance of a trained Extraction AI model

Train a custom date Extraction AI

This section explains how to create a custom Extraction AI locally, how to save it and upload it to the Konfuzio Server.

Any Extraction AI class should derive from the AbstractExtractionAI class and implement the extract() method. In this tutorial, we demonstrate how to create a simple custom Extraction AI that extracts dates provided in a certain format. Note that to enable Labels’ and Label Sets’ dynamic creation during extraction, you need to have Superuser rights and enable dynamic creation in a Superuser Project.

We start by defining a custom class CustomExtractionAI that inherits from AbstractExtractionAI, containing a single method extract that takes a Document object as input and returns a modified Document.

Inside the extract method, the code first calls the parent method super().extract(). This method call retrieves a virtual Document with no Annotations and changes the Category to the one saved within the Extraction AI.

The code checks if a Label named ‘Date’ already exists in the Labels associated with the Category. It then either uses the existing Label, or creates a new one.

We use a regular expression (r'(\d+/\d+/\d+)') to find matches for dates within the text of the Document. This regular expression looks for patterns of digits separated by forward slashes (e.g., 12/31/2023).
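Outside the SDK, the pattern behaves like this with plain `re`; `Match.span(1)` gives the start and end character offsets that are later used to build Spans:

```python
import re

text = "Issued on 12/31/2023, due 01/15/2024."
# Each match yields the matched date plus its (start, end) character offsets.
matches = [(m.group(1), m.span(1)) for m in re.finditer(r'(\d+/\d+/\d+)', text)]
print(matches)  # [('12/31/2023', (10, 20)), ('01/15/2024', (26, 36))]
```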

For each match found, it creates a Span object representing the start and end offsets of the matched text.

For each match, it then creates an Annotation object linking the Span to the Label and the Document. Note that by default, only Annotations with a confidence higher than 10% are shown in the extracted Document; this threshold can be changed in the Label settings UI.

import re

from konfuzio_sdk.data import Document, Span, Annotation, Label
from konfuzio_sdk.trainer.information_extraction import AbstractExtractionAI

class CustomExtractionAI(AbstractExtractionAI):
    def extract(self, document: Document) -> Document:
        # Retrieve a Virtual Document with no Annotations and set its Category.
        document = super().extract(document)

        label_set = document.category.default_label_set

        # Reuse the 'Date' Label if it already exists, otherwise create it.
        label_name = 'Date'
        if label_name in [label.name for label in document.category.labels]:
            label = document.project.get_label_by_name(label_name)
        else:
            label = Label(text=label_name, project=document.project, label_sets=[label_set])
        annotation_set = document.default_annotation_set
        # Find date-like patterns such as 12/31/2023 and annotate each match.
        for re_match in re.finditer(r'(\d+/\d+/\d+)', document.text, flags=re.MULTILINE):
            span = Span(start_offset=re_match.span(1)[0], end_offset=re_match.span(1)[1])
            _ = Annotation(
                document=document,
                label=label,
                annotation_set=annotation_set,
                label_set=label_set,
                confidence=1.0,
                spans=[span],
            )
        return document

We can now use this custom Extraction AI class. Let’s start with making the necessary imports, initializing the Project, the Category and the AI:

import os
from konfuzio_sdk.data import Project

project = Project(id_=TEST_PROJECT_ID)
category = project.get_category_by_id(TEST_PAYSLIPS_CATEGORY_ID)
categorization_pipeline = CustomExtractionAI(category)

Then, create a sample test Document to run the extraction on.

example_text = """
    19/05/1996 is my birthday.
    04/07/1776 is the Independence day.
"""
sample_document = Document(project=project, text=example_text, category=category)

Run the extraction of a Document and print the extracted Annotations.

extracted = categorization_pipeline.extract(sample_document)
for annotation in extracted.annotations(use_correct=False):
    print(annotation)

Now we can save the AI and check that it is possible to load it afterwards.

pickle_model_path = categorization_pipeline.save()
extraction_pipeline_loaded = CustomExtractionAI.load_model(pickle_model_path)

The custom Extraction AI we just prepared inherits from AbstractExtractionAI, which in turn inherits from BaseModel. BaseModel provides a save method that stores the model as a compressed pickle file, which can be uploaded directly to the Konfuzio Server (see Upload Extraction or Category AI to target instance).
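The same serialize-then-compress pattern can be sketched with the standard library alone. This is a rough stand-in, not the SDK's actual implementation — BaseModel uses its own tooling to produce the compressed pickle; here we use plain pickle plus gzip for illustration.

```python
import gzip
import os
import pickle
import tempfile

class TinyModel:
    """Stand-in for a trained model; holds only a regex pattern."""
    pattern = r'(\d+/\d+/\d+)'

model = TinyModel()
path = os.path.join(tempfile.mkdtemp(), 'tiny_model.pkl.gz')

with gzip.open(path, 'wb') as f:  # pickle and compress in one pass
    pickle.dump(model, f)

with gzip.open(path, 'rb') as f:  # decompress and unpickle to restore
    loaded = pickle.load(f)

print(loaded.pattern)
```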

Activating the uploaded AI on the web interface will enable the custom pipeline on your self-hosted installation.

Note that if you want to create Labels and Label Sets dynamically (when running the AI, instead of adding them manually in the app), you need to enable their creation in the Superuser Project settings, provided you have the corresponding rights.

If you have Superuser rights, it is also possible to upload the AI from your local machine using upload_ai_model(), as described in Upload your AI.

The Paragraph Custom Extraction AI

In the Paragraph Tokenizer tutorial, we saw how we can use the Paragraph Tokenizer in detectron mode and with the create_detectron_labels option to segment a Document and create figure, table, list, text and title Annotations.

Here, we will see how we can use the Paragraph Tokenizer to create a Custom Extraction AI. We will create a simple wrapper around the Paragraph Tokenizer. This shows how you can create your own Custom Extraction AI which can be used in Konfuzio on-prem installations or in the Konfuzio Marketplace.

We define a class that inherits from the Konfuzio AbstractExtractionAI class. This class provides the interface that we need to implement for our Custom Extraction AI. All Extraction AI models must inherit from this class.

We need to define what the model needs to be able to run. This will inform the Konfuzio Server what information needs to be made available to the model before running an extraction. If the model only needs text, you can add requires_text = True to make it explicit, but this is the default behavior. If the model requires Page images, you will need to add requires_images = True. Finally, in our case we also need to add requires_segmentation = True to inform the Server that the model needs the visual segmentation information created by the Paragraph Tokenizer in detectron mode.

We initialize the model by calling the __init__ method of the parent class. The only required argument is the Category the Extraction AI will be used with. The Category is the Konfuzio object that contains all the Labels and Label Sets that the model will use to create Annotations. This means that you need to make sure that the Category object contains all the Labels and Label Sets that you need for your model. In our case, we need the figure, table, list, text and title Labels.

The extract method is the core of the Extraction AI. It takes a Document as input and returns a Document with Annotations. Make sure to do a deepcopy of the Document that is passed so that you add the new Annotations to a Virtual Document with no Annotations. The Annotations are created by the model and added to the Document. In our case, we simply call the Paragraph Tokenizer in detectron mode and with the create_detectron_labels option.
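The deep-copy rule above, in miniature — plain dicts stand in for Documents here to show why the input must stay untouched:

```python
from copy import deepcopy

# A stand-in for a Document: text plus a list of (label, start, end) annotations.
document = {'text': 'Title\n\nBody paragraph.', 'annotations': []}

def extract(doc):
    inference_document = deepcopy(doc)  # annotate a fresh copy, never the input
    inference_document['annotations'].append(('title', 0, 5))
    return inference_document

extracted = extract(document)
print(document['annotations'])   # [] -- the original is untouched
print(extracted['annotations'])  # [('title', 0, 5)]
```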

The check_is_ready method is used to check if the model is ready to be used. It should return True if the model is ready to extract, and False otherwise. Implementing this method is optional, but it is a good practice to make sure that the model is ready to be used.

from konfuzio_sdk.trainer.information_extraction import AbstractExtractionAI
from konfuzio_sdk.tokenizer.paragraph_and_sentence import ParagraphTokenizer
from konfuzio_sdk.data import Category, Document, Project, Label

class ParagraphExtractionAI(AbstractExtractionAI):
    requires_images = True
    requires_text = True
    requires_segmentation = True

    def __init__(self, category: Category = None, *args, **kwargs):
        """Initialize ParagraphExtractionAI."""
        super().__init__(category=category, *args, **kwargs)
        self.tokenizer = ParagraphTokenizer(mode='detectron', create_detectron_labels=True)

    def extract(self, document: Document) -> Document:
        """Infer information from a given Document."""
        inference_document = super().extract(document)
        inference_document = self.tokenizer.tokenize(inference_document)
        return inference_document

    def check_is_ready(self):
        """Check if the ExtractionAI is ready for inference."""
        return True

Now that our custom Extraction AI is ready we can test it. First, we check that the category of interest indeed contains all Labels and create those that do not exist.

project = Project(id_=TEST_PROJECT_ID)
category = project.get_category_by_id(TEST_PAYSLIPS_CATEGORY_ID)

labels = ['figure', 'table', 'list', 'text', 'title']
label_set = project.get_label_set_by_name(category.name)  # the Label Set sharing the Category's name

for label_name in labels:
    try:
        project.get_label_by_name(label_name)
    except IndexError:
        Label(project=project, text=label_name, label_sets=[label_set])

We can now use our custom extraction model to extract data from a Document.

document = project.get_document_by_id(TEST_DOCUMENT_ID)
paragraph_extraction_ai = ParagraphExtractionAI(category=category)

assert paragraph_extraction_ai.check_is_ready() is True

extracted_document = paragraph_extraction_ai.extract(document)

Let’s see all the created Annotations.

[Annotation (None) figure (60, 84), (85, 177), (181, 351), (352, 434), (438, 561), (608, 669), (717, 791), (883, 920), (922, 1027), (1030, 1043), (1047, 1132), (1164, 1167), (1183, 1210), (1226, 1283), (1374, 1412), (1415, 1419), (1465, 1504), (1507, 1587), (1590, 1605), (1608, 1622), (1624, 1636), (1640, 1757), (1758, 1839), (1841, 1883), (1884, 1932), (2009, 2022), (2024, 2119), (2123, 2248), (2249, 2330), (2342, 2380), (2394, 2441), (2514, 2654), (2655, 2740), (2748, 2811), (2889, 2904), (2906, 3012), (3016, 3121), (3123, 3172), (3174, 3221), (3223, 3272), (3274, 3327), (3351, 3400), (3402, 3456), (3474, 3494), (3496, 3538), (3540, 3583), (3589, 3699), (3701, 3785), (3789, 3906), (3909, 4053), (4056, 4180), (4183, 4321), Annotation (None) table (60, 84), (85, 177), (180, 351), (352, 434), (437, 561), (608, 669), (717, 791), (883, 920), (922, 1027), (1030, 1043), (1047, 1132), (1164, 1167), (1183, 1210), (1226, 1283), (1374, 1412), (1415, 1419), (1465, 1504), (1507, 1587), (1590, 1605), (1608, 1622), (1623, 1636), (1639, 1757), (1758, 1839), (1840, 1883), (1884, 1932), (2009, 2022), (2023, 2119), (2122, 2248), (2249, 2330), (2342, 2380), (2394, 2441), (2513, 2654), (2655, 2740), (2748, 2811), (2889, 2904), (2905, 3012), (3015, 3121), (3122, 3172), (3173, 3221), (3222, 3272), (3273, 3327), (3350, 3400), (3401, 3456), (3474, 3494), (3495, 3538), (3539, 3583), (3588, 3699), (3700, 3785), (3788, 3906), (3907, 4053), (4056, 4180), (4181, 4321), Annotation (None) title (1840, 1883), (1884, 1932), Annotation (None) figure (2009, 2022), (2024, 2119), (2123, 2248), (2249, 2330), (2342, 2380), (2394, 2441), (2514, 2654), (2655, 2740), (2748, 2811), (2889, 2904), (2906, 3012), (3016, 3121), (3123, 3172), (3174, 3221), (3223, 3272), (3274, 3327), (3351, 3400), (3402, 3456), (3474, 3494), (3496, 3538), (3540, 3583), (3589, 3699), (3701, 3785), (3789, 3906), (3909, 4053), (4056, 4180), (4183, 4321), Annotation (None) text (2023, 2048)]

We then save the model as a pickle file, so that we can upload it to the Konfuzio Server:

model_path = paragraph_extraction_ai.save()

You can also upload the model to the Konfuzio app or an on-prem setup.

from konfuzio_sdk.api import upload_ai_model

upload_ai_model(model_path=model_path, ai_type='extraction', category_id=category.id_)

Extraction AI Evaluation

This section assumes you have already trained an Extraction AI model and have the pickle file available. If you have not done so, please first complete the training section of this tutorial.

In this example we will see how we can evaluate a trained RFExtractionAI model. The model in the example is trained to extract data from payslip Documents.

Start by loading the model:

from konfuzio_sdk.trainer.information_extraction import RFExtractionAI

pipeline = RFExtractionAI.load_model(MODEL_PATH)

Run the evaluation of the Extraction AI and check the metrics:

evaluation = pipeline.evaluate_full()
print(f"Full evaluation F1 score: {evaluation.f1()}")
print(f"Full evaluation recall: {evaluation.recall()}")
print(f"Full evaluation precision: {evaluation.precision()}")
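As a sanity check on these numbers: F1 is the harmonic mean of precision and recall, so you can recompute it from the other two printed values:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: precision 0.8 and recall 0.6 give an F1 of about 0.686.
print(round(f1_score(0.8, 0.6), 3))  # 0.686
```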

You can also get the evaluation of the Tokenizer alone:

evaluation = pipeline.evaluate_tokenizer()
print(f"Tokenizer evaluation F1 score: {evaluation.tokenizer_f1()}")

It is also possible to get the evaluation of the Label classifier (given perfect tokenization).

evaluation = pipeline.evaluate_clf()
print(f"Label classifier evaluation F1 score: {evaluation.clf_f1()}")

Lastly, you can get the evaluation of the Label Set classifier (given perfect Label classification).

evaluation = pipeline.evaluate_label_set_clf()
print(f"Label Set evaluation F1 score: {evaluation.f1()}")


This tutorial provided a comprehensive guide to building and deploying custom models for data extraction using Konfuzio. We covered various topics, including training a custom Extraction AI, evaluating model performance, and creating a practical example for extracting dates from Documents.

By following the steps outlined in this tutorial, you should now have the knowledge and tools to develop your own custom Extraction AIs tailored to your specific use cases. Additionally, we explored how to save and upload models to the Konfuzio Server for deployment in a real-world setting.

With this newfound understanding, you can continue to explore and enhance your skills in information extraction, enabling you to extract valuable insights from unstructured text data efficiently and effectively.

What’s next?