Document Information Extraction

Information Extraction is the process of obtaining information from a Document’s unstructured text and labelling it with Labels such as Name, Date, Recipient, or any other custom Labels. The result of Extraction looks like this:

../../../_images/example.png

Information Extraction always happens at the Category level; that is, each Extraction AI operates within a single Category.
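Conceptually, the output of Information Extraction can be thought of as a list of labeled text spans with character offsets. The sketch below is a plain-Python illustration of that idea; the dicts and values are illustrative stand-ins, not Konfuzio SDK objects:

```python
# Hypothetical, simplified view of an extraction result: each entry pairs
# a Label name with the matched text and its character offsets in the Document.
extraction_result = [
    {"label": "Name", "text": "Jane Doe", "start_offset": 10, "end_offset": 18},
    {"label": "Date", "text": "19/05/1996", "start_offset": 33, "end_offset": 43},
]

# Group the extracted span texts by Label name.
by_label = {}
for entry in extraction_result:
    by_label.setdefault(entry["label"], []).append(entry["text"])
```

The SDK represents the same information with Annotation, Span, and Label objects, as shown in the tutorial below.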

Train a custom Extraction AI

This section explains how to create a custom Extraction AI locally, save it, and upload it to the Konfuzio Server.

Note: you don’t necessarily need to create the AI from scratch if you already have a document-processing architecture. You just need to wrap it in a class that conforms to our Extraction AI structure. Follow the steps in this tutorial to find out what the requirements are.

To prepare the data for training or testing your AI, you can follow the data preparation tutorial.

By default, any Extraction AI class should derive from the AbstractExtractionAI class and implement the extract() method. In this tutorial, we’ll demonstrate how to create a simple custom Extraction AI that extracts dates provided in a certain format. Note that to enable dynamic creation of Labels and Label Sets during extraction, you need Superuser rights and must enable this setting in a Superuser Project.
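Before wrapping it into an Extraction AI, it helps to see how the date regex from this tutorial behaves on plain text. A quick standalone check, using only the standard library:

```python
import re

text = "19/05/1996 is my birthday. 04/07/1776 is the Independence day."

# The same pattern the extract() method below uses: digit groups separated by slashes.
matches = list(re.finditer(r'(\d+/\d+/\d+)', text))

dates = [m.group(1) for m in matches]
offsets = [m.span(1) for m in matches]
# dates   -> ['19/05/1996', '04/07/1776']
# offsets -> [(0, 10), (27, 37)]
```

The start and end offsets of each match are exactly what the Span objects in the Extraction AI are built from.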

import re

from konfuzio_sdk.data import Document, Span, Annotation, Label
from konfuzio_sdk.trainer.information_extraction import AbstractExtractionAI

class CustomExtractionAI(AbstractExtractionAI):
    def extract(self, document: Document) -> Document:
        """Extract regex matches for dates."""
        # call the parent method to get a Virtual Document with no Annotations and with the Category changed to the
        # one saved within the Extraction AI
        document = super().extract(document)

        # define a Label Set that will contain Labels for the Annotations your Extraction AI extracts;
        # here we use the default Label Set of the Category
        label_set = document.category.default_label_set
        # get or create a Label that will be used for annotating
        label_name = 'Date'
        if label_name in [label.name for label in document.category.labels]:
            label = document.project.get_label_by_name(label_name)
        else:
            label = Label(text=label_name, project=document.project, label_sets=[label_set])
        # every Annotation has to be part of an Annotation Set, and every Annotation Set
        # has to contain at least one Annotation; here we reuse the Document's default one
        annotation_set = document.default_annotation_set
        for re_match in re.finditer(r'(\d+/\d+/\d+)', document.text, flags=re.MULTILINE):
            span = Span(start_offset=re_match.span(1)[0], end_offset=re_match.span(1)[1])
            _ = Annotation(
                document=document,
                label=label,
                annotation_set=annotation_set,
                confidence=1.0,  # note that by default, only Annotations with confidence higher than 10%
                # are shown in the extracted Document; this can be changed in the Label settings UI
                spans=[span],
            )
        return document
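The confidence comment above is worth illustrating: the display threshold simply filters out low-confidence Annotations. A plain-Python sketch of that behaviour (the dicts below are stand-ins for Annotation objects, not SDK types):

```python
# Stand-in Annotations with varying confidence scores.
annotations = [
    {"label": "Date", "confidence": 1.0},
    {"label": "Date", "confidence": 0.05},   # below the default 10% threshold
    {"label": "Date", "confidence": 0.42},
]

DISPLAY_THRESHOLD = 0.10  # default threshold; adjustable per Label in the settings UI

# Only Annotations above the threshold would be shown in the extracted Document.
visible = [a for a in annotations if a["confidence"] > DISPLAY_THRESHOLD]
```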

Example usage of your Custom Extraction AI:

import os
from konfuzio_sdk.data import Project

# Initialize Project and provide the AI training and test data
project = Project(id_=YOUR_PROJECT_ID)  # see https://dev.konfuzio.com/sdk/get_started.html#example-usage
category = project.get_category_by_id(YOUR_CATEGORY_ID)
extraction_pipeline = CustomExtractionAI(category)

# Create a sample test Document to run extraction on
example_text = """
    19/05/1996 is my birthday.
    04/07/1776 is the Independence day.
    """
sample_document = Document(project=project, text=example_text, category=category)

# Extract a Document
extracted = extraction_pipeline.extract(sample_document)
# we set use_correct=False because we didn't change the default flag is_correct=False upon creating the Annotations
assert len(extracted.annotations(use_correct=False)) == 2

# Save and load a pickle file for the model
pickle_model_path = extraction_pipeline.save()
extraction_pipeline_loaded = CustomExtractionAI.load_model(pickle_model_path)

The custom AI inherits from AbstractExtractionAI, which in turn inherits from BaseModel, as shown in the diagram below:

graph LR; BaseModel[BaseModel] --> AbstractExtractionAI[AbstractExtractionAI] --> CustomExtractionAI[CustomExtractionAI];

BaseModel provides the save() method, which saves a model into a compressed pickle file that can be uploaded directly to the Konfuzio Server (see Upload Extraction or Category AI to target instance).

Activating the uploaded AI on the web interface will enable the custom pipeline on your self-hosted installation.

Note that if you want to create Labels and Label Sets dynamically (when running the AI, instead of adding them manually in the app), you need to enable creating them in the Superuser Project settings, provided you have the corresponding rights.

If you have Superuser rights, it is also possible to upload the AI from your local machine using upload_ai_model(), as described in Upload your AI.

Example of Custom Extraction AI: The Paragraph Extraction AI

In the Paragraph Tokenizer tutorial, we saw how to use the Paragraph Tokenizer in detectron mode with the create_detectron_labels option to segment a Document and create figure, table, list, text and title Annotations. Used this way, the tokenizer creates Annotations like the following:

../../../_images/paragraph_tokenizer.png

Here we will see how to use the Paragraph Tokenizer to create a Custom Extraction AI. All we need is a simple wrapper around the Paragraph Tokenizer. It shows how you can create your own Custom Extraction AI that you can use in Konfuzio on-prem installations or on the Konfuzio Marketplace.

Full Paragraph Extraction AI code

import logging

from konfuzio_sdk.data import Category, Document
from konfuzio_sdk.tokenizer.paragraph_and_sentence import ParagraphTokenizer
from konfuzio_sdk.trainer.information_extraction import AbstractExtractionAI

logger = logging.getLogger(__name__)


class ParagraphExtractionAI(AbstractExtractionAI):
    """Extract and label text regions using Detectron2."""

    # start model requirements
    requires_images = True
    requires_text = True
    requires_segmentation = True
    # end model requirements

    def __init__(
        self,
        category: Category = None,
        *args,
        **kwargs,
    ):
        """Initialize ParagraphExtractionAI."""
        logger.info("Initializing ParagraphExtractionAI.")
        super().__init__(category=category, *args, **kwargs)
        self.tokenizer = ParagraphTokenizer(mode='detectron', create_detectron_labels=True)

    def extract(self, document: Document) -> Document:
        """
        Infer information from a given Document.

        :param document: Document object
        :return: Document with predicted Labels

        :raises:
        AttributeError: When missing a Tokenizer
        """
        inference_document = super().extract(document)

        inference_document = self.tokenizer.tokenize(inference_document)

        return inference_document

    def check_is_ready(self):
        """
        Check if the ExtractionAI is ready for the inference.

        :raises AttributeError: When no Category is specified.
        :raises IndexError: When the Category does not contain the required Labels.
        """
        super().check_is_ready()

        self.project.get_label_by_name('figure')
        self.project.get_label_by_name('table')
        self.project.get_label_by_name('list')
        self.project.get_label_by_name('text')
        self.project.get_label_by_name('title')

        return True


Let’s go step by step.

  1. Imports

    from konfuzio_sdk.trainer.information_extraction import AbstractExtractionAI
    from konfuzio_sdk.tokenizer.paragraph_and_sentence import ParagraphTokenizer
    from konfuzio_sdk.data import Category, Document, Project, Label
    
    
  2. Custom Extraction AI model definition

    class ParagraphExtractionAI(AbstractExtractionAI):
    

    We define a class that inherits from the Konfuzio AbstractExtractionAI class. This class provides the interface that we need to implement for our Custom Extraction AI. All Extraction AI models must inherit from this class.

  3. Add model requirements

        requires_images = True
        requires_text = True
        requires_segmentation = True
    

    We need to define what the model needs to be able to run. This will inform the Konfuzio Server what information needs to be made available to the model before running an extraction. If the model only needs text, you can ignore this step or add requires_text = True to make it explicit. If the model requires Page images, you will need to add requires_images = True. Finally, in our case we also need to add requires_segmentation = True to inform the Server that the model needs the visual segmentation information created by the Paragraph Tokenizer in detectron mode.
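The requirement flags are plain class attributes, so a server-side component can inspect them before running an extraction. The sketch below shows that pattern in plain Python; `required_inputs` is a hypothetical helper written for illustration, not part of the Konfuzio SDK:

```python
class TextOnlyExtractionAI:
    """A model that only needs the Document text; absent flags default to False."""
    requires_text = True

FLAGS = ("requires_images", "requires_text", "requires_segmentation")

def required_inputs(model) -> list:
    # Hypothetical helper: collect every requirement flag the model declares as True.
    return [flag for flag in FLAGS if getattr(model, flag, False)]
```

For the ParagraphExtractionAI above, all three flags are set, so Page images, text, and segmentation information would all need to be available before extraction runs.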

  4. Initialize the model

    def __init__(
       self,
       category: Category = None,
       *args,
       **kwargs,
    ):
       """Initialize ParagraphExtractionAI."""
       logger.info("Initializing ParagraphExtractionAI.")
       super().__init__(category=category, *args, **kwargs)
       self.tokenizer = ParagraphTokenizer(mode='detectron', create_detectron_labels=True)
    

    We initialize the model by calling the __init__ method of the parent class. The only required argument is the Category the Extraction AI will be used with. The Category is the Konfuzio object that contains all the Labels and Label Sets that the model will use to create Annotations. This means you need to make sure that the Category object contains all the Labels and Label Sets that you need for your model. In our case, we need the figure, table, list, text and title Labels.

  5. Define the extract method

        def extract(self, document: Document) -> Document:
            """
            Infer information from a given Document.
    
            :param document: Document object
            :return: Document with predicted Labels
    
            :raises:
            AttributeError: When missing a Tokenizer
            """
            inference_document = super().extract(document)
    
            inference_document = self.tokenizer.tokenize(inference_document)
    
            return inference_document
    

    The extract method is the core of the Extraction AI. It takes a Document as input and returns a Document with Annotations. Make sure to work on a deep copy of the passed Document, so that the new Annotations are added to a Virtual Document with no Annotations; calling super().extract() takes care of this. In our case, we then simply call the Paragraph Tokenizer in detectron mode with the create_detectron_labels option to create the Annotations.
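The deep-copy detail can be shown without the SDK: mutating a deep copy leaves the original untouched, which is exactly why the Virtual Document can be annotated freely. A minimal illustration using a plain dict as a stand-in for a Document:

```python
import copy

# Stand-in for the Document passed to extract().
original = {"text": "19/05/1996 is my birthday.", "annotations": []}

# The Virtual Document starts as an independent deep copy of the original.
virtual = copy.deepcopy(original)
virtual["annotations"].append({"label": "Date", "offsets": (0, 10)})

# The original keeps its state; only the copy gained an Annotation.
```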

  6. [OPTIONAL] Define the check_is_ready method

        def check_is_ready(self):
            """
            Check if the ExtractionAI is ready for the inference.
    
            :raises AttributeError: When no Category is specified.
            :raises IndexError: When the Category does not contain the required Labels.
            """
            super().check_is_ready()
    
            self.project.get_label_by_name('figure')
            self.project.get_label_by_name('table')
            self.project.get_label_by_name('list')
            self.project.get_label_by_name('text')
            self.project.get_label_by_name('title')
    
            return True
    

    The check_is_ready method is used to check whether the model is ready to be used. It should return True if the model is ready to extract, and raise an informative error otherwise. It is checked before saving the model. You don’t have to implement it; the default implementation only checks that a Category is defined.

    In our case, we also check that the Project contains all the Labels that we need. This is not strictly necessary, but it is good practice to make sure that the model is ready to be used.
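One possible refinement, sketched here as an assumption rather than SDK behaviour: collect all missing Labels first and report them in a single error, instead of failing on the first get_label_by_name lookup. `check_labels_present` is a hypothetical helper for illustration:

```python
REQUIRED_LABELS = ["figure", "table", "list", "text", "title"]

def check_labels_present(available_label_names, required=REQUIRED_LABELS):
    """Raise one informative error listing every missing Label at once."""
    missing = [name for name in required if name not in available_label_names]
    if missing:
        raise ValueError(f"Category is missing required Labels: {missing}")
    return True
```

Reporting all missing Labels at once saves a round of fix-and-retry when several are absent.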

  7. Use the model locally

    We first make sure that all needed Labels are present in the Category.

    project = Project(id_=TEST_PROJECT_ID)
    category = project.get_category_by_id(TEST_PAYSLIPS_CATEGORY_ID)
    
    labels = ['figure', 'table', 'list', 'text', 'title']
    
    # creating Labels in case they do not exist
    label_set = project.get_label_set_by_name(category.name)  # default Category label set
    
    for label_name in labels:
        try:
            project.get_label_by_name(label_name)
        except IndexError:
            Label(project=project, text=label_name, label_sets=[label_set])
    

    We can now use the model to extract a Document, and then save the model to a pickle file that can be used in Konfuzio Server.

    document = project.get_document_by_id(TEST_DOCUMENT_ID)
    
    paragraph_extraction_ai = ParagraphExtractionAI(category=category)
    
    assert paragraph_extraction_ai.check_is_ready() is True
    
    extracted_document = paragraph_extraction_ai.extract(document)
    
    print(extracted_document.annotations(use_correct=False))  # Show all the created Annotations
    
    model_path = paragraph_extraction_ai.save()
    
  8. Upload the model to Konfuzio Server

    You can use the Konfuzio SDK to upload your model to your on-prem installation like this:

    from konfuzio_sdk.api import upload_ai_model
    
    upload_ai_model(model_path=model_path, category_ids=[category.id_])
    

    Once the model is uploaded, you can also share it with others on the Konfuzio Marketplace.

Evaluate a Trained Extraction AI Model

In this example we will see how to evaluate a trained RFExtractionAI model. We will assume that a trained, pickled model is available. See here for how to train such a model, and check out the Evaluation documentation for more details.


from konfuzio_sdk.trainer.information_extraction import RFExtractionAI

pipeline = RFExtractionAI.load_model(MODEL_PATH)
# To get the evaluation of the full pipeline
evaluation = pipeline.evaluate_full()
print(f"Full evaluation F1 score: {evaluation.f1()}")
print(f"Full evaluation recall: {evaluation.recall()}")
print(f"Full evaluation precision: {evaluation.precision()}")

# To get the evaluation of the Tokenizer alone
evaluation = pipeline.evaluate_tokenizer()
print(f"Tokenizer evaluation F1 score: {evaluation.tokenizer_f1()}")

# To get the evaluation of the Label classifier given perfect tokenization
evaluation = pipeline.evaluate_clf()
print(f"Label classifier evaluation F1 score: {evaluation.clf_f1()}")

# To get the evaluation of the Label Set classifier given perfect Label classification
evaluation = pipeline.evaluate_label_set_clf()
print(f"Label Set evaluation F1 score: {evaluation.f1()}")
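For reference, each F1 score printed above is the harmonic mean of precision and recall. A standalone computation of the standard formula (not the SDK’s implementation, just the definition the scores follow):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. precision 0.8 and recall 0.6 give an F1 of about 0.686
```

Because F1 is a harmonic mean, it is dragged toward the lower of the two values, which is why a pipeline with high precision but poor recall still scores low.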

Train a Konfuzio SDK Model to Extract Information From Payslip Documents

The tutorial RFExtractionAI Demo shows how to use the Konfuzio SDK package with a simple Whitespace tokenizer and train an RFExtractionAI model to find and extract relevant information like Name, Date and Recipient from payslip documents.

You can open it in Colab or download it from here and try it yourself.