Tutorials

Tutorials are lessons that take the reader by the hand through a series of steps to complete a project of some kind. Tutorials are learning-oriented.

Overview

The first step we’re going to cover is File Splitting – this is needed when the original Document consists of several smaller sub-Documents and has to be separated so that each one can be processed individually.

The second part covers Categorization, where a Document is labelled as belonging to a certain Category within the Project.

The third part describes Information Extraction, during which various pieces of information are obtained from unstructured text, e.g. Name, Date, Recipient, or any other custom Labels.

For a more in-depth look at each step, be sure to check out the Architecture Diagram that reflects each step of the document-processing pipeline.

File Splitting

You can train your own File Splitting AI on the data from any Project of your choice. For that purpose, there are several tools in the SDK that enable processing Documents that consist of multiple files and propose splitting them into the Sub-Documents accordingly:

  • A Context Aware File Splitting Model uses a simple hands-on logic based on scanning a Category’s Documents and finding strings that are exclusive to the first Pages of all Documents within the Category. To predict whether a Page is a potential splitting point (i.e. whether it is a first Page), we compare the Page’s contents to these exclusive first-page strings; if at least one such string occurs on the Page, we mark it as first (and thus as a splitting point). An instance of the Context Aware File Splitting Model can be used to initially build a File Splitting pipeline and can later be replaced with more complex solutions.

    A Context Aware File Splitting Model instance can be used with an interface provided by the Splitting AI – this class accepts a whole Document instead of a single Page and proposes splitting points or splits the original Document.

  • A Multimodal File Splitting Model takes both the visual and the textual parts of the Pages and processes them independently via a simplified VGG19 architecture and LegalBERT respectively, passing the resulting outputs together to a Multi-Layered Perceptron (a rough sketch follows this list). The model’s output is, again, a prediction of a Page being first or non-first.
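
Below is a minimal, hypothetical sketch of this multimodal idea in PyTorch: per-Page visual and textual feature vectors coming from two separate encoders are concatenated and fed to a small Multi-Layered Perceptron that outputs a first-page probability. The class name, feature dimensions and layer sizes are illustrative assumptions, not the SDK’s actual implementation.

import torch
import torch.nn as nn

class FirstPageClassifierSketch(nn.Module):
    """Toy stand-in for the fusion step of a multimodal first-page classifier."""

    def __init__(self, image_feature_dim: int = 512, text_feature_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(image_feature_dim + text_feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, image_features: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
        # concatenate the per-Page visual and textual features, then predict a first-page probability
        combined = torch.cat([image_features, text_features], dim=-1)
        return torch.sigmoid(self.mlp(combined))

# example run with random placeholder features for a batch of 4 Pages
classifier = FirstPageClassifierSketch()
first_page_probabilities = classifier(torch.rand(4, 512), torch.rand(4, 768))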

For developing a custom File Splitting approach, we propose an abstract class AbstractFileSplittingModel.

Train a File Splitting AI locally

Let’s see how to use the konfuzio_sdk to automatically split a file into several Documents. We will be using a pre-built class SplittingAI and an instance of a trained ContextAwareFileSplittingModel. The latter uses a context-aware logic. By context-aware we mean a rule-based approach that looks for common strings between the first Pages of all of a Category’s Documents. To predict whether a Page is a potential splitting point (i.e. whether it is a first Page), we compare the Page’s contents to these common first-page strings; if at least one such string occurs on the Page, we mark it as first (and thus as a splitting point).

This tutorial can also be used with the MultimodalFileSplittingModel; the only difference in the initialization is that it does not require specifying a Tokenizer explicitly.

from konfuzio_sdk.data import Page, Category, Project
from konfuzio_sdk.trainer.file_splitting import SplittingAI
from konfuzio_sdk.trainer.file_splitting import ContextAwareFileSplittingModel
from konfuzio_sdk.trainer.information_extraction import load_model
from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer

# initialize a Project and fetch a test Document of your choice
project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)

# initialize a Context Aware File Splitting Model and fit it

file_splitting_model = ContextAwareFileSplittingModel(
    categories=project.categories, tokenizer=ConnectedTextTokenizer()
)
# to run a Multimodal File Splitting Model instead, replace the line above with the following lines. note that
# training a Multimodal File Splitting Model can take longer than training a Context Aware File Splitting Model.
#
# from konfuzio_sdk.trainer.file_splitting import MultimodalFileSplittingModel
# file_splitting_model = MultimodalFileSplittingModel(categories=project.categories)

# for an example run, you can take only a slice of training documents to make fitting faster
file_splitting_model.documents = file_splitting_model.documents[:10]

file_splitting_model.fit(allow_empty_categories=True)

# save the model
save_path = file_splitting_model.save(include_konfuzio=True)

# run the prediction and see its confidence
for page in test_document.pages():
    pred = file_splitting_model.predict(page)
    if pred.is_first_page:
        print(
            'Page {} is predicted as the first. Confidence: {}.'.format(page.number, page.is_first_page_confidence)
        )
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))

# usage with the Splitting AI – you can load a pre-saved model or pass an initialized instance as the input
# in this example, we load a previously saved one
model = load_model(save_path)

# initialize the Splitting AI
splitting_ai = SplittingAI(model)

# Splitting AI is a more high-level interface to Context Aware File Splitting Model and any other models that can be
# developed for File Splitting purposes. It takes a Document as an input, rather than individual Pages, because it
# utilizes page-level prediction of possible split points and returns Document or Documents with changes depending
# on the prediction mode.

# Splitting AI can be run in two modes: returning a list of Sub-Documents as the result of the input Document
# splitting or returning a copy of the input Document with Pages predicted as first having an attribute
# "is_first_page". The flag "return_pages" has to be True for the latter; let's use it
new_document = splitting_ai.propose_split_documents(test_document, return_pages=True)
print(new_document)
# output: [predicted_document]

for page in new_document[0].pages():
    if page.is_first_page:
        print(
            'Page {} is predicted as the first. Confidence: {}.'.format(page.number, page.is_first_page_confidence)
        )
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))

Develop and save a custom File Splitting AI

In this tutorial, you will learn how to train a custom File Splitting AI on the data from a Project of your choice and save a trained model for further usage.

Intro

It’s common for multi-paged files to not be perfectly organized, and in some cases, multiple independent Documents may be included in a single file. To ensure that these Documents are properly processed and separated, we will be discussing a method for identifying and splitting them into individual, independent Sub-documents.

../_images/multi_file_document_example.png

Multi-file Document Example

Konfuzio SDK offers two ways of separating Documents that may be included in a single file. One of them is to train an instance of the Multimodal File Splitting Model to predict whether a Page is first or non-first and to run the Splitting AI with it. The Multimodal File Splitting Model is a combined approach that processes the textual and visual data of the Documents separately (in our case, using simplified BERT and VGG19 architectures respectively) and then combines the outputs, which are passed into a Multi-layer Perceptron. A more detailed scheme of the architecture can be found further below.


Another approach is the context-aware file splitting logic implemented by the Context Aware File Splitting Model. This approach involves analyzing the contents of each Page and identifying similarities to the first Pages of the Category’s Documents. It allows us to define splitting points and divide the Document into multiple Sub-documents. It’s important to note that this approach is only effective for Documents written in the same language and that the process must be repeated for each Category. In this tutorial, we will explain how to implement the class for this model step by step.

If you are unfamiliar with the SDK’s main concepts (like Page or Span), you can get to know them at Data Layer Concepts.

Quick explanation

The first step in implementing this method is “training”: this involves tokenizing the Document by splitting its text into parts, specifically into strings without line breaks. We then gather those strings from the Spans – the parts of the text in the Page – that appear exclusively on the first Pages of the Documents in the training data.

Once we have identified these strings, we can use them to determine whether a Page in an input Document is a first Page or not. We do this by going through the strings in the Page and comparing them to the set of strings collected in the training stage. If we find at least one string that the current Page has in common with the strings from the first step, we consider it a first Page.

Note that the more Documents we use in the training stage, the fewer exclusive first-page strings we are likely to find. If you find that your set of first-page strings is empty, try using a smaller slice of the dataset instead of the whole set. Generally, when used on Documents within the same Category, this algorithm should not return an empty set. If it does, it’s worth checking whether your data is consistent, for example, not in different languages and not containing other Categories.
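
As a toy illustration of this logic with plain Python sets (the strings below are made up for the example, not SDK objects):

# strings found exclusively on first Pages during the “training” stage (hypothetical examples)
first_page_strings = {"Invoice No.", "Dear Sir or Madam,"}

def looks_like_first_page(page_strings: set) -> bool:
    # a Page is considered first if at least one exclusive first-page string occurs on it
    return bool(page_strings & first_page_strings)

print(looks_like_first_page({"Invoice No.", "Total amount: 42 EUR"}))  # True
print(looks_like_first_page({"Page 2 of 3", "Total amount: 42 EUR"}))  # False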

Step-by-step explanation

In this section, we will walk you through the process of setting up the ContextAwareFileSplittingModel class, which can be found in the code block at the bottom of this page. This class is already implemented and can be imported using from konfuzio_sdk.trainer.file_splitting import ContextAwareFileSplittingModel.

Note that any custom File Splitting AI (derived from AbstractFileSplittingModel class) requires having the following methods implemented:

  • __init__ to initialize key variables required by the custom AI;

  • fit to define the architecture and training that the model undergoes, i.e. a certain NN architecture or custom hardcoded logic;

  • predict to define how the model classifies Pages as first or non-first. NB: the classification needs to be run on the Page level, not the Document level – the result of classification is reflected in the is_first_page attribute value, which is unique to the Page class and is not present in the Document class. Pages with is_first_page = True become splitting points; thus, each new Sub-Document has a Page predicted as first as its starting point.

To begin, we will make all the necessary imports:

import logging

from typing import List

from konfuzio_sdk.data import Page, Category, Project
from konfuzio_sdk.trainer.file_splitting import AbstractFileSplittingModel
from konfuzio_sdk.trainer.file_splitting import SplittingAI
from konfuzio_sdk.trainer.information_extraction import load_model
from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer

logger = logging.getLogger(__name__)

Then, let’s initialize the ContextAwareFileSplittingModel class:

class ContextAwareFileSplittingModel(AbstractFileSplittingModel):
    def __init__(self, categories: List[Category], tokenizer, *args, **kwargs):
        super().__init__(categories=categories)
        self.output_dir = self.project.model_folder
        self.tokenizer = tokenizer
        self.requires_text = True
        self.requires_images = False

The class inherits from AbstractFileSplittingModel, so we run super().__init__(categories=categories) to properly inherit its attributes. The tokenizer attribute will be used to process the text within the Document, separating it into Spans. This ensures that the text in all the Documents is split using the same logic (in particular, tokenization by separating on \n whitespaces with the ConnectedTextTokenizer, which is used in the example at the end of this page), so that common Spans can be found. The Tokenizer will be used for training and testing Documents, as well as for any Document that will undergo splitting. It’s important to note that if you run fitting with one Tokenizer and then reassign it within the same instance of the model, all previously gathered strings will be deleted and replaced by new ones. requires_images and requires_text determine whether these types of data are used for prediction; this is needed for distinguishing between preprocessing types once a model is passed into the Splitting AI.

An example of how ConnectedTextTokenizer works:

# initialize a Project and fetch a Document to tokenize
project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)

# before tokenization
test_document.text

# output: "Hi all,\nI like bread.\n\fI hope to get everything done soon.\n\fMorning,\n\fI'm glad to see you."
#             "\n\fMorning,"

test_document.spans()

# output: []

tokenizer = ConnectedTextTokenizer()
test_document = tokenizer.tokenize(test_document)

# after tokenization
test_document.spans()

# output: [Span (0, 7), Span (8, 21), Span (22, 58), Span (59, 68), Span (69, 90), Span (91, 100)]

test_document.spans()[0].offset_string

# output: "Hi all,"

The first method to define will be the fit() method. For each Category, we call the exclusive_first_page_strings method, which allows us to gather the strings that appear exclusively on the first Pages of the Category’s Documents. allow_empty_categories allows returning empty lists for Categories for which no exclusive first-page strings were found across their Documents. This means that those Categories will not be used in the prediction process.

    def fit(self, allow_empty_categories: bool = False, *args, **kwargs):
        for category in self.categories:
            # method exclusive_first_page_strings fetches a set of first-page strings exclusive among the Documents
            # of a given Category. they can be found in _exclusive_first_page_strings attribute of a Category after
            # the method has been run. this is needed so that the information remains even if local variable
            # cur_first_page_strings is lost.
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            if not cur_first_page_strings:
                if allow_empty_categories:
                    logger.warning(
                        f'No exclusive first-page strings were found for {category}, so it will not be used '
                        f'at prediction.'
                    )
                else:
                    raise ValueError(f'No exclusive first-page strings were found for {category}.')

Next, we define the predict() method. The method accepts a Page as an input and checks whether its set of Spans contains any of the first-page strings of each of the Categories. If there is at least one intersection, the Page is predicted to be a first Page. If there are no intersections, the Page is predicted to be a non-first Page.

    def predict(self, page: Page) -> Page:
        self.check_is_ready()
        page.is_first_page = False
        for category in self.categories:
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            intersection = {span.offset_string.strip('\f').strip('\n') for span in page.spans()}.intersection(
                cur_first_page_strings
            )
            if len(intersection) > 0:
                page.is_first_page = True
                break
        page.is_first_page_confidence = 1
        return page

Lastly, a check_is_ready() method is defined. This method is used to ensure that a model is ready for prediction: the checks verify that the Tokenizer and a set of Categories are defined, and that at least one of the Categories has exclusive first-page strings.

    def check_is_ready(self):
        if self.tokenizer is None:
            raise AttributeError(f'{self} missing Tokenizer.')

        if not self.categories:
            raise AttributeError(f'{self} requires Categories.')

        empty_first_page_strings = [
            category
            for category in self.categories
            if not category.exclusive_first_page_strings(tokenizer=self.tokenizer)
        ]
        if len(empty_first_page_strings) == len(self.categories):
            raise ValueError(
                f"Cannot run prediction as none of the Categories in {self.project} have "
                f"_exclusive_first_page_strings."
            )

Full code of the class:

class ContextAwareFileSplittingModel(AbstractFileSplittingModel):
    def __init__(self, categories: List[Category], tokenizer, *args, **kwargs):
        super().__init__(categories=categories)
        self.output_dir = self.project.model_folder
        self.tokenizer = tokenizer
        self.requires_text = True
        self.requires_images = False

    def fit(self, allow_empty_categories: bool = False, *args, **kwargs):
        for category in self.categories:
            # method exclusive_first_page_strings fetches a set of first-page strings exclusive among the Documents
            # of a given Category. they can be found in _exclusive_first_page_strings attribute of a Category after
            # the method has been run. this is needed so that the information remains even if local variable
            # cur_first_page_strings is lost.
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            if not cur_first_page_strings:
                if allow_empty_categories:
                    logger.warning(
                        f'No exclusive first-page strings were found for {category}, so it will not be used '
                        f'at prediction.'
                    )
                else:
                    raise ValueError(f'No exclusive first-page strings were found for {category}.')

    def predict(self, page: Page) -> Page:
        self.check_is_ready()
        page.is_first_page = False
        for category in self.categories:
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            intersection = {span.offset_string.strip('\f').strip('\n') for span in page.spans()}.intersection(
                cur_first_page_strings
            )
            if len(intersection) > 0:
                page.is_first_page = True
                break
        page.is_first_page_confidence = 1
        return page

    def check_is_ready(self):
        if self.tokenizer is None:
            raise AttributeError(f'{self} missing Tokenizer.')

        if not self.categories:
            raise AttributeError(f'{self} requires Categories.')

        empty_first_page_strings = [
            category
            for category in self.categories
            if not category.exclusive_first_page_strings(tokenizer=self.tokenizer)
        ]
        if len(empty_first_page_strings) == len(self.categories):
            raise ValueError(
                f"Cannot run prediction as none of the Categories in {self.project} have "
                f"_exclusive_first_page_strings."
            )

A quick example of the class’s usage:

# initialize a Project and fetch a test Document of your choice
project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)

# initialize a Context Aware File Splitting Model and fit it

file_splitting_model = ContextAwareFileSplittingModel(
    categories=project.categories, tokenizer=ConnectedTextTokenizer()
)

# for an example run, you can take only a slice of training documents to make fitting faster
file_splitting_model.documents = file_splitting_model.documents[:10]

file_splitting_model.fit(allow_empty_categories=True)

# save the model
save_path = file_splitting_model.save(include_konfuzio=True)

# run the prediction and see its confidence
for page in test_document.pages():
    pred = file_splitting_model.predict(page)
    if pred.is_first_page:
        print(
            'Page {} is predicted as the first. Confidence: {}.'.format(page.number, page.is_first_page_confidence)
        )
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))

# usage with the Splitting AI – you can load a pre-saved model or pass an initialized instance as the input
# in this example, we load a previously saved one
model = load_model(save_path)

# initialize the Splitting AI
splitting_ai = SplittingAI(model)

# Splitting AI is a more high-level interface to Context Aware File Splitting Model and any other models that can be
# developed for File Splitting purposes. It takes a Document as an input, rather than individual Pages, because it
# utilizes page-level prediction of possible split points and returns Document or Documents with changes depending
# on the prediction mode.

# Splitting AI can be run in two modes: returning a list of Sub-Documents as the result of the input Document
# splitting or returning a copy of the input Document with Pages predicted as first having an attribute
# "is_first_page". The flag "return_pages" has to be True for the latter; let's use it
new_document = splitting_ai.propose_split_documents(test_document, return_pages=True)
print(new_document)
# output: [predicted_document]

for page in new_document[0].pages():
    if page.is_first_page:
        print(
            'Page {} is predicted as the first. Confidence: {}.'.format(page.number, page.is_first_page_confidence)
        )
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))

FileSplittingEvaluation class

The FileSplittingEvaluation class can be used to evaluate the performance of the Context Aware File Splitting Model, returning a set of metrics that includes precision, recall, F1 measure, True Positives, False Positives, True Negatives, and False Negatives.

The class’s methods calculate() and calculate_by_category() are run at initialization. The class receives two lists of Documents as input – the first list consists of ground-truth Documents in which all first Pages are marked as such; the second consists of Documents on whose Pages the File Splitting Model has run a prediction of being first or non-first.

The initialization would look like this:

from konfuzio_sdk.evaluate import FileSplittingEvaluation

evaluation = FileSplittingEvaluation(
    ground_truth_documents=YOUR_GROUND_TRUTH_LIST, prediction_documents=YOUR_PREDICTION_LIST
)

The class compares each pair of Pages. If a Page is labeled as first and the model also predicted it as first, it is considered a True Positive. If a Page is labeled as first but the model predicted it as non-first, it is considered a False Negative. If a Page is labeled as non-first but the model predicted it as first, it is considered a False Positive. If a Page is labeled as non-first and the model also predicted it as non-first, it is considered a True Negative.

                  predicted correctly   predicted incorrectly
first Page        TP                    FN
non-first Page    TN                    FP

After iterating through all Pages of all Documents, precision, recall and F1 measure are calculated. If you wish to set the metrics to None in case of a zero-division attempt, set allow_zero=True at initialization.
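
These metrics follow the standard definitions; here is a quick reference implementation in plain Python (for illustration only, not an SDK API):

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)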

To see a certain metric after the class has been initialized, you can call a metric’s method:

print(evaluation.fn())

It is also possible to look at the metrics calculated for each Category independently. For this, pass search=YOUR_CATEGORY when calling the desired metric’s method:

print(evaluation.fn(search=YOUR_CATEGORY))

For more details, see the Python API Documentation on Evaluation.

Example of evaluation input and output

Suppose our test dataset has two Documents of two Categories: a 3-paged Document of the first Category consisting of a single file (so it has only one ground-truth first Page), and a 5-paged Document of the second Category consisting of three files – two 2-paged and one 1-paged (so it has three ground-truth first Pages).

../_images/document_example_1.png

First document

../_images/document_example_2.png

Second document

from konfuzio_sdk.data import Document, Page
from konfuzio_sdk.evaluate import FileSplittingEvaluation, EvaluationCalculator
from konfuzio_sdk.trainer.file_splitting import SplittingAI
# This example builds the Documents from scratch and without uploading a Supported File.
# If you uploaded your Document to the Konfuzio Server, you can just retrieve it with:
# document_1 = project.get_document_by_id(YOUR_DOCUMENT_ID)
text_1 = "Hi all,\nI like bread.\nI hope to get everything done soon.\nHave you seen it?"
document_1 = Document(id_=20, project=YOUR_PROJECT, category=YOUR_CATEGORY_1, text=text_1, dataset_status=3)
_ = Page(
    id_=None, original_size=(320, 240), document=document_1, start_offset=0, end_offset=21, number=1, copy_of_id=29
)

_ = Page(
    id_=None, original_size=(320, 240), document=document_1, start_offset=22, end_offset=57, number=2, copy_of_id=30
)

_ = Page(
    id_=None, original_size=(320, 240), document=document_1, start_offset=58, end_offset=75, number=3, copy_of_id=31
)

# As with the previous example Document, you can just retrieve an online Document with
# document_2 = project.get_document_by_id(YOUR_DOCUMENT_ID)
text_2 = "Evening,\nthank you for coming.\nI like fish.\nI need it.\nEvening."
document_2 = Document(id_=21, project=YOUR_PROJECT, category=YOUR_CATEGORY_2, text=text_2, dataset_status=3)
_ = Page(
    id_=None, original_size=(320, 240), document=document_2, start_offset=0, end_offset=8, number=1, copy_of_id=32
)
_ = Page(
    id_=None, original_size=(320, 240), document=document_2, start_offset=9, end_offset=30, number=2, copy_of_id=33
)
_ = Page(
    id_=None, original_size=(320, 240), document=document_2, start_offset=31, end_offset=43, number=3, copy_of_id=34
)
_.is_first_page = True
_ = Page(
    id_=None, original_size=(320, 240), document=document_2, start_offset=44, end_offset=54, number=4, copy_of_id=35
)
_ = Page(
    id_=None, original_size=(320, 240), document=document_2, start_offset=55, end_offset=63, number=5, copy_of_id=36
)
_.is_first_page = True

We need to pass two lists of Documents into the FileSplittingEvaluation class. So, before that, we need to run each Page of the Documents through the model’s prediction.

Let’s say the evaluation gave good results, with only one first Page predicted as non-first and all the other Pages predicted correctly. The evaluation could then be implemented as follows:

splitting_ai = SplittingAI(YOUR_MODEL)
pred_1: Document = splitting_ai.propose_split_documents(document_1, return_pages=True)[0]
pred_2: Document = splitting_ai.propose_split_documents(document_2, return_pages=True)[0]
evaluation = FileSplittingEvaluation(
    ground_truth_documents=[document_1, document_2], prediction_documents=[pred_1, pred_2]
)

print(evaluation.tp())
# returns: 3
print(evaluation.tn())
# returns: 4
print(evaluation.fp())
# returns: 0
print(evaluation.fn())
# returns: 1
print(evaluation.precision())
# returns: 1
print(evaluation.recall())
# returns: 0.75
print(evaluation.f1())
# returns: 0.85

Our results could be reflected in the following table:

TPs   TNs   FPs   FNs   precision   recall   F1
3     4     0     1     1           0.75     0.85

If we want to see evaluation results by Category, the implementation of the Evaluation would look like this:

print(evaluation.tp(search=YOUR_CATEGORY_1), evaluation.tp(search=YOUR_CATEGORY_2))
# returns: 1 2
print(evaluation.tn(search=YOUR_CATEGORY_1), evaluation.tn(search=YOUR_CATEGORY_2))
# returns: 2 2
print(evaluation.fp(search=YOUR_CATEGORY_1), evaluation.fp(search=YOUR_CATEGORY_2))
# returns: 0 0
print(evaluation.fn(search=YOUR_CATEGORY_1), evaluation.fn(search=YOUR_CATEGORY_2))
# returns: 0 1
print(evaluation.precision(search=YOUR_CATEGORY_1), evaluation.precision(search=YOUR_CATEGORY_2))
# returns: 1 1
print(evaluation.recall(search=YOUR_CATEGORY_1), evaluation.recall(search=YOUR_CATEGORY_2))
# returns: 1 0.66
print(evaluation.f1(search=YOUR_CATEGORY_1), evaluation.f1(search=YOUR_CATEGORY_2))
# returns: 1 0.8

The output could be reflected in the following table:

Category     TPs   TNs   FPs   FNs   precision   recall   F1
Category 1   1     2     0     0     1           1        1
Category 2   2     2     0     1     1           0.66     0.8

To log metrics after evaluation, you can call the EvaluationCalculator’s metrics_logging method (you will need to specify the metrics accordingly at the class’s initialization). Example usage:

EvaluationCalculator(tp=3, fp=0, fn=1, tn=4).metrics_logging()

Document Categorization

Working with the Category of a Document and its individual Pages

You can initialize a Document with a Category, which will count as if a human manually revised it.

from konfuzio_sdk.data import Project, Document

project = Project(id_=YOUR_PROJECT_ID)
my_category = project.get_category_by_id(YOUR_CATEGORY_ID)

my_document = Document(text="My text.", project=project, category=my_category)
assert my_document.category == my_category
assert my_document.category_is_revised is True

If a Document is initialized with no Category, it will automatically be set to NO_CATEGORY. Another Category can be manually set later.

document = project.get_document_by_id(YOUR_DOCUMENT_ID)
assert document.category == project.no_category
document.set_category(my_category)
assert document.category == my_category
assert document.category_is_revised is True
# This will set it for all of its Pages as well.
for page in document.pages():
    assert page.category == my_category

If you use a Categorization AI to automatically assign a Category to a Document (such as the NameBasedCategorizationAI), each Page will be assigned a Category Annotation with predicted confidence information, and the following properties will be accessible. You can also find these documented under API Reference - Document, API Reference - Page and API Reference - Category Annotation.

  • CategoryAnnotation.category – The AI-predicted Category of this Category Annotation.

  • CategoryAnnotation.confidence – The AI-predicted confidence of this Category Annotation.

  • Document.category_annotations – List of predicted Category Annotations at the Document level.

  • Document.maximum_confidence_category_annotation – Get the maximum-confidence predicted Category Annotation, or the human-revised one if present.

  • Document.maximum_confidence_category – Get the maximum-confidence predicted Category, or the human-revised one if present.

  • Document.category – Returns a Category only if all Pages have the same Category, otherwise None. In that case, it hints that the Document should probably be revised or split into Documents with consistently categorized Pages.

  • Page.category_annotations – List of predicted Category Annotations at the Page level.

  • Page.maximum_confidence_category_annotation – Get the maximum-confidence predicted Category Annotation, or the one revised by the user for this Page.

  • Page.category – Get the maximum-confidence predicted Category, or the one revised by the user for this Page.
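
As a short hedged illustration, assuming result_doc is a Document that has been processed by a Categorization AI (as in the examples below), these properties can be read like this:

# Document level: the best Category Annotation and its confidence
best_annotation = result_doc.maximum_confidence_category_annotation
print(best_annotation.category, best_annotation.confidence)

# Page level: the Category of each Page
for page in result_doc.pages():
    print(page.number, page.category)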

Name-based Categorization AI

Use the name of the Category as an effective fallback logic to categorize Documents when no Categorization AI is available:

from konfuzio_sdk.data import Project, Document
from konfuzio_sdk.trainer.document_categorization import NameBasedCategorizationAI

# Set up your Project.
project = Project(id_=YOUR_PROJECT_ID)

# Initialize the Categorization Model.
categorization_model = NameBasedCategorizationAI(project.categories)

# Retrieve a Document to categorize.
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)

# The Categorization Model returns a copy of the SDK Document with Category attribute
# (use inplace=True to maintain the original Document instead).
# If the input Document is already categorized, the already present Category is used
# (use recategorize=True if you want to force a recategorization).
result_doc = categorization_model.categorize(document=test_document)

# Each Page is categorized individually.
for page in result_doc.pages():
    assert page.category == project.categories[0]
    print(f"Found category {page.category} for {page}")

# The Category of the Document is defined when all pages' Categories are equal.
# If the Document contains mixed Categories, only the Page level Category will be defined,
# and the Document level Category will be NO_CATEGORY.
print(f"Found category {result_doc.category} for {result_doc}")

Model-based Categorization AI

Build, train and test a Categorization AI using Image Models and Text Models to classify the image and text of each Page.

For a list of available Models see Available Categorization Models.

from konfuzio_sdk.data import Project, Document
from konfuzio_sdk.trainer.information_extraction import load_model
from konfuzio_sdk.trainer.document_categorization import build_categorization_ai_pipeline
from konfuzio_sdk.trainer.document_categorization import ImageModel, TextModel

# Set up your Project.
project = Project(id_=YOUR_PROJECT_ID)
# Build the Categorization AI architecture using a template
# of pre-built Image and Text classification Models.
categorization_pipeline = build_categorization_ai_pipeline(
    categories=project.categories,
    documents=project.documents,
    test_documents=project.test_documents,
    image_model=ImageModel.EfficientNetB0,
    text_model=TextModel.NBOWSelfAttention,
)
# Train the AI.
categorization_pipeline.fit(n_epochs=1, optimizer={'name': 'Adam'})
# Evaluate the AI
data_quality = categorization_pipeline.evaluate(use_training_docs=True)
ai_quality = categorization_pipeline.evaluate()
assert data_quality.f1(None) == 1.0
assert ai_quality.f1(None) == 1.0
# Categorize a Document
document = project.get_document_by_id(YOUR_DOCUMENT_ID)
categorization_result = categorization_pipeline.categorize(document=document)
assert isinstance(categorization_result, Document)
for page in categorization_result.pages():
    print(f"Found category {page.category} for {page}")
# Save and load a pickle file for the AI
pickle_ai_path = categorization_pipeline.save()
categorization_pipeline = load_model(pickle_ai_path)

Available Categorization Models

When using build_categorization_ai_pipeline, you can select which Image Model and/or Text Model to use for classification. At least one of the two must be specified; both can also be used at the same time.
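
For example, a text-only Categorization AI can be built by simply omitting the image_model argument (a sketch under the assumption that both arguments are optional keyword arguments, as described above, and that project is an initialized Project as in the previous example):

from konfuzio_sdk.trainer.document_categorization import build_categorization_ai_pipeline, TextModel

# text-only pipeline: no image_model is passed
categorization_pipeline = build_categorization_ai_pipeline(
    categories=project.categories,
    documents=project.documents,
    test_documents=project.test_documents,
    text_model=TextModel.NBOW,
)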

The list of available Categorization Models is implemented as an Enum containing the following elements:

from konfuzio_sdk.trainer.document_categorization import ImageModel, TextModel

# Image Models
ImageModel.VGG11
ImageModel.VGG13
ImageModel.VGG16
ImageModel.VGG19
ImageModel.EfficientNetB0
ImageModel.EfficientNetB1
ImageModel.EfficientNetB2
ImageModel.EfficientNetB3
ImageModel.EfficientNetB4
ImageModel.EfficientNetB5
ImageModel.EfficientNetB6
ImageModel.EfficientNetB7
ImageModel.EfficientNetB8

# Text Models
TextModel.NBOW
TextModel.NBOWSelfAttention
TextModel.LSTM
TextModel.BERT

See more details about these Categorization Models under API Reference - Categorization AI.

Categorization AI Overview Diagram

In the first diagram, we show the class hierarchy of the available Categorization Models within the SDK. Note that the Multimodal Model simply consists of a Multi-Layer Perceptron applied to the concatenated feature outputs of a Text Model and an Image Model, so that the predictions from both Models can be unified into a single Category prediction.

In the second diagram, we show how these models are contained within a Model-based Categorization AI. The Categorization AI class provides the high-level interface to categorize Documents, as exemplified in the code examples above. It uses a Page Categorization Model to categorize each Page. The Page Categorization Model is a container for Categorization Models: it wraps the feature output layers of each contained Model with a Dropout Layer and a Fully Connected Layer.
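
As a rough sketch of that wrapping step (hypothetical feature dimension and Category count, not the SDK’s actual classes):

import torch.nn as nn

# hypothetical feature dimension of a contained Categorization Model and number of Categories
feature_dim, n_categories = 512, 4

# each contained Model’s feature output is wrapped with a Dropout Layer and a Fully Connected Layer
classification_head = nn.Sequential(
    nn.Dropout(p=0.2),
    nn.Linear(feature_dim, n_categories),
)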

Document Information Extraction

Train a Konfuzio SDK Model to Extract Information From Payslip Documents

The tutorial RFExtractionAI Demo aims to show you how to use the Konfuzio SDK package with a simple Whitespace tokenizer to train an “RFExtractionAI” model that finds and extracts relevant information, such as Name, Date and Recipient, from payslip documents.

You can open it in Colab or download it from here and try it yourself.

Evaluate a Trained Extraction AI Model

In this example we will see how we can evaluate a trained RFExtractionAI model. We will assume that we have a trained pickled model available. See here for how to train such a model, and check out the Evaluation documentation for more details.

from konfuzio_sdk.trainer.information_extraction import load_model

pipeline = load_model(MODEL_PATH)

# To get the evaluation of the full pipeline
evaluation = pipeline.evaluate_full()
print(f"Full evaluation F1 score: {evaluation.f1()}")
print(f"Full evaluation recall: {evaluation.recall()}")
print(f"Full evaluation precision: {evaluation.precision()}")

# To get the evaluation of the Tokenizer alone
evaluation = pipeline.evaluate_tokenizer()
print(f"Tokenizer evaluation F1 score: {evaluation.tokenizer_f1()}")

# To get the evaluation of the Label classifier given perfect tokenization
evaluation = pipeline.evaluate_clf()
print(f"Label classifier evaluation F1 score: {evaluation.clf_f1()}")

# To get the evaluation of the LabelSet given perfect Label classification
evaluation = pipeline.evaluate_clf()
print(f"Label Set evaluation F1 score: {evaluation.f1()}")

Paragraph and Sentence Tokenizer

The ParagraphTokenizer and SentenceTokenizer are used to split a Document into paragraphs and sentences respectively. They both come with two different modes: detectron and line_distance. The detectron mode uses a fine-tuned Detectron2 model to detect paragraph Annotations. The line_distance mode uses a rule-based approach and is therefore faster, but tends to be less accurate; in particular, it fails to handle documents with two columns. The detectron mode is the default. It can also be used together with the create_detectron_labels setting to create Annotations with the Label given by our Detectron2 model: figure, table, list, text and title.
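
For example, the faster rule-based mode can be selected explicitly (a minimal sketch; the examples below use the default detectron mode):

from konfuzio_sdk.tokenizer.paragraph_and_sentence import ParagraphTokenizer, SentenceTokenizer

# rule-based mode: faster, but less accurate (e.g. on two-column layouts)
paragraph_tokenizer = ParagraphTokenizer(mode='line_distance')
sentence_tokenizer = SentenceTokenizer(mode='line_distance')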

Paragraph Tokenizer

For example, to tokenize a Document into paragraphs using the ParagraphTokenizer in detectron mode and the create_detectron_labels option to use the labels provided by our Detectron model, you can use the following code:

from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.paragraph_and_sentence import ParagraphTokenizer

# initialize a Project and fetch a Document to tokenize
project = Project(id_=YOUR_PROJECT_ID)
document = project.get_document_by_id(YOUR_DOCUMENT_ID)

# create the ParagraphTokenizer and tokenize the Document

tokenizer = ParagraphTokenizer(mode='detectron', create_detectron_labels=True)

document = tokenizer(document)

The resulting Annotations will look like this:

../_images/paragraph_tokenizer.png

Sentence Tokenizer

If you are interested in a more fine grained tokenization, you can use the SentenceTokenizer. It can be used to create Annotations for each individual sentence in a text Document. To use it, you can use the following code:

from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.paragraph_and_sentence import SentenceTokenizer

# initialize a Project and fetch a Document to tokenize
project = Project(id_=YOUR_PROJECT_ID)
document = project.get_document_by_id(YOUR_DOCUMENT_ID)

# create the SentenceTokenizer and tokenize the Document

tokenizer = SentenceTokenizer(mode='detectron')

document = tokenizer(document)

The resulting Annotations will look like this:

../_images/sentence_tokenizer.png

Data Validation Rules

Konfuzio automatically applies a set of rules for validating data within a Project. Data Validation Rules ensure that Training and Test data is consistent and well formed for training an Extraction AI with Konfuzio.

In general, if a Document fails any of the checks described in the next sections, it will not be possible to train an AI with that Document.

More specifically:

  • If a Document fails any of the checks described in the Bbox Validation Rules section, it will not be possible to initialize that Document as a Python object when initializing the Project (such as with project = Project(id_=YOUR_PROJECT_ID)); a ValueError will be raised. All other Documents in the Project will still be able to be initialized.

  • If a Document fails any of the checks described in the sections Annotation Validation Rules and Span Validation Rules, it will not be possible to retrieve the Annotations (including their Spans) that fail the specific checks (such as with annotation = document.get_annotation_by_id(YOUR_ANNOTATION_ID)), and a ValueError will be raised, as shown in the sketch after this list. All other Annotations in the Document will be retrievable.
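
A minimal hedged sketch of how such a failure surfaces, assuming YOUR_ANNOTATION_ID points to an Annotation that violates one of the rules:

from konfuzio_sdk.data import Project

project = Project(id_=YOUR_PROJECT_ID)
document = project.get_document_by_id(YOUR_DOCUMENT_ID)

try:
    annotation = document.get_annotation_by_id(YOUR_ANNOTATION_ID)
except ValueError as error:
    # the Annotation failed the Data Validation Rules and cannot be retrieved
    print(f'Annotation {YOUR_ANNOTATION_ID} failed validation: {error}')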

Initializing a Project with the Data Validation Rules enabled

By default, any Project has the Data Validation Rules enabled, so nothing special needs to be done to enable it.

from konfuzio_sdk.data import Project

# the Data Validation Rules are enabled by default
project = Project(id_=YOUR_PROJECT_ID)

Document Validation Rules

A Document passes the Data Validation Rules only if all the contained Annotations, Spans and Bboxes pass the Data Validation Rules. If at least one Annotation, Span, or Bbox within a Document fails one of the following checks, the entire Document is marked as unsuitable for training an Extraction AI.

Annotation Validation Rules

An Annotation passes the Data Validation Rules only if:

  1. The Annotation is not from a Category different from the Document’s Category

  2. The Annotation is not entirely overlapping with another Annotation with the same Label

    • It implies that partial overlaps with same Labels are allowed

    • It implies that full overlaps with different Labels are allowed

  3. The Annotation has at least one Span

Please note that the Annotation Validation Rules are indifferent about the values of Annotation.is_correct or Annotation.revised. For more information about what these boolean values mean, see Konfuzio Server - Annotations.

Span Validation Rules

A Span passes the Data Validation Rules only if:

  1. The Span contains non-empty text (the end offset must be strictly greater than the start offset)

  2. The Span is contained within a single line of text (must not be distributed across multiple lines)

Bbox Validation Rules

A Bbox passes the Data Validation Rules only if:

  1. The Bbox has non-negative width and height (zero is allowed for compatibility reasons with many OCR engines)

  2. The Bbox is entirely contained within the bounds of a Page

  3. The character that is mapped by the Bbox must correspond to the text in the Document

Initializing a Project with the Data Validation Rules disabled

By default, any Project has the Data Validation Rules enabled.

A possible reason for choosing to disable the Data Validation Rules that come with the Konfuzio SDK is that an expert user wants to define a custom data structure or training pipeline which violates some assumptions normally present in Konfuzio Extraction AIs and pipelines. If you don’t want to validate your data, you should initialize the Project with strict_data_validation=False.

We highly recommend keeping the Data Validation Rules enabled at all times, as they ensure that Training and Test data is consistent for training an Extraction AI. Disabling the Data Validation Rules and training an Extraction AI with potentially duplicated, malformed, or inconsistent data can decrease the quality of the Extraction AI. Only disable them if you know what you are doing.

from konfuzio_sdk.data import Project

# disable the Data Validation Rules (not recommended)
project = Project(id_=YOUR_PROJECT_ID, strict_data_validation=False)

Find possible outliers among the ground-truth Annotations

If you want to ensure that Annotations of a Label are consistent and check for possible outliers, you can use one of the Label class’s methods. There are three of them available.

  • get_probable_outliers_by_regex looks for the worst regexes used to find the Annotations, where “worst” is determined by the number of True Positives calculated upon evaluating the regexes’ performance. It returns the Annotations predicted by the regexes with the lowest number of True Positives. By default, the method returns Annotations retrieved by a regex that performs at the level of 10% in comparison to the best one.

    from konfuzio_sdk.data import Project
    
    project = Project(id_=YOUR_PROJECT_ID)
    label = project.get_label_by_name(YOUR_LABEL_NAME)
    outliers = label.get_probable_outliers_by_regex(project.categories)
    
  • get_probable_outliers_by_confidence looks for the Annotations with the lowest confidence level, provided it is lower than the specified threshold (the default threshold is 0.5). It accepts an instance of the EvaluationExtraction class as an input and uses the confidence predictions from there.

    from konfuzio_sdk.data import Project
    
    project = Project(id_=YOUR_PROJECT_ID)
    label = project.get_label_by_name(YOUR_LABEL_NAME)
    outliers = label.get_probable_outliers_by_confidence(evaluation)
    
  • get_probable_outliers_by_normalization looks for the Annotations that are unable to pass normalization by the data type of the given Label (meaning that they are not of the same data type themselves, thus outliers).

    from konfuzio_sdk.data import Project
    
    project = Project(id_=YOUR_PROJECT_ID)
    label = project.get_label_by_name(YOUR_LABEL_NAME)
    outliers = label.get_probable_outliers_by_normalization(project.categories)
    

All three of these methods return a list of Annotations that are deemed outliers by the logic of the respective method; the contents of the output are not necessarily wrong, but they may differ in some way from the main body of the Annotations under a given Label.

For a more thorough check, you can use the get_probable_outliers method, which allows combining the aforementioned methods or running them all together, returning only those Annotations that were detected by all of the enabled methods.

Here’s an example of running the latter method with one of the search methods disabled explicitly. By default, all three of the search methods are enabled.

from konfuzio_sdk.data import Project

project = Project(id_=YOUR_PROJECT_ID)
label = project.get_label_by_name(YOUR_LABEL_NAME)
outliers = label.get_probable_outliers(project.categories, confidence_search=False)

Create Regex-based Annotations

Pro Tip: Read our technical blog post Automated Regex to find out how we use Regex to detect outliers in our annotated data.

Let’s see a simple example of how we can use the konfuzio_sdk package to get information on a Project and to post Annotations.

You can follow the example below to post Annotations of a certain word or expression in the first Document uploaded.

import re

from konfuzio_sdk.data import Project, Annotation, Span

my_project = Project(id_=YOUR_PROJECT_ID)
# Word/expression to annotate in the Document
# should match an existing one in your Document
input_expression = "Musterstraße"

# Label for the Annotation
label_name = "Lohnart"
# Getting the Label from the Project
my_label = my_project.get_label_by_name(label_name)

# LabelSet to which the Label belongs
label_set = my_label.label_sets[0]

# First Document in the Project
document = my_project.documents[0]

# Matches of the word/expression in the Document
matches_locations = [(m.start(0), m.end(0)) for m in re.finditer(input_expression, document.text)]

# List to save the links to the Annotations created
new_annotations_links = []

# Create Annotation for each match
for offsets in matches_locations:
    span = Span(start_offset=offsets[0], end_offset=offsets[1])
    annotation_obj = Annotation(
        document=document, label=my_label, label_set=label_set, confidence=1.0, spans=[span], is_correct=True
    )
    new_annotation_added = annotation_obj.save()
    if new_annotation_added:
        new_annotations_links.append(annotation_obj.get_link())
    # (optional) delete the Annotation from the Server again – only needed to clean up after this example
    annotation_obj.delete(delete_online=True)

print(new_annotations_links)

Train Label Regex Tokenizer

You can use the konfuzio_sdk package to train a custom Regex tokenizer.

In this example, you will see how to find regexes that match occurrences of the “Lohnart” Label in the training data.

from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.regex import RegexTokenizer
from konfuzio_sdk.tokenizer.base import ListTokenizer

my_project = Project(id_=YOUR_PROJECT_ID)
category = my_project.get_category_by_id(id_=YOUR_CATEGORY_ID)

tokenizer = ListTokenizer(tokenizers=[])

label = my_project.get_label_by_name("Lohnart")

for regex in label.find_regex(category=category):
    regex_tokenizer = RegexTokenizer(regex=regex)
    tokenizer.tokenizers.append(regex_tokenizer)

# You can then use it to create an Annotation for every matching string in a Document.
document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
tokenizer.tokenize(document)

Finding Spans of a Label Not Found by a Tokenizer

Here is an example of how to use the Label.spans_not_found_by_tokenizer method. This will allow you to determine whether a RegexTokenizer is suitable for finding the Spans of a Label, or which Spans might have been annotated incorrectly. Say you have a number of Annotations assigned to the IBAN Label and want to know which Spans would not be found when using the WhitespaceTokenizer. You can follow this example to find all the relevant Spans.

from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.regex import WhitespaceTokenizer

my_project = Project(id_=YOUR_PROJECT_ID)
category = my_project.categories[0]

tokenizer = WhitespaceTokenizer()

label = my_project.get_label_by_name('Austellungsdatum')

spans_not_found = label.spans_not_found_by_tokenizer(tokenizer, categories=[category])

for span in spans_not_found:
    print(f"{span}: {span.offset_string}")

Tutorial: Getting Word Bounding Box (BBox) for a Document

In this tutorial, we will walk through how to extract the bounding box (BBox) for words in a Document, rather than for individual characters, using the Konfuzio SDK. This process involves the use of the WhitespaceTokenizer from the Konfuzio SDK to tokenize the Document and identify word-level Spans, which can then be visualized or used to extract BBox information.

Prerequisites

  • You will need to have the Konfuzio SDK installed.

  • You should have access to a Project on the Konfuzio platform.


Steps

  1. Import necessary modules:

    from copy import deepcopy
    from konfuzio_sdk.data import Project
    from konfuzio_sdk.tokenizer.regex import WhitespaceTokenizer
    
  2. Initialize your Project:

    This involves creating a Project instance with the appropriate ID.

    project = Project(id_=YOUR_PROJECT_ID)
    
  3. Retrieve a Document from your Project:

    document = project.get_document_by_id(YOUR_DOCUMENT_ID)
    
  4. Create a copy of your Document without Annotations:

    document = deepcopy(document)
    
  5. Tokenize the Document:

    This process involves splitting the Document into word-level Spans using the WhitespaceTokenizer.

    tokenizer = WhitespaceTokenizer()
    document = tokenizer.tokenize(document)
    
  6. Visualize all word-level Annotations:

    After getting the bounding box for all Spans, you might want to visually check the results to make sure the bounding boxes are correctly assigned. Here’s how you can do it:

    document.get_page_by_index(0).get_annotations_image(display_all=True)
    
    ../_images/word-bboxes.png

    This will display an image of the Document with all word-level Annotations. The image may look a bit messy with all the Labels.

  7. Get bounding box for all Spans:

    You can retrieve bounding boxes for all word-level Spans using the following code:

    span_bboxes = [span.bbox() for span in document.spans()]
    

    Each bounding box (Bbox) in the list corresponds to a specific word and is defined by four coordinates: x0 and y0 specify the coordinates of the bottom left corner, while x1 and y1 mark the coordinates of the top right corner, thereby specifying the box’s position and dimensions on the Document Page.
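
    Continuing from the previous step, the individual coordinates can be inspected like this (assuming the returned Bbox objects expose them as x0, y0, x1 and y1 attributes, as described above):

    first_bbox = span_bboxes[0]
    print(first_bbox.x0, first_bbox.y0, first_bbox.x1, first_bbox.y1)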

Retrain Flair NER-Ontonotes-Fast with Human Revised Annotations

The tutorial Retrain Flair NER-Ontonotes-Fast with Human Revised Annotations aims to show you how to use the Konfuzio SDK package to include an easy feedback workflow in your training pipeline. It also gives an example of how you can take advantage of open-source models to speed up the annotation process and use the feedback workflow to adapt the model’s domain knowledge to your aim.

You can open it in Colab or download it from here and try it yourself.

Count Relevant Expressions in Annual Reports

The tutorial Count Relevant Expressions in Annual Reports aims to show you how to use the Konfuzio SDK package to retrieve structured and organized information that can be used for a deeper analysis and understanding of your data. It will show you how to identify and count pre-specified expressions in documents and how to collect that information.

You can open it in Colab or download it from here and try it yourself.