Code Examples

Example Usage

Project

Retrieve all information available for your project:

from konfuzio_sdk.data import Project

my_project = Project(id_=YOUR_PROJECT_ID)

The information will be stored in the folder that you defined for the data during the package initialization. A subfolder will be created for each document in the project.

Whenever there are changes to the project on the Konfuzio Server, the local project can be updated like this:

my_project.get(update=True)

To make sure that your project is loaded with all the latest data:

my_project = Project(id_=YOUR_PROJECT_ID, update=True)

Documents

To access the documents in the project you can use:

documents = my_project.documents

By default, this returns the documents with training status (dataset_status = 2). The status codes are listed below (a short example of changing a status follows the list):

  • None: 0

  • Preparation: 1

  • Training: 2

  • Test: 3

  • Excluded: 4
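If you want to move a document to a different dataset split programmatically, you can set the status code and save the meta-data, following the same pattern shown later in the Modify Document section:

document = my_project.documents[0]
document.dataset_status = 3  # 3 = Test
document.save_meta_data()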

The test documents can be accessed directly by:

test_documents = my_project.test_documents

For more details, you can check out the Project documentation.

By default, you get four files for each document, containing information about the text, pages, annotation sets and annotations. You can see these files inside the document folder.

document.txt - Contains the text of the document. If OCR was used, it will correspond to the result from the OCR.

                                                            x02   328927/10103/00104
Abrechnung  der Brutto/Netto-Bezüge   für Dezember 2018                   22.05.2018 Bat:  1

Personal-Nr.  Geburtsdatum ski Faktor  Ki,Frbtr.Konfession  ‚Freibetragjährl.! |Freibetrag mt! |DBA  iGleitzone  'St.-Tg.  VJuUr. üb. |Url. Anspr. Url.Tg.gen.  |Resturlaub
00104 150356 1  |     ‚ev                              30     400  3000       3400

SV-Nummer       |Krankenkasse                       KK%®|PGRS Bars  jum.SV-Tg. Anw. Tage |UrlaubTage Krankh. Tg. Fehlz. Tage

50150356B581 AOK  Bayern Die Gesundheitskas 157 101 1111 1 30

                                             Eintritt   ‚Austritt     Anw.Std.  Urlaub Std.  |Krankh. Std. |Fehlz. Std.

                                             170299  L L       l     L     l     l
 -                                       +  Steuer-ID       IMrB?       Zeitlohn Sta.|Überstd.  |Bez. Sta.
  Teststraße123
   12345 Testort                                   12345678911           ı     ı     \
                               B/N
               Pers.-Nr.  00104        x02
               Abt.-Nr. A12         10103          HinweisezurAbrechnung
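
The same text is also available programmatically through the Document object, so you rarely need to read document.txt yourself. For example:

document = my_project.documents[0]
print(document.text[:100])  # first 100 characters of the (OCR) text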

pages.json5 - Contains information about each page of the document (for example, their IDs and sizes).

[
  {
    "id": 1923,
    "image": "/page/show/1923/",
    "number": 1,
    "original_size": [
      595.2,
      841.68
    ],
    "size": [
      1414,
      2000
    ]
  }
]
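
The page information can also be read from the Page objects of a Document. A short sketch; number mirrors the field shown above, and original_size is assumed to be exposed as a Page attribute:

document = my_project.documents[0]
for page in document.pages():
    print(page.number, page.original_size)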

annotation_sets.json5 - Contains information about each annotation set (section) in the document (for example, their IDs and label sets).

[
  {
    "id": 78730,
    "position": 1,
    "section_label": 63
  },
  {
    "id": 292092,
    "position": 1,
    "section_label": 64
  }
]
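
A sketch of reading the same information through the SDK, assuming the Document exposes its annotation sets via annotation_sets() and each AnnotationSet carries its LabelSet as label_set:

document = my_project.documents[0]
for annotation_set in document.annotation_sets():
    print(annotation_set.id_, annotation_set.label_set)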

annotations.json5 - Contains information about each annotation in the document (for example, their labels and bounding boxes).

[
  {
    "accuracy": null,
    "bbox": {
      "bottom": 44.369,
      "line_index": 1,
      "page_index": 0,
      "top": 35.369,
      "x0": 468.48,
      "x1": 527.04,
      "y0": 797.311,
      "y1": 806.311
    },
    "bboxes": [
      {
        "bottom": 44.369,
        "end_offset": 169,
        "line_number": 2,
        "offset_string": "22.05.2018",
        "offset_string_original": "22.05.2018",
        "page_index": 0,
        "start_offset": 159,
        "top": 35.369,
        "x0": 468.48,
        "x1": 527.04,
        "y0": 797.311,
        "y1": 806.311
      }
    ],
    "created_by": 59,
    "custom_offset_string": false,
    "end_offset": 169,
    "get_created_by": "[email protected]",
    "get_revised_by": "n/a",
    "id": 4419937,
    "is_correct": true,
    "label": 867,
    "label_data_type": "Date",
    "label_text": "Austellungsdatum",
    "label_threshold": 0.1,--
    "normalized": "2018-05-22",
    "offset_string": "22.05.2018",
    "offset_string_original": "22.05.2018",
    "revised": false,
    "revised_by": null,
    "section": 78730,
    "section_label_id": 63,
    "section_label_text": "Lohnabrechnung",
    "selection_bbox": {
      "bottom": 44.369,
      "line_index": 1,
      "page_index": 0,
      "top": 35.369,
      "x0": 468.48,
      "x1": 527.04,
      "y0": 797.311,
      "y1": 806.311
    },
    "start_offset": 159,
    "translated_string": null
  },
...
]
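
The same information is available through the Document's Annotations. A sketch; offset_string is assumed here to return the annotated text of the Annotation:

document = my_project.documents[0]
for annotation in document.annotations():
    print(annotation.label, annotation.offset_string)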

Download PDFs

To get the PDFs of the documents, you can use get_file().

for document in my_project.documents:
    document.get_file()

This will download the OCR version of the document, which contains the text, the bounding box information of the characters and the image of the document.

In the document folder, you will see a new file with the original name followed by “_ocr”.

If you want the original version of the document (without OCR), you can use ocr_version=False.

for document in my_project.documents:
    document.get_file(ocr_version=False)

In the document folder, you will see a new file with the original name.

Download pages as images

To get the pages of the document as png images, you can use get_images().

for document in my_project.documents:
    document.get_images()

You will get one png image per page of the document, named "page_<number_of_page>.png".
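
To work with the downloaded images, you can list them from the local folder of the Document. A sketch, assuming the Document exposes its local folder as document_folder:

import os

document = my_project.documents[0]
document.get_images()
# list the generated PNG files (document_folder is an assumed attribute)
png_files = [f for f in os.listdir(document.document_folder) if f.endswith('.png')]
print(png_files)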

Download bounding boxes of the characters

To get the bounding box information of the characters, you can use get_bbox().

for document in my_project.documents:
    document.get_bbox()

You will get a file named “bbox.json5”.
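
A sketch of inspecting the result, assuming get_bbox() returns a dictionary keyed by character offset with coordinate fields such as x0, x1, y0 and y1:

document = my_project.documents[0]
char_bboxes = document.get_bbox()
first_key = next(iter(char_bboxes))  # pick an arbitrary character offset
print(first_key, char_bboxes[first_key])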

After downloading these files, their paths also become available in the project instance. For example, you can get the path to the folder containing the documents with:

my_project.documents_folder

Update Document

If there are changes in the document in the Konfuzio Server, you can update the document with:

document.update()

If a document is part of the training or test set, you can also update it by updating the entire project via project.update(). However, for projects with many documents it can be faster to update only the relevant documents.
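
For example, instead of updating the whole project you can update a single document by its ID:

document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
document.update()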

Delete Document

To locally delete a document, you can use:

document.delete()

The document will be deleted from your local data folder but it will remain in the Konfuzio Server. If you want to get it again you can update the project.

If you want to delete a document permanently you can do it like so:

document.delete(delete_online=True)

Upload Document

To upload a new Document in your Project using the SDK, you can choose between two Document methods: from_file_sync and from_file_async.

If you want to upload a document, and start working with it as soon as the OCR processing step is done, we recommend from_file_sync as it will wait for the Document to be processed and then return a ready Document. Beware, this may take from a few seconds up to over a minute.

document = Document.from_file_sync(FILE_PATH, project=my_project)

If, however, you are trying to upload a large number of files and don’t want to wait for them to be processed, you can use the asynchronous function, which only returns a document ID:

document_id = Document.from_file_async(FILE_PATH, project=my_project)

Later, you can load the processed document and get your document with:

my_project.init_or_update_document(from_online=False)

document = my_project.get_document_by_id(document_id)

Modify Document

If you would like to use the SDK to modify some document’s meta-data like the dataset status or the assignee, you can do it like this:

document.assignee = ASSIGNEE_ID
document.dataset_status = 3

document.save_meta_data()

Here, the assignee has been changed on the server to the user with id ASSIGNEE_ID, and the dataset status of the document has been changed to 3 (i.e. Test).

Delete Document

If you would like to delete a Document in the remote server you can simply use the Document.delete method. You can only delete Documents with a dataset status of None (0). Be careful! Once the document is deleted online, we will have no way of recovering it.

document.dataset_status = 0

document.save_meta_data()

document.delete(delete_online=True)

If delete_online is set to False (the default), the Document will only be deleted on your local machine, and will be reloaded next time you load the Project, or if you run the Project.init_or_update_document method directly.

Create Regex-based Annotations

Let’s see a simple example of how we can use the konfuzio_sdk package to get information on a project and to post annotations.

You can follow the example below to post annotations of a certain word or expression in the first document uploaded.

import re

from konfuzio_sdk.data import Project, Annotation, Label, Span

my_project = Project(id_=YOUR_PROJECT_ID)

# Word/expression to annotate in the document
# should match an existing one in your document
input_expression = "John Smith"

# Label for the annotation
label_name = "Name"
# Getting the Label from the project
my_label = my_project.get_label_by_name(label_name)

# LabelSet to which the Label belongs
label_set = my_label.label_sets[0]

# First document in the project
document = my_project.documents[0]

# Matches of the word/expression in the document
matches_locations = [(m.start(0), m.end(0)) for m in re.finditer(input_expression, document.text)]

# List to save the links to the annotations created
new_annotations_links = []

# Create annotation for each match
for offsets in matches_locations:
    span = Span(start_offset=offsets[0], end_offset=offsets[1])
    annotation_obj = Annotation(
        document=document,
        label=my_label,
        label_set=label_set,
        confidence=1.0,
        spans=[span],
        is_correct=True
    )
    new_annotation_added = annotation_obj.save()
    if new_annotation_added:
        new_annotations_links.append(annotation_obj.get_link())

print(new_annotations_links)

Train Label Regex Tokenizer

You can use the konfuzio_sdk package to train a custom Regex tokenizer.

In this example, you will see how to find regexes that match occurrences of the “IBAN” Label in the training data.

from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.regex import RegexTokenizer
from konfuzio_sdk.tokenizer.base import ListTokenizer

my_project = Project(id_=YOUR_PROJECT_ID)
category = my_project.get_category_by_id(id_=CATEGORY_ID)

tokenizer = ListTokenizer(tokenizers=[])

iban_label = my_project.get_label_by_name("IBAN")

for regex in iban_label.find_regex(category=category):
    regex_tokenizer = RegexTokenizer(regex=regex)
    tokenizer.tokenizers.append(regex_tokenizer)

# You can then use it to create an Annotation for every matching string in a document.
document = my_project.get_document_by_id(DOCUMENT_ID)
tokenizer.tokenize(document)
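
To inspect what the trained tokenizer found, you can look at the Annotations it created on the Document. A sketch, assuming annotations(use_correct=False) also returns the unconfirmed matches created by the tokenizer:

for annotation in document.annotations(use_correct=False):
    print(annotation.offset_string)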

Finding Spans of a Label Not Found by a Tokenizer

Here is an example of how to use the Label.spans_not_found_by_tokenizer method. This allows you to determine whether a RegexTokenizer is suitable for finding the Spans of a Label, or which Spans might have been annotated incorrectly. Say you have a number of Annotations assigned to the IBAN Label and want to know which Spans would not be found when using the WhitespaceTokenizer. You can follow this example to find all the relevant Spans.

from konfuzio_sdk.data import Project, Annotation, Label
from konfuzio_sdk.tokenizer.regex import WhitespaceTokenizer

my_project = Project(id_=YOUR_PROJECT_ID)
category = my_project.categories[0]

tokenizer = WhitespaceTokenizer()

iban_label = my_project.get_label_by_name('IBAN')

spans_not_found = iban_label.spans_not_found_by_tokenizer(tokenizer, categories=[category])

for span in spans_not_found:
    print(f"{span}: {span.offset_string}")

Evaluate a Trained Extraction AI Model

In this example we will see how we can evaluate a trained RFExtractionAI model. We will assume that we have a trained pickled model available. See here for how to train such a model, and check out the Evaluation documentation for more details.

from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.information_extraction import load_model

pipeline = load_model(MODEL_PATH)

# To get the evaluation of the full pipeline
evaluation = pipeline.evaluate_full()
print(f"Full evaluation F1 score: {evaluation.f1()}")
print(f"Full evaluation recall: {evaluation.recall()}")
print(f"Full evaluation precision: {evaluation.precision()}")

# To get the evaluation of the tokenizer alone
evaluation = pipeline.evaluate_tokenizer()
print(f"Tokenizer evaluation F1 score: {evaluation.tokenizer_f1()}")

# To get the evaluation of the Label classifier given perfect tokenization
evaluation = pipeline.evaluate_clf()
print(f"Label classifier evaluation F1 score: {evaluation.clf_f1()}")

# To get the evaluation of the LabelSet given perfect Label classification
evaluation = pipeline.evaluate_label_set_clf()
print(f"Label Set evaluation F1 score: {evaluation.f1()}")

Architecture overview

We’ll take a closer look at the most important components of the document-processing pipeline and give you some examples of how they can be implemented sequentially or individually in case you want to experiment.

The first step we’re going to cover is File Splitting – this happens when the original Document consists of several smaller sub-Documents and needs to be separated so that each one can be processed individually.

The second part is on Categorization, where a Document is labelled as being of a certain Category within the Project.

The third part describes Information Extraction, during which various pieces of information are obtained from unstructured texts, e.g. Name, Date, Recipient, or any other custom Labels.

For a more in-depth look at each step, be sure to check out the diagram that reflects each step of the document-processing pipeline.

Splitting for multi-file Documents: Step-by-step guide

Intro

It’s common for multipage files to not be perfectly organized, and in some cases, multiple independent Documents may be included in a single file. To ensure that these Documents are properly processed and separated, we will be discussing a method for identifying and splitting them into individual, independent sub-documents.

../../_images/multi_file_document_example.png

Multi-file Document Example

In this section, we will explore an easy method for identifying and separating Documents that may be included in a single file. Our approach involves analyzing the contents of each Page and identifying similarities to the first Pages of the Document. This will allow us to define splitting points and divide the Document into multiple sub-documents. It’s important to note that this approach is only effective for Documents written in the same language and that the process must be repeated for each Category.

If you are unfamiliar with the SDK’s main concepts (like Page or Span), you can get to know them on the Quickstart page.

Quick explanation

The first step in implementing this method is “training”: this involves tokenizing the Document by splitting its text into parts, specifically into strings without line breaks. We then gather the strings from the Spans (the parts of the text on a Page) that are exclusive to the first Pages of the Documents in the training data.

Once we have identified these strings, we can use them to determine whether a Page in an input Document is a first Page or not. We do this by going through the strings in the Page and comparing them to the set of strings collected in the training stage. If we find at least one string shared between the current Page and the strings from the first step, we consider it a first Page.

Note that the more Documents we use in the training stage, the fewer intersecting strings we are likely to find. If you find that your set of first-page strings is empty, try using a smaller slice of the dataset instead of the whole set. Generally, when used on Documents within the same Category, this algorithm should not return an empty set. If it does, it’s worth checking whether your data is consistent, for example, not in different languages or containing other Categories.
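
Stripped of the SDK classes, the decision boils down to a set intersection. A minimal sketch with plain Python sets (the example strings are illustrative only):

# strings that were exclusive to the first Pages of the training Documents of one Category
first_page_strings = {"Abrechnung der Brutto/Netto-Bezüge", "Personal-Nr."}

# strings found on the Page we want to classify
page_strings = {"Personal-Nr.", "SV-Nummer", "Eintritt"}

# at least one intersecting string means the Page is treated as a first Page
is_first_page = len(first_page_strings & page_strings) > 0
print(is_first_page)  # True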

Step-by-step explanation

In this section, we will walk you through the process of setting up the ContextAwareFileSplittingModel class, which can be found in the code block at the bottom of this page. This class is already implemented and can be imported using from konfuzio_sdk.trainer.file_splitting import ContextAwareFileSplittingModel.

Note that any custom FileSplittingAI (derived from AbstractFileSplittingModel class) requires having the following methods implemented:

  • __init__ to initialize key variables required by the custom AI;

  • fit to define the architecture and training that the model undergoes, e.g. a certain NN architecture or custom hardcoded logic;

  • predict to define how the model classifies Pages as first or non-first. NB: the classification needs to be run on the Page level, not the Document level – the result of the classification is reflected in the is_first_page attribute value, which is unique to the Page class and is not present in the Document class. Pages with is_first_page = True become splitting points; thus, each new sub-Document has a Page predicted as first as its starting point.

To begin, we will make all the necessary imports and initialize the ContextAwareFileSplittingModel class:

import logging

from typing import List

from konfuzio_sdk.data import Page, Category
from konfuzio_sdk.trainer.file_splitting import AbstractFileSplittingModel
from konfuzio_sdk.trainer.information_extraction import load_model
from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer

logger = logging.getLogger(__name__)

class ContextAwareFileSplittingModel(AbstractFileSplittingModel):
    def __init__(self, categories: List[Category], tokenizer, *args, **kwargs):
        super().__init__(categories=categories)
        self.name = self.__class__.__name__
        self.output_dir = self.project.model_folder
        self.tokenizer = tokenizer
        self.requires_text = True
        self.requires_images = False

The class inherits from AbstractFileSplittingModel, so we run super().__init__(categories=categories) to properly inherit its attributes. The tokenizer attribute will be used to process the text within the Document, separating it into Spans. This is done to ensure that the text in all the Documents is split using the same logic (in particular, tokenization by separating on \n whitespaces by ConnectedTextTokenizer, which is used in the example at the end of the page) so that it is possible to find common Spans. It will be used for training and testing Documents as well as any Document that will undergo splitting. It’s important to note that if you run fitting with one tokenizer and then reassign it within the same instance of the model, all previously gathered strings will be deleted and replaced by new ones. requires_images and requires_text determine whether these types of data are used for prediction; this is needed for distinguishing between preprocessing types once a model is passed into the SplittingAI.

An example of how ConnectedTextTokenizer works:

# before tokenization
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)
test_document.text

# output: "This is an example text. \n It has several lines. \n Here it finishes."

test_document.spans()

# output: []

test_document = tokenizer.tokenize(test_document)

# after tokenization
test_document.spans()

# output: [Span (0, 24), Span(25, 47), Span(48, 65)]

test_document.spans()[0].offset_string

# output: "This is an example text. "

The first method to define is the fit() method. For each Category, we call the exclusive_first_page_strings method, which allows us to gather the strings that appear exclusively on the first Pages of the Category’s Documents. allow_empty_categories allows returning empty lists for Categories for which no exclusive first-page strings were found across their Documents. This means that those Categories would not be used in the prediction process.

def fit(self, allow_empty_categories: bool = False, *args, **kwargs):
    for category in self.categories:
        # method exclusive_first_page_strings fetches a set of first-page strings exclusive among the Documents
        # of a given Category. they can be found in _exclusive_first_page_strings attribute of a Category after
        # the method has been run. this is needed so that the information remains even if local variable
        # cur_first_page_strings is lost.
        cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
        if not cur_first_page_strings:
            if allow_empty_categories:
                logger.warning(
                    f'No exclusive first-page strings were found for {category}, so it will not be used '
                    f'at prediction.'
                )
            else:
                raise ValueError(f'No exclusive first-page strings were found for {category}.')

Lastly, we define the predict() method. The method accepts a Page as input and checks its Spans for the presence of the first-page strings of each of the Categories. If there is at least one intersection, the Page is predicted to be a first Page; if there are no intersections, the Page is predicted to be a non-first Page.

def predict(self, page: Page) -> Page:
    for category in self.categories:
        # exclusive_first_page_strings re-uses the implicit _exclusive_first_page_strings attribute once it has
        # already been calculated during the fit() method, so it is not a recurrent calculation each time.
        if not category.exclusive_first_page_strings(tokenizer=self.tokenizer):
            raise ValueError(f"Cannot run prediction as {category} does not have exclusive first-page strings.")
    page.is_first_page = False
    for category in self.categories:
        cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
        intersection = {span.offset_string for span in page.spans()}.intersection(
            cur_first_page_strings
        )
        if len(intersection) > 0:
            page.is_first_page = True
            break
    return page

A quick example of the class’s usage:

from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.file_splitting import SplittingAI

# initialize a Project and fetch a test Document of your choice
project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)

# initialize a ContextAwareFileSplittingModel and fit it

file_splitting_model = ContextAwareFileSplittingModel(categories=project.categories, tokenizer=ConnectedTextTokenizer())
file_splitting_model.fit()

# save the model
file_splitting_model.output_dir = project.model_folder
file_splitting_model.save()

# run the prediction
for page in test_document.pages():
    pred = file_splitting_model.predict(page)
    if pred.is_first_page:
        print('Page {} is predicted as the first.'.format(page.number))
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))

# usage with the SplittingAI – you can load a pre-saved model or pass an initialized instance as the input
# in this example, we load a previously saved one
model = load_model(project.model_folder)

# initialize the SplittingAI
splitting_ai = SplittingAI(model)

# SplittingAI is a more high-level interface to ContextAwareFileSplittingModel and any other models that can be
# developed for file-splitting purposes. It takes a Document as an input, rather than individual Pages, because it
# utilizes page-level prediction of possible split points and returns Document or Documents with changes depending on
# the prediction mode.

# SplittingAI can be run in two modes: returning a list of sub-Documents as the result of the input Document
# splitting or returning a copy of the input Document with Pages predicted as first having an attribute
# "is_first_page". The flag "return_pages" has to be True for the latter; let's use it
new_document = splitting_ai.propose_split_documents(test_document, return_pages=True)
print(new_document)
# output: [predicted_document]

for page in new_document[0].pages():
    if page.is_first_page:
        print('Page {} is predicted as the first.'.format(page.number))
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))

Full code:

import logging

from typing import List

from konfuzio_sdk.data import Page, Category, Project
from konfuzio_sdk.trainer.file_splitting import AbstractFileSplittingModel, SplittingAI
from konfuzio_sdk.trainer.information_extraction import load_model
from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer

logger = logging.getLogger(__name__)

class ContextAwareFileSplittingModel(AbstractFileSplittingModel):
    """Fallback definition of a File Splitting Model."""

    def __init__(self, categories: List[Category], tokenizer, *args, **kwargs):
        super().__init__(categories=categories)
        self.name = self.__class__.__name__
        self.output_dir = self.project.model_folder
        self.tokenizer = tokenizer
        self.requires_text = True
        self.requires_images = False

    def fit(self, allow_empty_categories: bool = False, *args, **kwargs):
        for category in self.categories:
            # the exclusive_first_page_strings method fetches a set of first-page strings exclusive among the
            # Documents of a given Category. they can be found in the _exclusive_first_page_strings attribute of a
            # Category after the method has been run. this is needed so that the information remains even if the
            # local variable cur_first_page_strings is lost.
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            if not cur_first_page_strings:
                if allow_empty_categories:
                    logger.warning(
                        f'No exclusive first-page strings were found for {category}, so it will not be used '
                        f'at prediction.'
                    )
                else:
                    raise ValueError(f'No exclusive first-page strings were found for {category}.')

    def predict(self, page: Page) -> Page:
        for category in self.categories:
            # exclusive_first_page_strings re-uses the implicit _exclusive_first_page_strings attribute once it has
            # already been calculated during the fit() method, so it is not a recurrent calculation each time.
            if not category.exclusive_first_page_strings(tokenizer=self.tokenizer):
                raise ValueError(f"Cannot run prediction as {category} does not have exclusive first-page strings.")
        page.is_first_page = False
        for category in self.categories:
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            intersection = {span.offset_string for span in page.spans()}.intersection(
                cur_first_page_strings
            )
            if len(intersection) > 0:
                page.is_first_page = True
                break
        return page

# initialize a Project and fetch a test Document of your choice
project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)

# initialize a ContextAwareFileSplittingModel and fit it

file_splitting_model = ContextAwareFileSplittingModel(categories=project.categories, tokenizer=ConnectedTextTokenizer())
file_splitting_model.fit()

# save the model
file_splitting_model.save()

# run the prediction
for page in test_document.pages():
    pred = file_splitting_model.predict(page)
    if pred.is_first_page:
        print('Page {} is predicted as the first.'.format(page.number))
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))

# usage with the SplittingAI – you can load a pre-saved model or pass an initialized instance as the input
# in this example, we load a previously saved one
model = load_model(project.model_folder)

# initialize the SplittingAI
splitting_ai = SplittingAI(model)

# SplittingAI is a more high-level interface to ContextAwareFileSplittingModel and any other models that can be
# developed for file-splitting purposes. It takes a Document as an input, rather than individual Pages, because it
# utilizes page-level prediction of possible split points and returns Document or Documents with changes depending on
# the prediction mode.

# SplittingAI can be run in two modes: returning a list of sub-Documents as the result of the input Document
# splitting or returning a copy of the input Document with Pages predicted as first having an attribute
# "is_first_page". The flag "return_pages" has to be True for the latter; let's use it
new_document = splitting_ai.propose_split_documents(test_document, return_pages=True)
print(new_document)
# output: [predicted_document]

for page in new_document[0].pages():
    if page.is_first_page:
        print('Page {} is predicted as the first.'.format(page.number))
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))

Split multi-file Document into Separate files without training a model

Let’s see how to use the konfuzio_sdk to automatically split documents consisting of several files. We will be using the pre-built class ContextAwareFileSplittingModel together with the SplittingAI. The model implements context-aware rule-based logic that requires no ML training.

from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer
from konfuzio_sdk.trainer.file_splitting import ContextAwareFileSplittingModel, SplittingAI
from konfuzio_sdk.trainer.information_extraction import load_model

project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)

# initialize a ContextAwareFileSplittingModel and fit it

file_splitting_model = ContextAwareFileSplittingModel(categories=project.categories, tokenizer=ConnectedTextTokenizer())
file_splitting_model.fit()

# save the model
file_splitting_model.output_dir = project.model_folder
file_splitting_model.save()

# run the prediction
for page in test_document.pages():
    pred = file_splitting_model.predict(page)
    if pred.is_first_page:
        print('Page {} is predicted as the first.'.format(page.number))
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))

# usage with the SplittingAI – you can load a pre-saved model or pass an initialized instance as the input
# in this example, we load a previously saved one
model = load_model(project.model_folder)

# initialize the SplittingAI
splitting_ai = SplittingAI(model)

# SplittingAI is a more high-level interface to ContextAwareFileSplittingModel and any other models that can be
# developed for file-splitting purposes. It takes a Document as an input, rather than individual Pages, because it
# utilizes page-level prediction of possible split points and returns Document or Documents with changes depending on
# the prediction mode.

# SplittingAI can be run in two modes: returning a list of sub-Documents as the result of the input Document
# splitting or returning a copy of the input Document with Pages predicted as first having an attribute
# "is_first_page". The flag "return_pages" has to be True for the latter; let's use it
new_document = splitting_ai.propose_split_documents(test_document, return_pages=True)
print(new_document)
# output: [predicted_document]

for page in new_document[0].pages():
    if page.is_first_page:
        print('Page {} is predicted as the first.'.format(page.number))
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))

FileSplittingEvaluation class

The FileSplittingEvaluation class can be used to evaluate the performance of the ContextAwareFileSplittingModel, returning a set of metrics that includes precision, recall, F1 score, true positives, false positives and false negatives.

The class’s methods calculate() and calculate_by_category() are run at initialization. The class receives two lists of Documents as input: the first list consists of ground-truth Documents where all first Pages are marked as such, and the second consists of Documents on whose Pages the FileSplittingModel has predicted whether they are first or non-first.

The initialization would look like this:

evaluation = FileSplittingEvaluation(ground_truth_documents=YOUR_GROUND_TRUTH_LIST,
                                     prediction_documents=YOUR_PREDICTION_LIST)

The class compares each pair of Pages. If a Page is labeled as first and the model also predicted it as first, it is considered a true positive. If a Page is labeled as first but the model predicted it as non-first, it is considered a false negative. If a Page is labeled as non-first but the model predicted it as first, it is considered a false positive. If a Page is labeled as non-first and the model also predicted it as non-first, it is considered a true negative.

                   predicted correctly    predicted incorrectly
first Page         TP                     FN
non-first Page     TN                     FP

After iterating through all Pages of all Documents, precision, recall and F1 score are calculated. If you wish for the metrics to be set to None whenever a zero division would otherwise occur, set allow_zero=True at initialization.
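
A sketch of the corresponding initialization, using allow_zero as described above:

evaluation = FileSplittingEvaluation(ground_truth_documents=YOUR_GROUND_TRUTH_LIST,
                                     prediction_documents=YOUR_PREDICTION_LIST,
                                     allow_zero=True)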

To see a certain metric after the class has been initialized, you can call a metric’s method:

print(evaluation.fn())

It is also possible to look at the metrics calculated by each Category independently. For this, pass search=YOUR_CATEGORY_HERE when calling the wanted metric’s method:

print(evaluation.fn(search=YOUR_CATEGORY_HERE))

For more details, see the Python API Documentation on Evaluation.

Example of evaluation input and output

Suppose in our test dataset we have 2 Documents of 2 Categories: one 3-paged Document of the first Category, consisting of a single file (so it has only one ground-truth first Page), and one 5-paged Document of the second Category, consisting of three files: two 2-paged and one 1-paged (so it has three ground-truth first Pages).

../../_images/document_example_1.png

First document

../../_images/document_example_2.png

Second document

# generate the test Documents

from konfuzio_sdk.data import Category, Project, Document, Page
from konfuzio_sdk.evaluate import FileSplittingEvaluation
from konfuzio_sdk.trainer.file_splitting import SplittingAI

text_1 = "Hi all,\nI like bread.\nI hope to get everything done soon.\nHave you seen it?"
document_1 = Document(id_=None, project=YOUR_PROJECT, category=YOUR_CATEGORY_1, text=text_1, dataset_status=3)
_ = Page(
        id_=None,
        original_size=(320, 240),
        document=document_1,
        start_offset=0,
        end_offset=21,
        number=1,
    )
_ = Page(
    id_=None,
    original_size=(320, 240),
    document=document_1,
    start_offset=22,
    end_offset=57,
    number=2,
)

_ = Page(
    id_=None,
    original_size=(320, 240),
    document=document_1,
    start_offset=58,
    end_offset=75,
    number=3,
)

text_2 = "Good evening,\nthank you for coming.\nCan you give me that?\nI need it.\nSend it to me."
document_2 = Document(id_=None, project=YOUR_PROJECT, category=YOUR_CATEGORY_2, text=text_2, dataset_status=3)
_ = Page(
    id_=None,
    original_size=(320, 240),
    document=document_2,
    start_offset=0,
    end_offset=12,
    number=1
)
_ = Page(
    id_=None,
    original_size=(320, 240),
    document=document_2,
    start_offset=13,
    end_offset=34,
    number=2
)
_ = Page(
    id_=None,
    original_size=(320, 240),
    document=document_2,
    start_offset=35,
    end_offset=56,
    number=3
)
_.is_first_page = True
_ = Page(
    id_=None,
    original_size=(320, 240),
    document=document_2,
    start_offset=57,
    end_offset=67,
    number=4
)
_ = Page(
    id_=None,
    original_size=(320, 240),
    document=document_2,
    start_offset=68,
    end_offset=82,
    number=5
)
_.is_first_page = True

We need to pass two lists of Documents to the FileSplittingEvaluation class, so before that we need to run each Page of the Documents through the model’s prediction.

Let’s say the prediction gave good results, with only one first Page predicted as non-first and all the other Pages predicted correctly. An example of how the evaluation could be implemented:

splitting_ai = SplittingAI(YOUR_MODEL_HERE)
pred_1: Document = splitting_ai.propose_split_documents(document_1, return_pages=True)[0]
pred_2: Document = splitting_ai.propose_split_documents(document_2, return_pages=True)[0]
evaluation = FileSplittingEvaluation(ground_truth_documents=[document_1, document_2],
                                     prediction_documents=[pred_1, pred_2])
print(evaluation.tp()) # returns: 3
print(evaluation.tn()) # returns: 4
print(evaluation.fp()) # returns: 0
print(evaluation.fn()) # returns: 1
print(evaluation.precision()) # returns: 1
print(evaluation.recall()) # returns: 0.75
print(evaluation.f1()) # returns: 0.85

Our results could be reflected in the following table:

TPs   TNs   FPs   FNs   precision   recall   F1
3     4     0     1     1           0.75     0.85

If we want to see evaluation results by Category, the implementation of the Evaluation would look like this:

print(evaluation.tp(search=CATEGORY_1), evaluation.tp(search=CATEGORY_2)) # returns: 1 2
print(evaluation.tn(search=CATEGORY_1), evaluation.tn(search=CATEGORY_2)) # returns: 2 2
print(evaluation.fp(search=CATEGORY_1), evaluation.fp(search=CATEGORY_2)) # returns: 0 0
print(evaluation.fn(search=CATEGORY_1), evaluation.fn(search=CATEGORY_2)) # returns: 0 1
print(evaluation.precision(search=CATEGORY_1), evaluation.precision(search=CATEGORY_2)) # returns: 1 1
print(evaluation.recall(search=CATEGORY_1), evaluation.recall(search=CATEGORY_2)) # returns: 1 0.66
print(evaluation.f1(search=CATEGORY_1), evaluation.f1(search=CATEGORY_2)) # returns: 1 0.79

The output could be reflected in the following table:

Category     TPs   TNs   FPs   FNs   precision   recall   F1
Category 1   1     2     0     0     1           1        1
Category 2   2     2     0     1     1           0.66     0.79

To log metrics after evaluation, you can call the EvaluationCalculator’s metrics_logging method (you would need to specify the metrics accordingly at the class’s initialization). Example usage:

EvaluationCalculator(tp=3, fp=0, fn=1, tn=4).metrics_logging()

Document Categorization

Categorization Fallback Logic

Use the name of the category as an effective fallback logic to categorize documents when no categorization AI is available:

from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.document_categorization import FallbackCategorizationModel

project = Project(id_=YOUR_PROJECT_ID)
categorization_model = FallbackCategorizationModel(project)
categorization_model.categories = project.categories

test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)

# returns virtual SDK Document with category attribute
result_doc = categorization_model.categorize(document=test_document)

# if the input document is already categorized, the already present category is used
# unless recategorize is True
result_doc = categorization_model.categorize(document=test_document, recategorize=True)

print(f"Found category {result_doc.category} for {result_doc}")

# option to modify the provided document in place
categorization_model.categorize(document=test_document, inplace=True)

print(f"Found category {test_document.category} for {test_document}")

Train a Konfuzio SDK Model to Extract Information From Payslip Documents

The tutorial RFExtractionAI Demo shows how to use the Konfuzio SDK package with a simple Whitespace tokenizer to train an “RFExtractionAI” model that finds and extracts relevant information like Name, Date and Recipient from payslip documents.

You can open it in Colab or download it from here and try it yourself.

Retrain Flair NER-Ontonotes-Fast with Human Revised Annotations

The tutorial Retrain Flair NER-Ontonotes-Fast with Human Revised Annotations shows how to use the Konfuzio SDK package to include an easy feedback workflow in your training pipeline. It also gives an example of how you can take advantage of open-source models to speed up the annotation process and use the feedback workflow to adapt the model’s domain knowledge to your use case.

You can open it in Colab or download it from here and try it yourself.

Count Relevant Expressions in Annual Reports

The tutorial Count Relevant Expressions in Annual Reports shows how to use the Konfuzio SDK package to retrieve structured and organized information that can be used for a deeper analysis and understanding of your data. It will show you how to identify and count pre-specified expressions in documents and how to collect that information.

You can open it in Colab or download it from here and try it yourself.