Code Examples¶
Example Usage¶
Project¶
Retrieve all information available for your project:
from konfuzio_sdk.data import Project

my_project = Project(id_=YOUR_PROJECT_ID)
The information will be stored in the data folder that you defined during the package initialization. A subfolder will be created for each document in the project.
Whenever there are changes to the project on the Konfuzio Server, the local project can be updated like this:
my_project.get(update=True)
To make sure that your project is loaded with all the latest data:
my_project = Project(id_=YOUR_PROJECT_ID, update=True)
Documents¶
To access the documents in the project you can use:
documents = my_project.documents
By default, it will get the documents with training status (dataset_status = 2). The codes for the dataset status are:
None: 0
Preparation: 1
Training: 2
Test: 3
Excluded: 4
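For example, to move a Document to the test set you can assign the corresponding status code and save the change back to the server (using save_meta_data, as shown in the Modify Document section below):
# hypothetical example: `document` is any Document fetched from my_project
document.dataset_status = 3  # Test
document.save_meta_data()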
The test documents can be accessed directly by:
test_documents = my_project.test_documents
For more details, you can check out the Project documentation.
By default, you get 4 files for each document, containing information about the text, pages, annotation sets, and annotations. You can find these files inside the document folder.
document.txt - Contains the text of the document. If OCR was used, it will correspond to the result from the OCR.
x02 328927/10103/00104
Abrechnung der Brutto/Netto-Bezüge für Dezember 2018 22.05.2018 Bat: 1
Personal-Nr. Geburtsdatum ski Faktor Ki,Frbtr.Konfession ‚Freibetragjährl.! |Freibetrag mt! |DBA iGleitzone 'St.-Tg. VJuUr. üb. |Url. Anspr. Url.Tg.gen. |Resturlaub
00104 150356 1 | ‚ev 30 400 3000 3400
SV-Nummer |Krankenkasse KK%®|PGRS Bars jum.SV-Tg. Anw. Tage |UrlaubTage Krankh. Tg. Fehlz. Tage
50150356B581 AOK Bayern Die Gesundheitskas 157 101 1111 1 30
Eintritt ‚Austritt Anw.Std. Urlaub Std. |Krankh. Std. |Fehlz. Std.
170299 L L l L l l
- + Steuer-ID IMrB? Zeitlohn Sta.|Überstd. |Bez. Sta.
Teststraße123
12345 Testort 12345678911 ı ı \
B/N
Pers.-Nr. 00104 x02
Abt.-Nr. A12 10103 HinweisezurAbrechnung
pages.json5 - Contains information about each page of the document (for example, their ids and sizes).
[
{
"id": 1923,
"image": "/page/show/1923/",
"number": 1,
"original_size": [
595.2,
841.68
],
"size": [
1414,
2000
]
}
]
annotation_sets.json5 - Contains information about each section in the document (for example, their ids and label sets).
[
{
"id": 78730,
"position": 1,
"section_label": 63
},
{
"id": 292092,
"position": 1,
"section_label": 64
}
]
annotations.json5 - Contains information about each annotation in the document (for example, their labels and bounding boxes).
[
{
"accuracy": null,
"bbox": {
"bottom": 44.369,
"line_index": 1,
"page_index": 0,
"top": 35.369,
"x0": 468.48,
"x1": 527.04,
"y0": 797.311,
"y1": 806.311
},
"bboxes": [
{
"bottom": 44.369,
"end_offset": 169,
"line_number": 2,
"offset_string": "22.05.2018",
"offset_string_original": "22.05.2018",
"page_index": 0,
"start_offset": 159,
"top": 35.369,
"x0": 468.48,
"x1": 527.04,
"y0": 797.311,
"y1": 806.311
}
],
"created_by": 59,
"custom_offset_string": false,
"end_offset": 169,
"get_created_by": "[email protected]",
"get_revised_by": "n/a",
"id": 4419937,
"is_correct": true,
"label": 867,
"label_data_type": "Date",
"label_text": "Austellungsdatum",
"label_threshold": 0.1,--
"normalized": "2018-05-22",
"offset_string": "22.05.2018",
"offset_string_original": "22.05.2018",
"revised": false,
"revised_by": null,
"section": 78730,
"section_label_id": 63,
"section_label_text": "Lohnabrechnung",
"selection_bbox": {
"bottom": 44.369,
"line_index": 1,
"page_index": 0,
"top": 35.369,
"x0": 468.48,
"x1": 527.04,
"y0": 797.311,
"y1": 806.311
},
"start_offset": 159,
"translated_string": null
},
...
]
Download PDFs¶
To get the PDFs of the documents, you can use get_file().
for document in my_project.documents:
document.get_file()
This will download the OCR version of the document, which contains the text, the bounding box information of the characters, and the image of the document.
In the document folder, you will see a new file with the original name followed by “_ocr”.
If you want the original version of the document (without OCR) you can use ocr_version=False.
for document in my_project.documents:
document.get_file(ocr_version=False)
In the document folder, you will see a new file with the original name.
Download pages as images¶
To get the pages of the document as png images, you can use get_images().
for document in my_project.documents:
document.get_images()
You will get one png image named “page_number_of_page.png” for each page in the document.
Download bounding boxes of the characters¶
To get the bounding boxes information of the characters, you can use get_bbox().
for document in my_project.documents:
document.get_bbox()
You will get a file named “bbox.json5”.
After downloading these files, the paths to them will also become available in the project instance. For example, you can get the path to the file with the document text with:
my_project.documents_folder
Update Document¶
If there are changes in the document in the Konfuzio Server, you can update the document with:
document.update()
If a document is part of the training or test set, you can also update it by updating the entire project via project.update(). However, for projects with many documents it can be faster to update only the relevant documents.
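For example:
# refresh the whole Project (convenient, but potentially slow for large Projects)
my_project.update()
# refresh only the Document you care about
document.update()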
Delete Document¶
To locally delete a document, you can use:
document.delete()
The document will be deleted from your local data folder but it will remain in the Konfuzio Server. If you want to get it again you can update the project.
If you want to delete a document permanently you can do it like so:
document.delete(delete_online=True)
Upload Document¶
To upload a new Document in your Project using the SDK, you have the option between two Document methods: from_file_sync and from_file_async.
If you want to upload a document, and start working with it as soon as the OCR processing step is done, we recommend from_file_sync
as it will
wait for the Document to be processed and then return a ready Document. Beware, this may take from a few seconds up to over a minute.
document = Document.from_file_sync(FILE_PATH, project=my_project)
If, however, you are trying to upload a large number of files and don’t want to wait for them to be processed, you can use the asynchronous function, which only returns a document ID:
document_id = Document.from_file_async(FILE_PATH, project=my_project)
Later, you can refresh the local Project data and retrieve the processed document with:
my_project.init_or_update_document(from_online=False)
document = my_project.get_document_by_id(document_id)
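The asynchronous method also scales to many files. A minimal sketch of that pattern, assuming a hypothetical local folder of PDFs:
from pathlib import Path

# queue several files for processing without waiting for OCR to finish
document_ids = []
for file_path in Path("my_documents").glob("*.pdf"):  # hypothetical folder
    document_ids.append(Document.from_file_async(str(file_path), project=my_project))

# later, refresh the local Project data and fetch the processed Documents
my_project.init_or_update_document(from_online=False)
documents = [my_project.get_document_by_id(doc_id) for doc_id in document_ids]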
Modify Document¶
If you would like to use the SDK to modify some document’s meta-data like the dataset status or the assignee, you can do it like this:
document.assignee = ASSIGNEE_ID
document.dataset_status = 3
document.save_meta_data()
Here, the assignee has been changed on the server to the user with ID ASSIGNEE_ID, and the dataset status of the document has been changed to 3 (i.e. Test).
Delete Document¶
If you would like to delete a Document in the remote server you can simply use the Document.delete
method. You can only delete Documents with a dataset status of None (0). Be careful! Once the document is deleted online, we will have no way of recovering it.
document.dataset_status = 0
document.save_meta_data()
document.delete(delete_online=True)
If delete_online
is set to False (the default), the Document will only be deleted on your local machine, and will be reloaded next time you load the Project, or if you run the Project.init_or_update_document
method directly.
Create Regex-based Annotations¶
Let’s see a simple example of how we can use the konfuzio_sdk
package to get information on a project and to post annotations.
You can follow the example below to post annotations of a certain word or expression in the first document uploaded.
import re
from konfuzio_sdk.data import Project, Annotation, Label, Span
my_project = Project(id_=YOUR_PROJECT_ID)
# Word/expression to annotate in the document
# should match an existing one in your document
input_expression = "John Smith"
# Label for the annotation
label_name = "Name"
# Getting the Label from the project
my_label = my_project.get_label_by_name(label_name)
# LabelSet to which the Label belongs
label_set = my_label.label_sets[0]
# First document in the project
document = my_project.documents[0]
# Matches of the word/expression in the document
matches_locations = [(m.start(0), m.end(0)) for m in re.finditer(input_expression, document.text)]
# List to save the links to the annotations created
new_annotations_links = []
# Create annotation for each match
for offsets in matches_locations:
    span = Span(start_offset=offsets[0], end_offset=offsets[1])
    annotation_obj = Annotation(
        document=document,
        label=my_label,
        label_set=label_set,
        confidence=1.0,
        spans=[span],
        is_correct=True,
    )
    new_annotation_added = annotation_obj.save()
    if new_annotation_added:
        new_annotations_links.append(annotation_obj.get_link())

print(new_annotations_links)
Train Label Regex Tokenizer¶
You can use the konfuzio_sdk
package to train a custom Regex tokenizer.
In this example, you will see how to find regex expressions that match occurrences of the “IBAN” Label in the training data.
from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.regex import RegexTokenizer
from konfuzio_sdk.tokenizer.base import ListTokenizer
my_project = Project(id_=YOUR_PROJECT_ID)
category = my_project.get_category_by_id(id_=CATEGORY_ID)
tokenizer = ListTokenizer(tokenizers=[])
iban_label = my_project.get_label_by_name("IBAN")
for regex in iban_label.find_regex(category=category):
    regex_tokenizer = RegexTokenizer(regex=regex)
    tokenizer.tokenizers.append(regex_tokenizer)

# You can then use it to create an Annotation for every matching string in a document.
document = my_project.get_document_by_id(DOCUMENT_ID)
tokenizer.tokenize(document)
Finding Spans of a Label Not Found by a Tokenizer¶
Here is an example of how to use the Label.spans_not_found_by_tokenizer method. This will allow you to determine whether a tokenizer is suitable for finding the Spans of a Label, or which Spans might have been annotated incorrectly. Say you have a number of Annotations assigned to the IBAN Label and want to know which Spans would not be found when using the WhitespaceTokenizer. You can follow this example to find all the relevant Spans.
from konfuzio_sdk.data import Project, Annotation, Label
from konfuzio_sdk.tokenizer.regex import WhitespaceTokenizer
my_project = Project(id_=YOUR_PROJECT_ID)
category = my_project.categories[0]
tokenizer = WhitespaceTokenizer()
iban_label = my_project.get_label_by_name('IBAN')
spans_not_found = iban_label.spans_not_found_by_tokenizer(tokenizer, categories=[category])
for span in spans_not_found:
    print(f"{span}: {span.offset_string}")
Evaluate a Trained Extraction AI Model¶
In this example we will see how we can evaluate a trained RFExtractionAI
model. We will assume that we have a trained pickled model available. See here for how to train such a model, and check out the Evaluation documentation for more details.
from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.information_extraction import load_model
pipeline = load_model(MODEL_PATH)
# To get the evaluation of the full pipeline
evaluation = pipeline.evaluate_full()
print(f"Full evaluation F1 score: {evaluation.f1()}")
print(f"Full evaluation recall: {evaluation.recall()}")
print(f"Full evaluation precision: {evaluation.precision()}")
# To get the evaluation of the tokenizer alone
evaluation = pipeline.evaluate_tokenizer()
print(f"Tokenizer evaluation F1 score: {evaluation.tokenizer_f1()}")
# To get the evaluation of the Label classifier given perfect tokenization
evaluation = pipeline.evaluate_clf()
print(f"Label classifier evaluation F1 score: {evaluation.clf_f1()}")
# To get the evaluation of the LabelSet given perfect Label classification
evaluation = pipeline.evaluate_label_set_clf()
print(f"Label Set evaluation F1 score: {evaluation.f1()}")
Architecture overview¶
We’ll take a closer look at the most important steps of the document-processing pipeline and give you some examples of how they can be implemented, either consecutively or individually, in case you want to experiment.
The first step we’re going to cover is File Splitting – this happens when the original Document consists of several smaller sub-Documents and needs to be separated so that each one can be processed individually.
The second part is on Categorization, where a Document is labelled as being of a certain Category within the Project.
The third part describes Information Extraction, during which various pieces of information are obtained from unstructured text, e.g. Name, Date, Recipient, or any other custom Labels.
For a more in-depth look at each step, be sure to check out the diagram that reflects each step of the document-processing pipeline.
Splitting for multi-file Documents: Step-by-step guide¶
Intro¶
It’s common for multipage files to not be perfectly organized, and in some cases, multiple independent Documents may be included in a single file. To ensure that these Documents are properly processed and separated, we will be discussing a method for identifying and splitting them into individual, independent sub-documents.

Multi-file Document Example
In this section, we will explore an easy method for identifying and separating Documents that may be included in a single file. Our approach involves analyzing the contents of each Page and identifying similarities to the first Pages of the Document. This will allow us to define splitting points and divide the Document into multiple sub-documents. It’s important to note that this approach is only effective for Documents written in the same language and that the process must be repeated for each Category.
If you are unfamiliar with the SDK’s main concepts (like Page or Span), you can get to know them on the Quickstart page.
Quick explanation¶
The first step in implementing this method is “training”: this involves tokenizing the Document by splitting its text into parts, specifically into strings without line breaks. We then gather the exclusive strings from Spans, which are the parts of the text in the Page, and compare them to the first Pages of each Document in the training data.
Once we have identified these strings, we can use them to determine whether a Page in an input Document is a first Page or not. We do this by going through the strings in the Page and comparing them to the set of strings collected in the training stage. If we find at least one string that intersects between the current Page and the strings from the first step, we believe it is the first Page.
Note that the more Documents we use in the training stage, the less intersecting strings we are likely to find. If you find that your set of first-page strings is empty, try using a smaller slice of the dataset instead of the whole set. Generally, when used on Documents within the same Category, this algorithm should not return an empty set. If that is the case, it’s worth checking if your data is consistent, for example, not in different languages or containing other Categories.
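The underlying check is just a set intersection. Here is a minimal sketch of the idea with plain Python sets and hypothetical strings, without any SDK objects:
# "training": strings that only ever occur on first Pages of the training Documents (hypothetical values)
first_page_strings = {"Abrechnung der Brutto/Netto-Bezüge", "Personal-Nr."}
# "prediction": strings found on an incoming Page (hypothetical values)
page_strings = {"Personal-Nr.", "SV-Nummer", "Eintritt"}
# at least one shared string means the Page is treated as a first Page
is_first_page = len(first_page_strings & page_strings) > 0
print(is_first_page)  # True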
Step-by-step explanation¶
In this section, we will walk you through the process of setting up the ContextAwareFileSplittingModel class, which can be found in the code block at the bottom of this page. This class is already implemented and can be imported using from konfuzio_sdk.trainer.file_splitting import ContextAwareFileSplittingModel.
Note that any custom FileSplittingAI (derived from the AbstractFileSplittingModel class) requires having the following methods implemented:
__init__ to initialize key variables required by the custom AI;
fit to define the architecture and training that the model undergoes, i.e. a certain NN architecture or custom hardcoded logic;
predict to define how the model classifies Pages as first or non-first. NB: the classification needs to be run on the Page level, not the Document level. The result of classification is reflected in the is_first_page attribute value, which is unique to the Page class and is not present in the Document class. Pages with is_first_page = True become splitting points; thus, each new sub-Document has a Page predicted as first as its starting point.
To begin, we will make all the necessary imports and initialize the ContextAwareFileSplittingModel
class:
import logging
from typing import List
from konfuzio_sdk.data import Page, Category
from konfuzio_sdk.trainer.file_splitting import AbstractFileSplittingModel
from konfuzio_sdk.trainer.information_extraction import load_model
from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer
class ContextAwareFileSplittingModel(AbstractFileSplittingModel):
    def __init__(self, categories: List[Category], tokenizer, *args, **kwargs):
        super().__init__(categories=categories)
        self.name = self.__class__.__name__
        self.output_dir = self.project.model_folder
        self.tokenizer = tokenizer
        self.requires_text = True
        self.requires_images = False
The class inherits from AbstractFileSplittingModel
, so we run super().__init__(categories=categories)
to properly
inherit its attributes. The tokenizer
attribute will be used to process the text within the Document, separating it
into Spans. This is done to ensure that the text in all the Documents is split using the same logic (in particular, tokenization by splitting on \n line breaks with the ConnectedTextTokenizer, which is used in the example at the end of this page) so that it is possible to find common Spans. The tokenizer will be used for training and testing Documents as well as any
Document that will undergo splitting. It’s important to note that if you run fitting with one tokenizer and then
reassign it within the same instance of the model, all previously gathered strings will be deleted and replaced by new
ones. requires_images
and requires_text
determine whether these types of data are used for prediction; this is
needed for distinguishing between preprocessing types once a model is passed into the SplittingAI.
An example of how ConnectedTextTokenizer works:
# before tokenization
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)
test_document.text
# output: "This is an example text. \n It has several lines. \n Here it finishes."
test_document.spans()
# output: []
test_document = tokenizer.tokenize(test_document)
# after tokenization
test_document.spans()
# output: [Span (0, 24), Span(25, 47), Span(48, 65)]
test_document.spans()[0].offset_string
# output: "This is an example text. "
The first method to define is the fit() method. For each Category, we call the exclusive_first_page_strings method,
which allows us to gather the strings that appear on the first Page of each Document. allow_empty_categories
allows
for returning empty lists for Categories that haven’t had any exclusive first-page strings found across their Documents.
This means that those Categories would not be used in the prediction process.
    def fit(self, allow_empty_categories: bool = False, *args, **kwargs):
        for category in self.categories:
            # method exclusive_first_page_strings fetches a set of first-page strings exclusive among the Documents
            # of a given Category. they can be found in _exclusive_first_page_strings attribute of a Category after
            # the method has been run. this is needed so that the information remains even if local variable
            # cur_first_page_strings is lost.
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            if not cur_first_page_strings:
                if allow_empty_categories:
                    logger.warning(
                        f'No exclusive first-page strings were found for {category}, so it will not be used '
                        f'at prediction.'
                    )
                else:
                    raise ValueError(f'No exclusive first-page strings were found for {category}.')
Lastly, we define the predict() method. The method accepts a Page as input and checks whether its set of Spans contains any of the first-page strings of each Category. If there is at least one intersection, the Page is predicted to be a first Page; if there are no intersections, the Page is predicted to be a non-first Page.
    def predict(self, page: Page) -> Page:
        for category in self.categories:
            # exclusive_first_page_strings calls an implicit _exclusive_first_page_strings attribute once it was
            # already calculated during fit() method so it is not a recurrent calculation each time.
            if not category.exclusive_first_page_strings(tokenizer=self.tokenizer):
                raise ValueError(f"Cannot run prediction as {category} does not have exclusive_first_page_strings.")
        page.is_first_page = False
        for category in self.categories:
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            intersection = {span.offset_string for span in page.spans()}.intersection(cur_first_page_strings)
            if len(intersection) > 0:
                page.is_first_page = True
                break
        return page
A quick example of the class’s usage:
# initialize a Project and fetch a test Document of your choice
project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)
# initialize a ContextAwareFileSplittingModel and fit it
file_splitting_model = ContextAwareFileSplittingModel(categories=project.categories, tokenizer=ConnectedTextTokenizer())
file_splitting_model.fit()
# save the model
file_splitting_model.output_dir = project.model_folder
file_splitting_model.save()
# run the prediction
for page in test_document.pages():
    pred = file_splitting_model.predict(page)
    if pred.is_first_page:
        print('Page {} is predicted as the first.'.format(page.number))
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))
# usage with the SplittingAI – you can load a pre-saved model or pass an initialized instance as the input
# in this example, we load a previously saved one
model = load_model(project.model_folder)
# initialize the SplittingAI
splitting_ai = SplittingAI(model)
# SplittingAI is a more high-level interface to ContextAwareFileSplittingModel and any other models that can be
# developed for file-splitting purposes. It takes a Document as an input, rather than individual Pages, because it
# utilizes page-level prediction of possible split points and returns Document or Documents with changes depending on
# the prediction mode.
# SplittingAI can be run in two modes: returning a list of sub-Documents as the result of the input Document
# splitting or returning a copy of the input Document with Pages predicted as first having an attribute
# "is_first_page". The flag "return_pages" has to be True for the latter; let's use it
new_document = splitting_ai.propose_split_documents(test_document, return_pages=True)
print(new_document)
# output: [predicted_document]
for page in new_document[0].pages():
    if page.is_first_page:
        print('Page {} is predicted as the first.'.format(page.number))
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))
Full code:
import logging
from typing import List
from konfuzio_sdk.data import Page, Category, Project
from konfuzio_sdk.trainer.file_splitting import AbstractFileSplittingModel, SplittingAI
from konfuzio_sdk.trainer.information_extraction import load_model
from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer
logger = logging.getLogger(__name__)
class ContextAwareFileSplittingModel(AbstractFileSplittingModel):
    """Fallback definition of a File Splitting Model."""

    def __init__(self, categories: List[Category], tokenizer, *args, **kwargs):
        super().__init__(categories=categories)
        self.name = self.__class__.__name__
        self.output_dir = self.project.model_folder
        self.tokenizer = tokenizer
        self.requires_text = True
        self.requires_images = False
    def fit(self, allow_empty_categories: bool = False, *args, **kwargs):
        for category in self.categories:
            # method exclusive_first_page_strings fetches a set of first-page strings exclusive among the Documents
            # of a given Category. they can be found in _exclusive_first_page_strings attribute of a Category after
            # the method has been run. this is needed so that the information remains even if local variable
            # cur_first_page_strings is lost.
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            if not cur_first_page_strings:
                if allow_empty_categories:
                    logger.warning(
                        f'No exclusive first-page strings were found for {category}, so it will not be used '
                        f'at prediction.'
                    )
                else:
                    raise ValueError(f'No exclusive first-page strings were found for {category}.')
    def predict(self, page: Page) -> Page:
        for category in self.categories:
            if not category.exclusive_first_page_strings(tokenizer=self.tokenizer):
                # exclusive_first_page_strings calls an implicit _exclusive_first_page_strings attribute once it was
                # already calculated during fit() method so it is not a recurrent calculation each time.
                raise ValueError(f"Cannot run prediction as {category} does not have _exclusive_first_page_strings.")
        page.is_first_page = False
        for category in self.categories:
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            intersection = {span.offset_string for span in page.spans()}.intersection(cur_first_page_strings)
            if len(intersection) > 0:
                page.is_first_page = True
                break
        return page
# initialize a Project and fetch a test Document of your choice
project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)
# initialize a ContextAwareFileSplittingModel and fit it
file_splitting_model = ContextAwareFileSplittingModel(categories=project.categories, tokenizer=ConnectedTextTokenizer())
file_splitting_model.fit()
# save the model
file_splitting_model.save()
# run the prediction
for page in test_document.pages():
    pred = file_splitting_model.predict(page)
    if pred.is_first_page:
        print('Page {} is predicted as the first.'.format(page.number))
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))
# usage with the SplittingAI – you can load a pre-saved model or pass an initialized instance as the input
# in this example, we load a previously saved one
model = load_model(project.model_folder)
# initialize the SplittingAI
splitting_ai = SplittingAI(model)
# SplittingAI is a more high-level interface to ContextAwareFileSplittingModel and any other models that can be
# developed for file-splitting purposes. It takes a Document as an input, rather than individual Pages, because it
# utilizes page-level prediction of possible split points and returns Document or Documents with changes depending on
# the prediction mode.
# SplittingAI can be run in two modes: returning a list of sub-Documents as the result of the input Document
# splitting or returning a copy of the input Document with Pages predicted as first having an attribute
# "is_first_page". The flag "return_pages" has to be True for the latter; let's use it
new_document = splitting_ai.propose_split_documents(test_document, return_pages=True)
print(new_document)
# output: [predicted_document]
for page in new_document[0].pages():
    if page.is_first_page:
        print('Page {} is predicted as the first.'.format(page.number))
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))
Split multi-file Document into Separate files without training a model¶
Let’s see how to use the konfuzio_sdk
to automatically split documents consisting of
several files. We will be using a pre-built class SplittingAI. The class implements a context-aware rule-based logic
that requires no training.
from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer
from konfuzio_sdk.trainer.file_splitting import ContextAwareFileSplittingModel, SplittingAI
from konfuzio_sdk.trainer.information_extraction import load_model
project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)
# initialize a ContextAwareFileSplittingModel and fit it
file_splitting_model = ContextAwareFileSplittingModel(categories=project.categories, tokenizer=ConnectedTextTokenizer())
file_splitting_model.fit()
# save the model
file_splitting_model.output_dir = project.model_folder
file_splitting_model.save()
# run the prediction
for page in test_document.pages():
    pred = file_splitting_model.predict(page)
    if pred.is_first_page:
        print('Page {} is predicted as the first.'.format(page.number))
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))
# usage with the SplittingAI – you can load a pre-saved model or pass an initialized instance as the input
# in this example, we load a previously saved one
model = load_model(project.model_folder)
# initialize the SplittingAI
splitting_ai = SplittingAI(model)
# SplittingAI is a more high-level interface to ContextAwareFileSplittingModel and any other models that can be
# developed for file-splitting purposes. It takes a Document as an input, rather than individual Pages, because it
# utilizes page-level prediction of possible split points and returns Document or Documents with changes depending on
# the prediction mode.
# SplittingAI can be run in two modes: returning a list of sub-Documents as the result of the input Document
# splitting or returning a copy of the input Document with Pages predicted as first having an attribute
# "is_first_page". The flag "return_pages" has to be True for the latter; let's use it
new_document = splitting_ai.propose_split_documents(test_document, return_pages=True)
print(new_document)
# output: [predicted_document]
for page in new_document[0].pages():
    if page.is_first_page:
        print('Page {} is predicted as the first.'.format(page.number))
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))
FileSplittingEvaluation class¶
The FileSplittingEvaluation class can be used to evaluate the performance of the ContextAwareFileSplittingModel, returning a set of metrics that includes precision, recall, F1 measure, true positives, false positives, false negatives and true negatives.
The class’s methods calculate() and calculate_by_category() are run at initialization. The class receives two lists of Documents as input: the first list consists of ground-truth Documents in which all first Pages are marked as such, and the second consists of Documents on whose Pages the FileSplittingModel has run its first-or-non-first prediction.
The initialization would look like this:
evaluation = FileSplittingEvaluation(
    ground_truth_documents=YOUR_GROUND_TRUTH_LIST,
    prediction_documents=YOUR_PREDICTION_LIST,
)
The class compares each pair of Pages. If a Page is labeled as first and the model also predicted it as first, it is considered a true positive. If a Page is labeled as first but the model predicted it as non-first, it is considered a false negative. If a Page is labeled as non-first but the model predicted it as first, it is considered a false positive. If a Page is labeled as non-first and the model also predicted it as non-first, it is considered a true negative.
|                | predicted correctly | predicted incorrectly |
|----------------|---------------------|-----------------------|
| first Page     | TP                  | FN                    |
| non-first Page | TN                  | FP                    |
After iterating through all Pages of all Documents, precision, recall and f1 measure are calculated. If you wish to set
metrics to None
in case there has been an attempt of zero division, set allow_zero=True
at the initialization.
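The metrics follow the standard precision/recall/F1 definitions. A small worked sketch with hypothetical counts (they happen to match the example further below):
tp, fp, fn = 3, 0, 1
precision = tp / (tp + fp)  # 1.0
recall = tp / (tp + fn)  # 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.857..., shown rounded as 0.85 below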
To see a certain metric after the class has been initialized, you can call a metric’s method:
print(evaluation.fn())
It is also possible to look at the metrics calculated by each Category independently. For this, pass search=YOUR_CATEGORY_HERE
when calling the wanted metric’s method:
print(evaluation.fn(search=YOUR_CATEGORY_HERE))
For more details, see the Python API Documentation on Evaluation.
Example of evaluation input and output¶
Suppose our test dataset contains 2 Documents of 2 different Categories: one 3-page Document of the first Category, consisting of a single file (so it has only one ground-truth first Page), and one 5-page Document of the second Category, consisting of three files, two 2-page and one 1-page (so it has three ground-truth first Pages).

First document

Second document
# generate the test Documents
from konfuzio_sdk.data import Category, Project, Document, Page
from konfuzio_sdk.evaluate import FileSplittingEvaluation
from konfuzio_sdk.trainer.file_splitting import SplittingAI
text_1 = "Hi all,\nI like bread.\nI hope to get everything done soon.\nHave you seen it?"
document_1 = Document(id_=None, project=YOUR_PROJECT, category=YOUR_CATEGORY_1, text=text_1, dataset_status=3)
_ = Page(id_=None, original_size=(320, 240), document=document_1, start_offset=0, end_offset=21, number=1)
_ = Page(id_=None, original_size=(320, 240), document=document_1, start_offset=22, end_offset=57, number=2)
_ = Page(id_=None, original_size=(320, 240), document=document_1, start_offset=58, end_offset=75, number=3)
text_2 = "Good evening,\nthank you for coming.\nCan you give me that?\nI need it.\nSend it to me."
document_2 = Document(id_=None, project=YOUR_PROJECT, category=YOUR_CATEGORY_2, text=text_2, dataset_status=3)
_ = Page(id_=None, original_size=(320, 240), document=document_2, start_offset=0, end_offset=12, number=1)
_ = Page(id_=None, original_size=(320, 240), document=document_2, start_offset=13, end_offset=34, number=2)
_ = Page(id_=None, original_size=(320, 240), document=document_2, start_offset=35, end_offset=56, number=3)
_.is_first_page = True
_ = Page(id_=None, original_size=(320, 240), document=document_2, start_offset=57, end_offset=67, number=4)
_ = Page(id_=None, original_size=(320, 240), document=document_2, start_offset=68, end_offset=82, number=5)
_.is_first_page = True
We need to pass two lists of Documents into the FileSplittingEvaluation
class. So, before that, we need to run each
page of the documents through the model’s prediction.
Let’s say the evaluation gave good results, with only one first page being predicted as non-first and all the other pages being predicted correctly. An example of how the evaluation would be implemented would be:
splitting_ai = SplittingAI(YOUR_MODEL_HERE)
pred_1: Document = splitting_ai.propose_split_documents(document_1, return_pages=True)[0]
pred_2: Document = splitting_ai.propose_split_documents(document_2, return_pages=True)[0]
evaluation = FileSplittingEvaluation(
    ground_truth_documents=[document_1, document_2],
    prediction_documents=[pred_1, pred_2],
)
print(evaluation.tp()) # returns: 3
print(evaluation.tn()) # returns: 4
print(evaluation.fp()) # returns: 0
print(evaluation.fn()) # returns: 1
print(evaluation.precision()) # returns: 1
print(evaluation.recall()) # returns: 0.75
print(evaluation.f1()) # returns: 0.85
Our results can be summarized in the following table:

| TPs | TNs | FPs | FNs | precision | recall | F1   |
|-----|-----|-----|-----|-----------|--------|------|
| 3   | 4   | 0   | 1   | 1         | 0.75   | 0.85 |
If we want to see evaluation results by Category, the implementation of the Evaluation would look like this:
print(evaluation.tp(search=CATEGORY_1), evaluation.tp(search=CATEGORY_2)) # returns: 1 2
print(evaluation.tn(search=CATEGORY_1), evaluation.tn(search=CATEGORY_2)) # returns: 2 2
print(evaluation.fp(search=CATEGORY_1), evaluation.fp(search=CATEGORY_2)) # returns: 0 0
print(evaluation.fn(search=CATEGORY_1), evaluation.fn(search=CATEGORY_2)) # returns: 0 1
print(evaluation.precision(search=CATEGORY_1), evaluation.precision(search=CATEGORY_2)) # returns: 1 1
print(evaluation.recall(search=CATEGORY_1), evaluation.recall(search=CATEGORY_2)) # returns: 1 0.66
print(evaluation.f1(search=CATEGORY_1), evaluation.f1(search=CATEGORY_2)) # returns: 1 0.79
The output can be summarized in the following table:

| Category   | TPs | TNs | FPs | FNs | precision | recall | F1   |
|------------|-----|-----|-----|-----|-----------|--------|------|
| Category 1 | 1   | 2   | 0   | 0   | 1         | 1      | 1    |
| Category 2 | 2   | 2   | 0   | 1   | 1         | 0.66   | 0.79 |
To log metrics after evaluation, you can call the metrics_logging method of EvaluationCalculator (you would need to specify the metrics accordingly at the class’s initialization). Example usage:
EvaluationCalculator(tp=3, fp=0, fn=1, tn=4).metrics_logging()
Document Categorization¶
Categorization Fallback Logic¶
Use the name of the category as an effective fallback logic to categorize documents when no categorization AI is available:
from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.document_categorization import FallbackCategorizationModel
project = Project(id_=YOUR_PROJECT_ID)
categorization_model = FallbackCategorizationModel(project)
categorization_model.categories = project.categories
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)
# returns virtual SDK Document with category attribute
result_doc = categorization_model.categorize(document=test_document)
# if the input document is already categorized, the already present category is used
# unless recategorize is True
result_doc = categorization_model.categorize(document=test_document, recategorize=True)
print(f"Found category {result_doc.category} for {result_doc}")
# option to modify the provided document in place
categorization_model.categorize(document=test_document, inplace=True)
print(f"Found category {test_document.category} for {test_document}")
Train a Konfuzio SDK Model to Extract Information From Payslip Documents¶
The tutorial RFExtractionAI Demo aims to show you how to use the Konfuzio SDK package with a simple Whitespace tokenizer to train an “RFExtractionAI” model that finds and extracts relevant information like Name, Date and Recipient from payslip documents.
You can download it from here and try it by yourself.
Retrain Flair NER-Ontonotes-Fast with Human Revised Annotations¶
The tutorial Retrain Flair NER-Ontonotes-Fast with Human Revised Annotations aims to show you how to use the Konfuzio SDK package to include an easy feedback workflow in your training pipeline. It also gives an example of how you can take advantage of open-source models to speed up the annotation process and use the feedback workflow to adapt the domain knowledge of the model to your aim.
You can download it from here and try it by yourself.
Count Relevant Expressions in Annual Reports¶
The tutorial Count Relevant Expressions in Annual Reports aims to show you how to use the Konfuzio SDK package to retrieve structured and organized information that can be used for a deeper analysis and understanding of your data. It will show you how to identify and count pre-specified expressions in documents and how to collect that information.
You can download it from here and try it by yourself.