Explanations¶
Explanation is discussion that clarifies and illuminates a particular topic. Explanation is understanding-oriented.
Data Layer Concepts¶
The relations between the major Data Layer concepts of the SDK are as follows: a Project consists of multiple Documents. Each Document consists of Pages and belongs to a certain Category. Text in a Document can be marked by Annotations, which can span multiple lines; each continuous piece of text contained in an Annotation is a Span. Each Annotation is located within a certain Bbox and is defined by a Label that is part of one of the Label Sets. An Annotation Set is a list of Annotations that share a Label Set.
For more detailed information on each concept, follow the link on the concept’s name, which leads to the automatically generated documentation.
Project¶
Project is essentially a dataset that contains Documents belonging to different Categories or not having any Category assigned. To initialize it, call Project(id_=YOUR_PROJECT_ID).
The Project can also be accessed via the Smartview, with a URL typically looking like https://YOUR_HOST/admin/server/document/?project=YOUR_PROJECT_ID.
If you have made local changes to the Project and want to return to the initial version available on the server, or if you want to fetch the updates from the server, use the argument update=True.
Here are some of the properties and methods of the Project you might need when working with the SDK:
- project.documents – training Documents within the Project;
- project.test_documents – test Documents within the Project;
- project.get_category_by_id(YOUR_CATEGORY_ID).documents() – Documents filtered by a Category of your choice;
- project.get_document_by_id(YOUR_DOCUMENT_ID) – access a particular Document from the Project if you know its ID.
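For example, a minimal sketch that initializes a Project and lists its Documents (YOUR_PROJECT_ID is a placeholder for a real ID):
from konfuzio_sdk.data import Project
# initialize the Project and fetch the latest state from the server
my_project = Project(id_=YOUR_PROJECT_ID, update=True)
# list training and test Documents with their IDs
for document in my_project.documents:
    print('training:', document.id_)
for document in my_project.test_documents:
    print('test:', document.id_)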
Document¶
Document is one of the files that constitute a Project. It consists of Pages and can belong to a certain Category.
A Document can be accessed by project.get_document_by_id(YOUR_DOCUMENT_ID) when its ID is known to you; otherwise, it is possible to iterate through the output of project.documents (or test_documents/_documents) to see which Documents are available and what IDs they have.
The Documents can also be accessed via the Smartview, with a URL typically looking like https://YOUR_HOST/projects/PROJECT_ID/docs/DOCUMENT_ID/bbox-annotations/.
Here are some of the properties and methods of the Document you might need when working with the SDK:
- document.id_ – get the ID of the Document;
- document.text – get the full text of the Document;
- document.pages() – get a list of Pages in the Document;
- document.update() – download a newer version of the Document from the server in case you have made changes in the Smartview;
- document.category() – get the Category the Document belongs to;
- document.get_images() – download PNG images of the Pages in the Document; useful if you wish to use the visual data for training your own models, for example.
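As a brief sketch, reading a Document's basic content might look like this (YOUR_DOCUMENT_ID is a placeholder):
from konfuzio_sdk.data import Project
my_project = Project(id_=YOUR_PROJECT_ID)
document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
# the Document's ID and its number of Pages
print(document.id_, len(document.pages()))
# the beginning of the Document's full text
print(document.text[:100])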
Category¶
Category is a group of Documents united by a common feature or type, e.g. invoice or receipt.
To see all Categories in the Project, you can use project.get_categories(). To find the Category a Document belongs to, you can use document.category. To get the documents or test_documents under a Category, use category.documents() or category.test_documents(), respectively.
You can also observe all Categories available in the Project via the Smartview: they are listed on the Project’s page in the menu on the right.
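For instance, a minimal sketch that filters Documents by a Category (YOUR_CATEGORY_ID is a placeholder):
from konfuzio_sdk.data import Project
my_project = Project(id_=YOUR_PROJECT_ID)
category = my_project.get_category_by_id(YOUR_CATEGORY_ID)
# count the training and test Documents assigned to this Category
print(len(category.documents()), len(category.test_documents()))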
Page¶
Page is a constituent part of the Document. Here are some of the properties and methods of the Page you might need when working with the SDK:
- page.text – get the text of the Page;
- page.spans() – get a list of Spans on the Page;
- page.number – get the Page’s number, starting from 1.
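A short sketch that iterates over the Pages of a Document:
from konfuzio_sdk.data import Project
my_project = Project(id_=YOUR_PROJECT_ID)
document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
for page in document.pages():
    # Page numbers start from 1
    print(page.number, len(page.spans()))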
Span¶
Span is a part of the Document’s text without line breaks. Each Span has start_offset and end_offset denoting its starting and ending characters in document.text. To access the Span’s text, you can call span.offset_string. We are going to use it later when collecting the Spans from the Documents.
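For example, a sketch showing that a Span's offset_string corresponds to the slice of document.text between its offsets:
from konfuzio_sdk.data import Project
my_project = Project(id_=YOUR_PROJECT_ID)
document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
# take the first Span on the first Page
span = document.pages()[0].spans()[0]
print(span.offset_string)
print(document.text[span.start_offset:span.end_offset])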
Annotation¶
Annotation is a combination of Spans that has a certain Label (e.g. Issue_Date, Auszahlungsbetrag) assigned to it. Annotations typically denote a certain type of entity found in the text. They can be predicted by an AI or added by a human.
Like Spans, Annotations have start_offset and end_offset denoting the starting and ending characters. To access the text under the Annotation, call annotation.offset_string. To see the Annotation in the Smartview, you can call annotation.get_link() and open the returned URL.
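As a sketch, the Annotations of a Document can be inspected like this (assuming document.annotations() lists them and each Annotation exposes its Label via annotation.label):
from konfuzio_sdk.data import Project
my_project = Project(id_=YOUR_PROJECT_ID)
document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
for annotation in document.annotations():
    # the Label's name and the text covered by the Annotation
    print(annotation.label.name, annotation.offset_string)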
Annotation Set¶
Annotation Set is a group of Annotations united by Labels belonging to the same Label Set. To see the Annotations in a set, call annotation_set.annotations().
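A minimal sketch, assuming document.annotation_sets() lists the Annotation Sets of a Document and each set references its Label Set via annotation_set.label_set (both are assumptions based on the SDK's naming):
from konfuzio_sdk.data import Project
my_project = Project(id_=YOUR_PROJECT_ID)
document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
for annotation_set in document.annotation_sets():  # assumed accessor
    print(annotation_set.label_set.name, len(annotation_set.annotations()))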
Label¶
Label defines what an Annotation is about (e.g. Issue_Date, Auszahlungsbetrag). Labels are grouped into Label Sets. To see the Annotations with a given Label, call label.annotations().
Label Set¶
Label Set is a group of related Labels. A Label Set can belong to different Categories and multiple Annotation Sets.
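As a sketch, assuming the Project exposes its Label Sets via project.label_sets and each Label Set lists its Labels via label_set.labels (both accessors are assumptions about the data layer's naming):
from konfuzio_sdk.data import Project
my_project = Project(id_=YOUR_PROJECT_ID)
for label_set in my_project.label_sets:  # assumed accessor
    for label in label_set.labels:  # assumed accessor
        print(label_set.name, label.name)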
Bbox¶
Bbox is an area of the Page denoted by four coordinates (x0, y0, x1, y1) that define a rectangle. You can access the Bboxes of the Document by calling document.bboxes.
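A minimal sketch, assuming document.bboxes is a mapping from a character's index in document.text to that character's Bbox (an assumption based on the code examples below, which read x0, x1, y0, y1 per character):
from konfuzio_sdk.data import Project
my_project = Project(id_=YOUR_PROJECT_ID)
document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
# inspect the Bbox of the first character that has one (assumed mapping)
index, bbox = next(iter(document.bboxes.items()))
print(index, bbox)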
Architecture SDK to Server¶
The following chart is automatically created from the version of the diagram on the branch master; see source.
If you hover over the image, you can zoom or use the full-page mode.
If you want to edit the diagram, please refer to the GitHub Drawio Documentation.
Directory Structure¶
├── konfuzio-sdk <- SDK project name
│ │
│ ├── docs <- Documentation to use konfuzio_sdk package in a project
│ │
│ ├── konfuzio_sdk <- Source code of Konfuzio SDK
│ │ ├── __init__.py <- Makes konfuzio_sdk a Python module
│ │ ├── api.py <- Functions to interact with the Konfuzio Server
│ │ ├── cli.py <- Command Line interface to the konfuzio_sdk package
│ │ ├── data.py <- Functions to handle data from the API
│ │ ├── settings_importer.py <- Meta settings loaded from the project
│ │ ├── urls.py <- Endpoints of the Konfuzio host
│ │ └── utils.py <- Utils functions for the konfuzio_sdk package
│ │
│ ├── tests <- Pytests: basic tests to test scripts based on a demo project
│ │
│ ├── .gitignore <- Specify files untracked and ignored by git
│ ├── README.md <- Readme to get to know konfuzio_sdk package
│ ├── pytest.ini <- Configurations for pytests
│ ├── settings.py <- Settings of SDK project
│ ├── setup.cfg <- Setup configurations
│ ├── setup.py <- Installation requirements
Coordinates System¶
The size of a Page of a Document can be obtained from the Document object. The format is [width, height].
The original size of the Document is the size of the uploaded Document (which can be a PDF file or an image). The bounding boxes of the Annotations are based on this size. E.g. [1552, 1932].
The current size can be accessed by calling height and width on the Page object. They show the dimensions of the image representation of a Document Page. These representations are used for computer vision tasks and the SmartView. E.g. [372.48, 463.68].
from konfuzio_sdk.data import Project
my_project = Project(id_=YOUR_PROJECT_ID)
# first Document uploaded
document = my_project.documents[0]
# index of the Page to test
page_index = 0
width = document.pages()[page_index].width
height = document.pages()[page_index].height
The coordinate system used has its origin in the bottom-left corner of the page.

To visualize the character bounding boxes of a Document and overlay them on the image opened with the Python library PIL, for example, we can resize the image to the size on which they are based (the original size). The following code can be used for this:
from PIL import ImageDraw
from konfuzio_sdk.data import Project
my_project = Project(id_=YOUR_PROJECT_ID, strict_data_validation=False)
# Document to visualize
document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
# index of the Page to test
page_index = 0
page = document.pages()[page_index]
# original size of the Page
width = page.width
height = page.height
image = page.get_image(update=True)
# scale factors between the image size and the original size
factor_x = width / image.width
factor_y = height / image.height
image = image.convert('RGB')
image = image.resize((int(image.size[0] * factor_x), int(image.size[1] * factor_y)))
height = image.size[1]
image_characters_bbox = [char_bbox for _, char_bbox in page.get_bbox().items()]
draw = ImageDraw.Draw(image)
for bbox in image_characters_bbox:
    image_bbox = (
        int(bbox["x0"]),
        int(height - bbox["y1"]),
        int(bbox["x1"]),
        int(height - bbox["y0"]),
    )
    draw.rectangle(image_bbox, outline='green', width=1)
image
# Note: PIL, like cv2, has the origin of the y coordinates in the upper-left corner. Therefore, for
# visualization, the y coordinates are subtracted from the height of the image.

The coordinates obtained from the segmentation endpoint of the API are based on the image array shape. To visualize the segmentation bounding boxes of a Page on an image opened with the Python library PIL, for example, we can overlay them directly.
from PIL import ImageDraw
from konfuzio_sdk.data import Project
from konfuzio_sdk.api import get_results_from_segmentation
my_project = Project(id_=YOUR_PROJECT_ID, strict_data_validation=False)
# Document to visualize
document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
# index of the Page to test
page_index = 0
page = document.pages()[page_index]
image = page.get_image(update=True)
image = image.convert('RGB')
draw = ImageDraw.Draw(image)
# segmentation results for every Page of the Document
image_segmentation_bboxes = get_results_from_segmentation(document.id_, my_project.id_)
for bbox in image_segmentation_bboxes[page_index]:
    image_bbox = (
        int(bbox["x0"]),
        int(bbox["y0"]),
        int(bbox["x1"]),
        int(bbox["y1"]),
    )
    draw.rectangle(image_bbox, outline='red', width=1)
image

To visualize both at the same time, we can convert the coordinates from the segmentation result to be based on the image size used for the characters' bboxes.
from PIL import ImageDraw
from konfuzio_sdk.data import Project
from konfuzio_sdk.api import get_results_from_segmentation
my_project = Project(id_=YOUR_PROJECT_ID, strict_data_validation=False)
# Document to visualize
document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
# index of the Page to test
page_index = 0
page = document.pages()[page_index]
# original size of the Page
width = page.width
height = page.height
image = page.get_image(update=True)
# scale factors between the image size and the original size
factor_x = width / image.width
factor_y = height / image.height
image = image.convert('RGB')
image = image.resize((int(image.size[0] * factor_x), int(image.size[1] * factor_y)))
height = image.size[1]
image_characters_bbox = [char_bbox for _, char_bbox in page.get_bbox().items()]
draw = ImageDraw.Draw(image)
# character Bboxes: flip the y-axis; no scaling needed after the resize
for bbox in image_characters_bbox:
    image_bbox = (
        int(bbox["x0"]),
        int(height - bbox["y1"]),
        int(bbox["x1"]),
        int(height - bbox["y0"]),
    )
    draw.rectangle(image_bbox, outline='green', width=1)
# segmentation Bboxes: scale from the image array shape to the resized image
image_segmentation_bboxes = get_results_from_segmentation(document.id_, my_project.id_)
for bbox in image_segmentation_bboxes[page_index]:
    image_bbox = (
        int(bbox["x0"] * factor_x),
        int(bbox["y0"] * factor_y),
        int(bbox["x1"] * factor_x),
        int(bbox["y1"] * factor_y),
    )
    draw.rectangle(image_bbox, outline='red', width=1)
image

Our Extraction AI runs merging logic at two steps in the extraction process. The first is a horizontal merging of Spans right after the Label classifier; this can be particularly useful when using the Whitespace tokenizer, as it can find Spans containing spaces. The second is a vertical merging of Spans into a single multi-line Annotation. Check out the architecture diagram for more detail.
Horizontal Merge¶
When using an Extraction AI, we merge adjacent horizontal Spans right after the Label classifier. The confidence of the resulting new Span is taken to be the mean confidence of the original Spans being merged.
A horizontal merging is valid only if:
- all Spans have the same predicted Label;
- the confidence of the predicted Label is above the Label threshold;
- all Spans are on the same line;
- the Spans are not overlapping;
- there are no extraneous characters in between the Spans;
- there are at most 5 spaces in between the Spans;
- the Label type is not one of the following: ‘Number’, ‘Positive Number’, ‘Percentage’, ‘Date’, OR the resulting merge creates a Span that is normalizable to the same type.
These rules are sketched in code after the table below.
Input | Able to merge? | Reason | Result
---|---|---|---
Text Annotation | yes | / | Text Annotation
Text Annotation | no |  | Text Annotation
Text . Annotation | no |  | Text . Annotation
Annotation 7 | no |  | Annotation 7
34 98 | no |  | 34 98
34 98 | yes | / | 34 98
November 2022 | yes | / | November 2022
Novamber 2022 | no |  | Novamber 2022
34 98% | yes | / | 34 98%
34 98% | no |  | 34 98%

In the rendered documentation, the Label type of each Span in this table is indicated by color: Text, Number, Date, Percentage, or NO LABEL/Below Label threshold. This coloring is what distinguishes visually identical rows.
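A minimal, hypothetical sketch of the horizontal-merge checks (not the SDK's actual implementation; the Span attributes label, confidence, start_offset, and end_offset, and the Label attribute threshold, are assumptions here, and the data-type normalization check is omitted):
def can_merge_horizontally(document_text, left, right, max_spaces=5):
    # rule: all Spans have the same predicted Label
    if left.label is not right.label:
        return False
    # rule: confidence of the predicted Label is above the Label threshold
    if min(left.confidence, right.confidence) < left.label.threshold:
        return False
    # rule: Spans are not overlapping (left is assumed to precede right)
    if left.end_offset > right.start_offset:
        return False
    gap = document_text[left.end_offset:right.start_offset]
    # rule: Spans are on the same line
    if '\n' in gap:
        return False
    # rule: no extraneous characters in between the Spans
    if gap.strip() != '':
        return False
    # rule: a maximum of 5 spaces in between the Spans
    if len(gap) > max_spaces:
        return False
    # the data-type normalization rule is omitted in this sketch
    return True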
Vertical Merge¶
When using an Extraction AI, we join adjacent vertical Spans into a single Annotation after the LabelSet classifier.
A vertical merging is valid only if:
- the Spans are on the same Page;
- they are predicted to have the same Label;
- multi-line Annotations with this Label exist in the training set;
- consecutive vertical Spans either overlap on the x-axis, OR the preceding Span is at the end of its line and the following Span is at the beginning of the next;
- the confidence of the predicted Label is above the Label threshold;
- the Spans are on consecutive lines;
- the merged lower Span belongs to an Annotation in the same AnnotationSet, OR to an AnnotationSet with only a single Annotation.
These rules are sketched in code after the table below.
Input | Able to merge? | Reason
---|---|---
Text | yes | /
Annotation | no | 
Text more text | no | 
Some random text Text | yes | /
Some random text Text . | no | 
Text more text | yes | /
Text | no | 
Annotation Nb. | yes | *
Annotation 41 | no | **

* The bottom Annotation is alone in its AnnotationSet and therefore can be merged.
** The Annotations on each line have been grouped into their own AnnotationSets and are not merged.

In the rendered documentation, each Input spans two consecutive lines, and the Label of each Span is indicated by color: Label 1, Label 2, or NO LABEL/Below Label threshold.
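A hypothetical sketch of a few of these checks (not the SDK's actual implementation; the attributes page, label, line_index, and bbox used below are assumptions, and the training-set and AnnotationSet checks are omitted):
def can_merge_vertically(top, bottom):
    # rule: Spans are on the same Page
    if top.page is not bottom.page:
        return False
    # rule: both Spans have the same predicted Label
    if top.label is not bottom.label:
        return False
    # rule: Spans are on consecutive lines
    if bottom.line_index != top.line_index + 1:
        return False
    # rule: consecutive Spans overlap on the x-axis
    # (the end-of-line/beginning-of-line alternative is omitted here)
    return min(top.bbox.x1, bottom.bbox.x1) > max(top.bbox.x0, bottom.bbox.x0)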
Horizontal and Vertical Merge with the Paragraph and Sentence Tokenizers¶
When using the Paragraph or Sentence Tokenizer together with our Extraction AI model, we do not use the rule-based vertical and horizontal merge logic above; instead, we use the sentence/paragraph segmentation provided by the Tokenizer.
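For context, a hedged sketch of applying such a tokenizer to a Document (the import path konfuzio_sdk.tokenizer.paragraph_and_sentence and the ParagraphTokenizer constructor arguments are assumptions; treat the exact signatures as unverified):
from copy import deepcopy
from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.paragraph_and_sentence import ParagraphTokenizer  # assumed path
my_project = Project(id_=YOUR_PROJECT_ID)
document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
tokenizer = ParagraphTokenizer()  # assumed default configuration
# tokenize a copy so the original Annotations stay untouched
virtual_document = tokenizer.tokenize(deepcopy(document))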
The logic is as follows:
And here’s an illustrated example of the merge logic in action:
