Code Examples

Example Usage


Retrieve all information available for your project:

my_project = Project(id_=YOUR_PROJECT_ID)

The data is stored in the data folder that you defined during the package initialization. A subfolder is created for each document in the project.

Whenever the project changes on the Konfuzio Server, you can update the local project with:

my_project.update()

To access the documents in the project you can use:

documents = my_project.get_documents_by_status()

By default, this returns the documents without a dataset status (dataset_status = 0, i.e. None). To select other statuses, pass their codes in the ‘dataset_statuses’ argument. The status codes are:

  • None: 0

  • Preparation: 1

  • Training: 2

  • Test: 3

  • Excluded: 4

For example, to get all documents in the project, you can do:

documents = my_project.get_documents_by_status(dataset_statuses=[0, 1, 2, 3, 4])
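The numeric codes are easy to mix up, so one option is to give them names in your own code. The following is a minimal sketch: `DatasetStatus` is a hypothetical helper defined here, not part of the Konfuzio SDK; only the codes themselves come from the list above.

```python
from enum import IntEnum


class DatasetStatus(IntEnum):
    """Names for the dataset status codes listed above (hypothetical helper)."""

    NONE = 0
    PREPARATION = 1
    TRAINING = 2
    TEST = 3
    EXCLUDED = 4


# Plain integer codes, matching the form used for 'dataset_statuses' above.
all_statuses = [int(status) for status in DatasetStatus]
train_and_test = [int(DatasetStatus.TRAINING), int(DatasetStatus.TEST)]

print(all_statuses)
print(train_and_test)
```

With such a helper, the list passed as dataset_statuses can be built from names instead of bare numbers.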

The training and test documents can be accessed directly by:

training_documents = my_project.documents
test_documents = my_project.test_documents

By default, you get four files for each document, containing its text, pages, annotation sets and annotations. You can find these files inside the document folder.
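To check that a document folder is complete, a small script can look for the four file names described below. This is a sketch under assumptions: `missing_files` is a hypothetical helper, and the temporary folder only stands in for a real document folder.

```python
import tempfile
from pathlib import Path

# File names as listed in this section.
EXPECTED_FILES = ("document.txt", "pages.json5", "annotation_sets.json5", "annotations.json5")


def missing_files(document_folder):
    """Return the expected file names that are not present in the folder."""
    folder = Path(document_folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


# Demonstration on a throwaway folder that stands in for a document folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "document.txt").write_text("sample text", encoding="utf-8")
    (Path(tmp) / "pages.json5").write_text("[]", encoding="utf-8")
    still_missing = missing_files(tmp)

print(still_missing)
```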

document.txt - Contains the text of the document. If OCR was used, it will correspond to the result from the OCR.

                                                            x02   328927/10103/00104
Abrechnung  der Brutto/Netto-Bezüge   für Dezember 2018                   22.05.2018 Bat:  1

Personal-Nr.  Geburtsdatum ski Faktor  Ki,Frbtr.Konfession  ‚Freibetragjährl.! |Freibetrag mt! |DBA  iGleitzone  'St.-Tg.  VJuUr. üb. |Url. Anspr. Url.Tg.gen.  |Resturlaub
00104 150356 1  |     ‚ev                              30     400  3000       3400

SV-Nummer       |Krankenkasse                       KK%®|PGRS Bars  jum.SV-Tg. Anw. Tage |UrlaubTage Krankh. Tg. Fehlz. Tage

50150356B581 AOK  Bayern Die Gesundheitskas 157 101 1111 1 30

                                             Eintritt   ‚Austritt     Anw.Std.  Urlaub Std.  |Krankh. Std. |Fehlz. Std.

                                             170299  L L       l     L     l     l
 -                                       +  Steuer-ID       IMrB?       Zeitlohn Sta.|Überstd.  |Bez. Sta.
   12345 Testort                                   12345678911           ı     ı     \
               Pers.-Nr.  00104        x02
               Abt.-Nr. A12         10103          HinweisezurAbrechnung

pages.json5 - Contains information about each page of the document (for example, their ids and sizes).

    "id": 1923,
    "image": "/page/show/1923/",
    "number": 1,
    "original_size": [
    "size": [

annotation_sets.json5 - Contains information about each section in the document (for example, their ids and label sets).

  {
    "id": 78730,
    "position": 1,
    "section_label": 63
  },
  {
    "id": 292092,
    "position": 1,
    "section_label": 64
  }

annotations.json5 - Contains information about each annotation in the document (for example, their labels and bounding boxes).

  {
    "accuracy": null,
    "bbox": {
      "bottom": 44.369,
      "line_index": 1,
      "page_index": 0,
      "top": 35.369,
      "x0": 468.48,
      "x1": 527.04,
      "y0": 797.311,
      "y1": 806.311
    },
    "bboxes": [
      {
        "bottom": 44.369,
        "end_offset": 169,
        "line_number": 2,
        "offset_string": "22.05.2018",
        "offset_string_original": "22.05.2018",
        "page_index": 0,
        "start_offset": 159,
        "top": 35.369,
        "x0": 468.48,
        "x1": 527.04,
        "y0": 797.311,
        "y1": 806.311
      }
    ],
    "created_by": 59,
    "custom_offset_string": false,
    "end_offset": 169,
    "get_created_by": "[email protected]",
    "get_revised_by": "n/a",
    "id": 4419937,
    "is_correct": true,
    "label": 867,
    "label_data_type": "Date",
    "label_text": "Austellungsdatum",
    "label_threshold": 0.1,
    "normalized": "2018-05-22",
    "offset_string": "22.05.2018",
    "offset_string_original": "22.05.2018",
    "revised": false,
    "revised_by": null,
    "section": 78730,
    "section_label_id": 63,
    "section_label_text": "Lohnabrechnung",
    "selection_bbox": {
      "bottom": 44.369,
      "line_index": 1,
      "page_index": 0,
      "top": 35.369,
      "x0": 468.48,
      "x1": 527.04,
      "y0": 797.311,
      "y1": 806.311
    },
    "start_offset": 159,
    "translated_string": null
  }

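In the annotation sample, end_offset minus start_offset equals the length of offset_string, which suggests that the offsets can be used as a Python slice into the document text. The sketch below checks this on a synthetic text (not the real document.txt) and computes the size of the bounding box; the slicing interpretation is an assumption drawn from the sample, not a documented guarantee.

```python
import json

# A few fields from the annotation sample above (this excerpt is plain JSON,
# so the standard json module is enough here).
annotation = json.loads("""
{
  "start_offset": 159,
  "end_offset": 169,
  "offset_string": "22.05.2018",
  "bbox": {"top": 35.369, "bottom": 44.369, "x0": 468.48, "x1": 527.04,
           "y0": 797.311, "y1": 806.311}
}
""")

# Synthetic stand-in for document.txt: filler up to the start offset, then the date.
text = "x" * annotation["start_offset"] + "22.05.2018" + " Bat:  1"

# Slicing with the offsets should give back the annotated string.
span = text[annotation["start_offset"]:annotation["end_offset"]]
offsets_match = span == annotation["offset_string"]

# Size of the bounding box in the units used by the sample.
bbox = annotation["bbox"]
width = round(bbox["x1"] - bbox["x0"], 3)
height = round(bbox["bottom"] - bbox["top"], 3)

print(span, offsets_match, width, height)
```
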
Download PDFs

To get the PDFs of the documents, you can use get_file().

for document in my_project.documents:
    document.get_file()

This downloads the OCR version of the document, which contains the text, the bounding box information of the characters and the image of the document.

In the document folder, you will see a new file with the original name followed by “_ocr”.

If you want the original version of the document (without OCR), you can use ocr_version=False.

for document in my_project.documents:
    document.get_file(ocr_version=False)

In the document folder, you will see a new file with the original name.

Download pages as images

To get the pages of the document as png images, you can use get_images().

for document in my_project.documents:
    document.get_images()

You will get one png image named “page_number_of_page.png” for each page in the document.

Download bounding boxes of the characters

To get the bounding box information of the characters, you can use get_bbox().

for document in my_project.documents:
    document.get_bbox()

You will get a file named “bbox.json5”.

After downloading these files, the paths to them will also become available in the project instance. For example, you can get the path to the file with the document text with:


Update Document

If the document changes on the Konfuzio Server, you can update the local document with:

document.update()
If a document is part of the training or test set, you can also update it by updating the entire project via project.update(). However, for projects with many documents it can be faster to update only the relevant documents.

Upload Document

You can upload a document via the SDK. Create a Document instance and save it; the document is then uploaded to the Konfuzio Server.

document = Document(file_path=<path_to_the_file>, project=my_project)
document.save()

By default, the document is uploaded with the dataset status “None”. If there is only one category in the project, the document will assume that category. If there is more than one category in the project, the document is uploaded without a category.
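The category rule described above can be written down as a small function. This is only an illustration of the stated behaviour, not code from the SDK or the server: `default_category` is a hypothetical name, the ids are made up, and the result for a project without categories is an assumption.

```python
from typing import Optional, Sequence


def default_category(category_ids: Sequence[int]) -> Optional[int]:
    """Category an upload would get when none is given, per the rule above."""
    if len(category_ids) == 1:
        # Exactly one category in the project: the document assumes it.
        return category_ids[0]
    # Several categories (or, by assumption, none): no category is set.
    return None


single = default_category([63])
several = default_category([63, 64])
print(single, several)
```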

You can specify both of these parameters when you upload the document by passing the corresponding dataset status code (see the list of status codes above) and the ID of the category.

document = Document(file_path=<path_to_the_file>, project=my_project,
                    dataset_status=<dataset_status_code>, category_template=<category_id>)

Modify Document

The dataset status and the category of a document can be modified after the document is uploaded. To change the category, select the desired category from the project by its ID and assign it to the document.

category = my_project.get_category_by_id(<category_id>)

document.category = category
document.dataset_status = 2

Delete Document

To locally delete a document, you can use:


The document is deleted from your local data folder, but it remains on the Konfuzio Server. To get it again, update the project.

Create Regex-based Annotations

Let’s look at a simple example of how to use the konfuzio_sdk package to get information on a project and to post annotations.

You can follow the example below to post annotations of a certain word or expression in the first document uploaded.

import re

from konfuzio_sdk.data import Project, Annotation, Label

my_project = Project(id_=YOUR_PROJECT_ID)

# Word/expression to annotate in the document
# should match an existing one in your document
input_expression = "John Smith"

# Label for the annotation
label_name = "Name"
# Creation of the Label in the project default label set
my_label = Label(my_project, text=label_name)
# Saving it online
my_label.save()

# Label Set where label belongs
label_set_id = my_label.label_sets[0].id_

# First document in the project
document = my_project.documents[0]

# Matches of the word/expression in the document
matches_locations = [(m.start(0), m.end(0)) for m in re.finditer(input_expression, document.text)]

# List to save the links to the annotations created
new_annotations_links = []

# Create annotation for each match
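for offsets in matches_locations:
    # The argument names and get_link() below are assumed;
    # check them against your SDK version
    annotation_obj = Annotation(
        document=document,
        start_offset=offsets[0],
        end_offset=offsets[1],
        label=my_label,
        label_set_id=label_set_id,
    )
    new_annotation_added = annotation_obj.save()
    if new_annotation_added:
        new_annotations_links.append(annotation_obj.get_link())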
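The offset search in the example passes the expression to re.finditer as a regular expression, so characters such as "." are treated as wildcards. The sketch below shows the same search on an illustrative text with no SDK involved; `find_offsets` is a hypothetical helper that escapes the expression when it is meant literally.

```python
import re

# Illustrative text; in the example above this would be document.text.
text = "Employee: John Smith. Approved by John Smith on 22.05.2018."


def find_offsets(expression, text, literal=True):
    """Return (start, end) offsets of every match of expression in text."""
    pattern = re.escape(expression) if literal else expression
    return [(m.start(0), m.end(0)) for m in re.finditer(pattern, text)]


name_offsets = find_offsets("John Smith", text)
# Without escaping, the dots in "22.05.2018" would match any character.
date_offsets = find_offsets("22.05.2018", text)

print(name_offsets)
print(date_offsets)
```

Each pair can be used directly as a slice, so text[start:end] gives back the matched expression.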


Retrain Flair NER-Ontonotes-Fast with Human Revised Annotations

The tutorial Retrain Flair NER-Ontonotes-Fast with Human Revised Annotations shows how to use the Konfuzio SDK package to include an easy feedback workflow in your training pipeline. It also gives an example of how you can use open-source models to speed up the annotation process and use the feedback workflow to adapt the domain knowledge of the model to your needs.

You can open it in Colab or download it and try it yourself.

Count Relevant Expressions in Annual Reports

The tutorial Count Relevant Expressions in Annual Reports shows how to use the Konfuzio SDK package to retrieve structured and organized information that can be used for a deeper analysis and understanding of your data. It shows how to identify and count pre-specified expressions in documents and how to collect that information.

You can open it in Colab or download it and try it yourself.
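The idea of counting pre-specified expressions can be sketched without the SDK. The code below is not taken from the tutorial: the texts and expressions are made up, and `count_expressions` is a hypothetical helper that counts case-insensitive, whole-word matches.

```python
import re
from collections import Counter

# Illustrative stand-ins for document texts; with the SDK these could come
# from the text of each document in a project.
texts = [
    "Revenue grew while net income fell. Revenue guidance was raised.",
    "Net income and revenue were both stable.",
]
expressions = ["revenue", "net income", "dividend"]


def count_expressions(texts, expressions):
    """Count case-insensitive, whole-word occurrences of each expression."""
    counts = Counter({expression: 0 for expression in expressions})
    for text in texts:
        for expression in expressions:
            pattern = r"\b" + re.escape(expression) + r"\b"
            counts[expression] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


counts = count_expressions(texts, expressions)
print(dict(counts))
```

Expressions that never occur stay in the result with a count of zero, which keeps the output shape the same for every set of documents.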