Get Started

Install SDK

To test our SDK, you need an account on the Konfuzio Server, and you must initialize the package before using it. If you are using PyCharm, have a look at Quickstart with PyCharm.

1. Sign up in Konfuzio Server

Register for free in the Konfuzio Server.

2. Install konfuzio_sdk package

  • Install the Python package directly in your working directory with:

    pip install konfuzio_sdk

  • You can choose between the lightweight SDK and the full SDK with the AI-related components (the latter takes up more disk space). By default, the SDK is installed as a lightweight instance. To install the full instance (.[ai]), run the following command:

    pip install konfuzio_sdk[ai]

    Currently, the full instance cannot be installed on macOS machines with an ARM-based M-series chip; only the lightweight instance can be installed on these machines. However, the Konfuzio SDK can be used in a hosted environment such as Deepnote. Follow the instructions in section 5 of this document to install the SDK in a hosted Jupyter environment.


  • Supported Python versions are 3.8, 3.9, 3.10, and 3.11.

  • Please use Python 3.8 if you plan to upload your AIs to a self-hosted Konfuzio Server environment.

  • If you are not using a virtual environment, you may need to add the installation directory to your PATH.

  • If you run this tutorial in Colab and experience any version compatibility issues when working with the SDK, restart the runtime and initialize the SDK once again; this will resolve the issue.

3. Initialize the package

After the installation, initialize the package in your working directory with:

konfuzio_sdk init

This will require your credentials to access the Konfuzio Server. At the end, one file will be created in your working directory: .env.

The .env file contains the credentials to access the app and should not become public.

4. Download the data

To download the data from your Konfuzio project you need to specify the Project ID. You can check your Project ID by selecting the project in the Projects tab in the Web App. The ID of the Project is shown in the URL. Suppose that your Project ID is 123:

konfuzio_sdk export_project 123

The data from the documents that you uploaded in your Konfuzio project will be downloaded to a folder called data_123.

Note: Only Documents in the Training and Test sets are downloaded.

5. Install the .[ai] Konfuzio SDK in a hosted Jupyter environment

This procedure is not recommended, but documented here for completeness. Due to a limitation in Deepnote’s ability to receive input from the user, the SDK cannot be initialized in a Deepnote notebook. To work around this limitation, we need to install and authenticate access to the SDK in a Colab notebook and then import the credentials in the Deepnote project.

If you don’t have a Deepnote account, create one. Once you are logged in, create a new project and choose the Python 3.8 environment.

To change the environment, access the project settings by clicking on the gear icon at the bottom left of the page:

[Image: Deepnote environment settings]

Choose Python 3.8 from the dropdown menu:

[Image: Deepnote Python version]

Create a file called .env in the root of the Deepnote project. We will later use this file to store the credentials we obtained via the Colab notebook:

[Image: Deepnote create file]

To install the SDK in Deepnote, run the following commands in a new Notebook cell:

!git clone
!cd konfuzio-sdk && pip install .[ai]

If the installation does not complete successfully, restart the Deepnote notebook and run the same cell again. It is not necessary to run the previous cells again. To restart the notebook press the refresh arrow at the top of the page:

[Image: Deepnote refresh]

Once the status changes back to Ready, run the cell with the install command again.

As mentioned earlier, before we can use the SDK within Deepnote we need to obtain the authorization token using a Colab notebook instance. We now install the SDK in a Colab notebook and copy the credentials to the .env file in the Deepnote project.

To install the SDK in Colab, run the following commands in a new Colab notebook cell:

!pip install konfuzio_sdk[ai]
import konfuzio_sdk
!konfuzio_sdk init

Follow the instructions in the terminal to initialize the SDK. Once the SDK is initialized, a .env file will exist in your root Colab folder. This file contains the credentials to access the Konfuzio Server. Open it, copy the content, and paste it into the new file we created in the Deepnote project. The .env file should look like this:

KONFUZIO_USER = your@email
KONFUZIO_TOKEN = <40-char token>
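
Before relying on the copied file, it can help to confirm that both keys are present. The following is a minimal sketch and not part of the SDK: parse_env is a hypothetical helper that assumes the KEY = value layout shown above, and the sample values are placeholders.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple 'KEY = value' lines (hypothetical helper, not an SDK function)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Placeholder content that mirrors the layout shown above.
sample = "KONFUZIO_USER = your@email\nKONFUZIO_TOKEN = " + "x" * 40
env = parse_env(sample)

# Collect any expected key that is absent or empty.
missing = [key for key in ("KONFUZIO_USER", "KONFUZIO_TOKEN") if not env.get(key)]
print("missing keys:", missing)
```

In practice, you would pass the content of your own .env file to parse_env instead of the sample string.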

The Konfuzio SDK is now ready to be used in your Deepnote notebook. To test it, create a new cell and run:

from konfuzio_sdk.data import Project
PROJECT_ID = <your-project-id>
my_project = Project(id_=PROJECT_ID)

If no error is raised, the SDK is correctly installed and authenticated.


Your project ID can be obtained from the web app URL when accessing Konfuzio from your browser. From your home page, navigate to Projects and pick the project you want to work with. Then look at the URL in your browser. You should see something like <project-id>/change/ where <project-id> is your project ID.
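
If you prefer to extract the ID programmatically, the following sketch shows one possible approach. It assumes only that the URL ends with the numeric ID followed by /change/, as described above. The function name and the example URL (host and path prefix) are made up for illustration.

```python
import re
from typing import Optional


def project_id_from_url(url: str) -> Optional[int]:
    """Return the numeric ID that precedes '/change/' at the end of a URL, or None."""
    match = re.search(r"/(\d+)/change/?$", url)
    return int(match.group(1)) if match else None


# The host and path prefix below are invented; only the ending matters.
example_url = "https://example.invalid/some/path/123/change/"
project_id = project_id_from_url(example_url)
print(project_id)
```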

Example Usage

Make sure to set up your Project (so that you can retrieve the Project ID) using our Konfuzio Guide.


Retrieve all information available for your Project:

my_project = Project(id_=YOUR_PROJECT_ID)

The information will be stored in the folder that you defined to allocate the data in the package initialization. A subfolder will be created for each Document in the Project.

Every time there are changes in the Project in the Konfuzio Server, the local Project can be updated this way:

my_project.get(update=True)

To make sure that your Project is loaded with all the latest data:

my_project = Project(id_=YOUR_PROJECT_ID, update=True)


Every Document has a status indicating which stage of processing it is in. The Document status codes are:

  • Queuing for OCR: 0

  • Queuing for extraction: 1

  • Done: 2

  • Could not be processed: 111

  • OCR in progress: 10

  • Extraction in progress: 20

  • Queuing for categorization: 3

  • Categorization in progress: 30

  • Queuing for splitting: 4

  • Splitting in progress: 40

  • Waiting for splitting confirmation: 41
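
When logging or debugging, it can be handy to translate these codes into readable text. The lookup table below simply restates the list above; the names DOCUMENT_STATUS and describe_status are illustrative and are not part of the SDK.

```python
# Status codes and descriptions copied from the list above.
DOCUMENT_STATUS = {
    0: "Queuing for OCR",
    1: "Queuing for extraction",
    2: "Done",
    3: "Queuing for categorization",
    4: "Queuing for splitting",
    10: "OCR in progress",
    20: "Extraction in progress",
    30: "Categorization in progress",
    40: "Splitting in progress",
    41: "Waiting for splitting confirmation",
    111: "Could not be processed",
}


def describe_status(code: int) -> str:
    """Return the description for a status code, or a fallback for unknown codes."""
    return DOCUMENT_STATUS.get(code, f"Unknown status ({code})")


for code in (2, 10, 999):
    print(code, "->", describe_status(code))
```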

To access the Documents in the Project you can use:

documents = my_project.documents

By default, this returns the Documents with Training status (dataset_status = 2). The dataset status codes are:

  • None: 0

  • Preparation: 1

  • Training: 2

  • Test: 3

  • Excluded: 4
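
To avoid scattering bare numbers through your own scripts, you could give the dataset status codes names. This is a sketch only: the DatasetStatus enum is not provided by the SDK, and the (document id, status) pairs are invented to show the filtering.

```python
from enum import IntEnum


class DatasetStatus(IntEnum):
    """Dataset status codes as listed above (illustrative, not an SDK class)."""
    NONE = 0
    PREPARATION = 1
    TRAINING = 2
    TEST = 3
    EXCLUDED = 4


# Invented (document_id, dataset_status) pairs standing in for real Documents.
records = [(1, 2), (2, 3), (3, 2), (4, 0)]

# IntEnum members compare equal to plain integers, so raw codes can be filtered directly.
training_ids = [doc_id for doc_id, status in records if status == DatasetStatus.TRAINING]
print(training_ids)
```

Because IntEnum members behave like integers, a value such as DatasetStatus.TRAINING can be used anywhere the plain code 2 is expected.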

The Test Documents can be accessed directly by:

test_documents = my_project.test_documents

For more details, you can check out the Project documentation.

By default, you get four files for each Document, containing the text, Pages, Annotation Sets, and Annotations. You can see these files inside the Document folder.

document.txt - Contains the text of the Document. If OCR was used, it will correspond to the result from the OCR.

                                                            x02   328927/10103/00104
Abrechnung  der Brutto/Netto-Bezüge   für Dezember 2018                   22.05.2018 Bat:  1

Personal-Nr.  Geburtsdatum ski Faktor  Ki,Frbtr.Konfession  ‚Freibetragjährl.! |Freibetrag mt! |DBA  iGleitzone  'St.-Tg.  VJuUr. üb. |Url. Anspr. Url.Tg.gen.  |Resturlaub
00104 150356 1  |     ‚ev                              30     400  3000       3400

SV-Nummer       |Krankenkasse                       KK%®|PGRS Bars  jum.SV-Tg. Anw. Tage |UrlaubTage Krankh. Tg. Fehlz. Tage

50150356B581 AOK  Bayern Die Gesundheitskas 157 101 1111 1 30

                                             Eintritt   ‚Austritt     Anw.Std.  Urlaub Std.  |Krankh. Std. |Fehlz. Std.

                                             170299  L L       l     L     l     l
 -                                       +  Steuer-ID       IMrB?       Zeitlohn Sta.|Überstd.  |Bez. Sta.
   12345 Testort                                   12345678911           ı     ı     \
               Pers.-Nr.  00104        x02
               Abt.-Nr. A12         10103          HinweisezurAbrechnung

pages.json5 - Contains information of each Page of the Document (for example, their ids and sizes).

    "id": 1923,
    "image": "/page/show/1923/",
    "number": 1,
    "original_size": [
    "size": [

annotation_sets.json5 - Contains information of each section in the Document (for example, their ids and Label Sets).

[
  {
    "id": 78730,
    "position": 1,
    "section_label": 63
  },
  {
    "id": 292092,
    "position": 1,
    "section_label": 64
  }
]

annotations.json5 - Contains information of each Annotation in the Document (for example, their Labels and Bounding Boxes).

[
  {
    "accuracy": null,
    "bbox": {
      "bottom": 44.369,
      "line_index": 1,
      "page_index": 0,
      "top": 35.369,
      "x0": 468.48,
      "x1": 527.04,
      "y0": 797.311,
      "y1": 806.311
    },
    "bboxes": [
      {
        "bottom": 44.369,
        "end_offset": 169,
        "line_number": 2,
        "offset_string": "22.05.2018",
        "offset_string_original": "22.05.2018",
        "page_index": 0,
        "start_offset": 159,
        "top": 35.369,
        "x0": 468.48,
        "x1": 527.04,
        "y0": 797.311,
        "y1": 806.311
      }
    ],
    "created_by": 59,
    "custom_offset_string": false,
    "end_offset": 169,
    "get_created_by": "[email protected]",
    "get_revised_by": "n/a",
    "id": 4419937,
    "is_correct": true,
    "label": 867,
    "label_data_type": "Date",
    "label_text": "Austellungsdatum",
    "label_threshold": 0.1,
    "normalized": "2018-05-22",
    "offset_string": "22.05.2018",
    "offset_string_original": "22.05.2018",
    "revised": false,
    "revised_by": null,
    "section": 78730,
    "section_label_id": 63,
    "section_label_text": "Lohnabrechnung",
    "selection_bbox": {
      "bottom": 44.369,
      "line_index": 1,
      "page_index": 0,
      "top": 35.369,
      "x0": 468.48,
      "x1": 527.04,
      "y0": 797.311,
      "y1": 806.311
    },
    "start_offset": 159,
    "translated_string": null
  }
]

When needed, upon calling document.get_bbox(), an additional file containing the Bounding Box information of the characters of the Document will be downloaded to the Document folder. This file can be quite large, so it is compressed in the Zip format. The decompressed file is a JSON file where the keys correspond to the indices of the characters in the Document text. The value associated with each key contains the Bounding Box information of the character. For example, for characters 1000 and 1002 we would have:

{
  "1000": {
    "adv": 2.58,
    "bottom": 128.13,
    "doctop": 118.13,
    "fontname": "GlyphLessFont",
    "height": 10.0,
    "line_number": 14,
    "object_type": "char",
    "page_number": 1,
    "size": 10.0,
    "text": "n",
    "top": 118.13,
    "upright": 1,
    "width": 2.58,
    "x0": 481.74,
    "x1": 484.32,
    "y0": 713.55,
    "y1": 723.55
  },
  "1002": {
    "adv": 2.64,
    "bottom": 128.13,
    "doctop": 118.13,
    "fontname": "GlyphLessFont",
    "height": 10.0,
    "line_number": 14,
    "object_type": "char",
    "page_number": 1,
    "size": 10.0,
    "text": "S",
    "top": 118.13,
    "upright": 1,
    "width": 2.64,
    "x0": 486.72,
    "x1": 489.36,
    "y0": 713.55,
    "y1": 723.55
  },
  // ...
}
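
The numbers in the excerpts above are related to each other, and a few quick checks make those relationships visible. The sketch below copies values from the annotation excerpt and from the entry for character 1000. It assumes, based only on these sample values, that offsets are zero-based, end-exclusive indices into the Document text and that top/bottom and y0/y1 measure the same edges from opposite sides of the page. The text variable is a stand-in, not the real document.txt.

```python
import math

# Values copied from the annotation excerpt above.
annotation = {
    "start_offset": 159,
    "end_offset": 169,
    "offset_string": "22.05.2018",
    "bbox": {"top": 35.369, "bottom": 44.369, "y0": 797.311, "y1": 806.311},
}
# Values copied from the entry for character 1000 above.
char_1000 = {"x0": 481.74, "x1": 484.32, "width": 2.58,
             "top": 118.13, "bottom": 128.13, "height": 10.0}

# Stand-in text: filler up to offset 159, then the annotated string.
text = "." * annotation["start_offset"] + annotation["offset_string"] + " ..."
span = text[annotation["start_offset"]:annotation["end_offset"]]
offsets_match = span == annotation["offset_string"]

# If the two vertical axes run in opposite directions, both sums give
# the same value (the page height).
box = annotation["bbox"]
same_page_height = math.isclose(box["top"] + box["y1"], box["bottom"] + box["y0"])

# Width and height should follow from the corner coordinates.
width_matches = math.isclose(char_1000["x1"] - char_1000["x0"], char_1000["width"])
height_matches = math.isclose(char_1000["bottom"] - char_1000["top"], char_1000["height"])

print(offsets_match, same_page_height, width_matches, height_matches)
```

The comparisons use math.isclose rather than == because the coordinates are floating-point values.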

After downloading these files, their paths will become available in the Project instance.
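
Once the compressed Bounding Box file is available locally, reading it could look like the sketch below. The helper load_char_bboxes is hypothetical and not an SDK function. It assumes the archive member is named bbox.json5 (the name mentioned in the download section of this guide) and that its content can be parsed by Python's standard json module. The demonstration builds a tiny archive in memory instead of using a real download.

```python
import io
import json
import zipfile


def load_char_bboxes(zip_source, member="bbox.json5"):
    """Read the character Bounding Boxes from a zip archive (path or file object)."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as handle:
            # Assumes the member holds plain JSON keyed by character index.
            return json.loads(handle.read().decode("utf-8"))


# Build a small in-memory archive with one entry, so the example runs
# without any downloaded data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("bbox.json5", json.dumps({"1000": {"text": "n", "page_number": 1}}))
buffer.seek(0)

bboxes = load_char_bboxes(buffer)
char = bboxes["1000"]  # keys are character indices stored as strings
print(char["text"], char["page_number"])
```

With a real download, you would pass the path of the zip file in the Document folder instead of the in-memory buffer.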

You can get the path to the folder containing the Documents’ folders with:


And you can get the path to the file with the Document text with:


Upload Document

Before you can upload a new file to your Project using the Konfuzio SDK, you must have completed the following steps:

  1. Register for a Konfuzio account

  2. Create a Project on Konfuzio

  3. Install the Konfuzio SDK

For detailed instructions on these preliminary steps, refer to the Get Started guide above.

After completing the above steps, you can proceed with uploading a new file to your Project using the Konfuzio SDK. The file must be of one of the types listed in the Supported File Types. Here, we’re focusing on the Document.from_file method to create a Konfuzio Document.

A Konfuzio Document is an object representing the file you upload. It will contain the OCR (Optical Character Recognition) information of the file once processed by Konfuzio’s server.

Synchronous and Asynchronous Upload

You have two options for uploading your file: a synchronous method and an asynchronous method. The method is determined by the sync parameter in the from_file method.

  1. Synchronous upload (sync=True): The file is uploaded to the Konfuzio servers, and the method waits for the file to be processed. Once done, it returns a Document object with the OCR information. This is useful if you want to start working with the Document immediately after the OCR processing is completed.

    Here’s an example of how to use the from_file method with sync set to True:

document = Document.from_file(FILE_PATH, project=my_project, sync=True)

  2. Asynchronous upload (sync=False): With this setting, the method immediately returns an empty Document object after initiating the upload. The OCR processing takes place in the background. This method is advantageous when uploading a large file or a large number of files, as it doesn’t require waiting for each file’s processing to complete.

    Here is how to use the asynchronous method:

document = Document.from_file(FILE_PATH, project=my_project, sync=False)

After asynchronous upload, you can check the status of the Document processing using the document.update() method on the returned Document object. If the Document is ready, this method will update the Document object with the OCR information.

It’s important to note that if the Document is not ready, you may need to call document.update() again at a later time. This could be done manually or by setting up a looping mechanism depending on your application’s workflow.

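To check whether the Document is ready and update it with the OCR information, you can implement a custom polling strategy like this:

import time
for i in range(5):
    document.update()  # refresh the local Document from the server
    if document.ocr_ready is True:
        break  # the OCR information is now available
    time.sleep(i * 10 + 3)  # wait a little longer after each attempt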
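
If you need this pattern in several places, the loop can be wrapped in a small reusable function. The sketch below is one possible shape, not an SDK utility: wait_until and its parameters are illustrative names, and the readiness check is passed in as a function so that the example runs without a server. It keeps the same i * 10 + 3 second wait schedule as above.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool], attempts: int = 5,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call is_ready up to `attempts` times, sleeping i * 10 + 3 seconds between tries."""
    for i in range(attempts):
        if is_ready():
            return True
        if i < attempts - 1:  # no need to wait after the final attempt
            sleep(i * 10 + 3)
    return False


# Demonstration with a fake check that succeeds on the third call and a
# recording "sleep", so the example finishes instantly.
calls = {"count": 0}
waits = []


def fake_check() -> bool:
    calls["count"] += 1
    return calls["count"] >= 3


result = wait_until(fake_check, sleep=waits.append)
print(result, waits)
```

With the SDK, is_ready would be a function that calls document.update() and then returns document.ocr_ready. Unlike the plain loop, this version skips the wait after the last attempt.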

For a more sophisticated way to handle asynchronously uploaded Documents using a callback function, you can check out our tutorial on how to use ngrok to receive callbacks from the Konfuzio Server.

Timeout Parameter

When making a server request, there’s a default timeout value of 2 minutes. This means that if the server doesn’t respond within 2 minutes, the operation will stop waiting for a response and return an error. If you’re uploading a larger file, it might take more time to process, and the default timeout value might not be sufficient. In such a case, you can increase the timeout by setting the timeout parameter to a higher value.

document = Document.from_file(FILE_PATH, project=my_project, timeout=300, sync=True)  # 300 seconds timeout

Modify Document

If you would like to use the SDK to modify a Document’s metadata, such as the dataset status or the assignee, you can do it like this:

document.assignee = ASSIGNEE_ID
document.dataset_status = 2


Update Document

If there are changes in the Document in the Konfuzio Server, you can update your local version of the Document with:

document.update()

If a Document is part of the Training or Test set, you can also update it by updating the entire Project via project.get(update=True). However, for Projects with many Documents it can be faster to update only the relevant Documents.

Download PDFs

To get the PDFs of the Documents, you can use get_file().

for document in my_project.documents:
    document.get_file()

This will download the OCR version of the Document which contains the text, the Bounding Boxes information of the characters and the image of the Document.

In the Document folder, you will see a new file with the original name followed by “_ocr”.

If you want the original version of the Document (without OCR), you can use ocr_version=False.

for document in my_project.documents:
    document.get_file(ocr_version=False)

In the Document folder, you will see a new file with the original name.

Download pages as images

To get the Pages of the Document as png images, you can use get_images().

for document in my_project.documents:
    document.get_images()

You will get one png image named “page_number_of_page.png” for each Page in the Document.

Download bounding boxes of the characters

To get the Bounding Boxes information of the characters, you can use get_bbox().

for document in my_project.documents:
    document.get_bbox()

You will get a zip file in the Document folder that contains the “bbox.json5” file. You can find the path to the zip file in the Document instance with:


Delete Document

To locally delete a Document, you can use:

document.delete()

The Document will be deleted from your local data folder, but it will remain in the Konfuzio Server. If you want to get it again you can update the Project.

If you would like to delete a Document in the remote server, you can use the Document.delete method with the delete_online setting set to True. You can only delete Documents with a dataset status of None (0). Be careful! Once the Document is deleted online, we will have no way of recovering it.

document.delete(delete_online=True)

If delete_online is set to False (the default), the Document will only be deleted on your local machine, and will be reloaded next time you load the Project, or if you run the Project.init_or_update_document method directly.
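
Because online deletion is irreversible and restricted to Documents with a dataset status of None (0), you may want a small guard in your own scripts before deleting. The sketch below only illustrates that rule. The function name and the (document id, dataset status) pairs are invented, and no SDK call is made.

```python
def can_delete_online(dataset_status: int) -> bool:
    """Return True only for dataset status None (0), per the rule above."""
    return dataset_status == 0


# Invented (document_id, dataset_status) pairs standing in for real Documents.
candidates = [(11, 0), (12, 2), (13, 3), (14, 0)]

# Keep only the Documents that the rule allows to be deleted online.
deletable_ids = [doc_id for doc_id, status in candidates if can_delete_online(status)]
print(deletable_ids)
```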