Get Started

Install SDK

To try the SDK, you need an account on the Konfuzio Server and must initialize the package before using it. If you are using PyCharm, have a look at Quickstart with PyCharm.

1. Sign up in Konfuzio Server

Register for free in the Konfuzio Server.

2. Install konfuzio_sdk package

  • Install the Python package directly in your working directory with:

    pip install konfuzio_sdk

Notes:

  • Supported Python versions are 3.8, 3.9, 3.10 and 3.11.

  • Please use Python 3.8 if you plan to upload your AIs to a self-hosted Konfuzio Server environment.

  • If you are not using a virtual environment, you may need to add the installation directory to your PATH.

3. Initialize the package

After the installation, initialize the package in your working directory with:

konfuzio_sdk init

This will require your credentials to access the Konfuzio Server. At the end, one file will be created in your working directory: .env.

The .env file contains the credentials to access the app and should not become public.

4. Download the data

To download the data from your Konfuzio project you need to specify the Project ID. You can check your Project ID by selecting the project in the Projects tab in the Web App. The ID of the Project is shown in the URL. Suppose that your Project ID is 123:

konfuzio_sdk export_project 123

The data from the documents that you uploaded in your Konfuzio project will be downloaded to a folder called data_123.

Note: Only Documents in the Training and Test sets are downloaded.

Example Usage

Make sure to set up your Project (so that you can retrieve the Project ID) using our Konfuzio Guide.

Project

Retrieve all information available for your Project:

from konfuzio_sdk.data import Project
my_project = Project(id_=YOUR_PROJECT_ID)

The information will be stored in the folder that you defined to allocate the data in the package initialization. A subfolder will be created for each Document in the Project.

Every time that there are changes in the Project in the Konfuzio Server, the local Project can be updated this way:

my_project.get(update=True)

To make sure that your Project is loaded with all the latest data:

my_project = Project(id_=YOUR_PROJECT_ID, update=True)

Documents

To access the Documents in the Project you can use:

documents = my_project.documents

By default, this returns the Documents with Training status (dataset_status = 2). The status codes are:

  • None: 0

  • Preparation: 1

  • Training: 2

  • Test: 3

  • Excluded: 4
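The status codes above can be wrapped in a small enum so that scripts do not rely on bare numbers. This is only a sketch: the names DatasetStatus and status_name are hypothetical helpers and not part of the SDK; only the numeric codes are taken from the list above.

```python
from enum import IntEnum


class DatasetStatus(IntEnum):
    """Hypothetical helper; only the numeric codes come from the list above."""

    NONE = 0
    PREPARATION = 1
    TRAINING = 2
    TEST = 3
    EXCLUDED = 4


def status_name(code: int) -> str:
    """Return a readable name for a dataset status code."""
    return DatasetStatus(code).name.capitalize()
```

Because IntEnum members compare equal to plain integers, a value such as DatasetStatus.TRAINING could be used wherever the text above uses the number 2.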

The Test Documents can be accessed directly by:

test_documents = my_project.test_documents

For more details, you can check out the Project documentation.

By default, you get four files for each Document. They contain the text, the Pages, the Annotation Sets and the Annotations. You can find these files inside the Document folder.

document.txt - Contains the text of the Document. If OCR was used, it will correspond to the result from the OCR.

                                                            x02   328927/10103/00104
Abrechnung  der Brutto/Netto-Bezüge   für Dezember 2018                   22.05.2018 Bat:  1

Personal-Nr.  Geburtsdatum ski Faktor  Ki,Frbtr.Konfession  ‚Freibetragjährl.! |Freibetrag mt! |DBA  iGleitzone  'St.-Tg.  VJuUr. üb. |Url. Anspr. Url.Tg.gen.  |Resturlaub
00104 150356 1  |     ‚ev                              30     400  3000       3400

SV-Nummer       |Krankenkasse                       KK%®|PGRS Bars  jum.SV-Tg. Anw. Tage |UrlaubTage Krankh. Tg. Fehlz. Tage

50150356B581 AOK  Bayern Die Gesundheitskas 157 101 1111 1 30

                                             Eintritt   ‚Austritt     Anw.Std.  Urlaub Std.  |Krankh. Std. |Fehlz. Std.

                                             170299  L L       l     L     l     l
 -                                       +  Steuer-ID       IMrB?       Zeitlohn Sta.|Überstd.  |Bez. Sta.
  Teststraße123
   12345 Testort                                   12345678911           ı     ı     \
                               B/N
               Pers.-Nr.  00104        x02
               Abt.-Nr. A12         10103          HinweisezurAbrechnung

pages.json5 - Contains information of each Page of the Document (for example, their ids and sizes).

[
  {
    "id": 1923,
    "image": "/page/show/1923/",
    "number": 1,
    "original_size": [
      595.2,
      841.68
    ],
    "size": [
      1414,
      2000
    ]
  }
]
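As an illustration of how the two size fields might be used together, the sketch below computes the scale factor between them for each Page. It assumes that "original_size" is the size of the source page and "size" is the size of the rendered Page image; the source does not state this explicitly. The sample parses with the standard json module because it is plain JSON; a file that uses JSON5-only syntax would need a JSON5 parser instead. The function name page_scales is hypothetical.

```python
import json

# Sample content copied from the pages.json5 example above.
PAGES_JSON = """
[
  {
    "id": 1923,
    "image": "/page/show/1923/",
    "number": 1,
    "original_size": [595.2, 841.68],
    "size": [1414, 2000]
  }
]
"""


def page_scales(pages_text: str) -> dict:
    """Map each page number to (x_scale, y_scale) between "size" and "original_size"."""
    scales = {}
    for page in json.loads(pages_text):
        orig_w, orig_h = page["original_size"]
        img_w, img_h = page["size"]
        scales[page["number"]] = (img_w / orig_w, img_h / orig_h)
    return scales


scales = page_scales(PAGES_JSON)
```

For the sample Page, both factors come out at roughly 2.38, so the image keeps the aspect ratio of the original page.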

annotation_sets.json5 - Contains information about each Annotation Set in the Document (for example, their ids and Label Sets). In the file, an Annotation Set is referred to as a section.

[
  {
    "id": 78730,
    "position": 1,
    "section_label": 63
  },
  {
    "id": 292092,
    "position": 1,
    "section_label": 64
  }
]

annotations.json5 - Contains information of each Annotation in the Document (for example, their Labels and Bounding Boxes).

[
  {
    "accuracy": null,
    "bbox": {
      "bottom": 44.369,
      "line_index": 1,
      "page_index": 0,
      "top": 35.369,
      "x0": 468.48,
      "x1": 527.04,
      "y0": 797.311,
      "y1": 806.311
    },
    "bboxes": [
      {
        "bottom": 44.369,
        "end_offset": 169,
        "line_number": 2,
        "offset_string": "22.05.2018",
        "offset_string_original": "22.05.2018",
        "page_index": 0,
        "start_offset": 159,
        "top": 35.369,
        "x0": 468.48,
        "x1": 527.04,
        "y0": 797.311,
        "y1": 806.311
      }
    ],
    "created_by": 59,
    "custom_offset_string": false,
    "end_offset": 169,
    "get_created_by": "[email protected]",
    "get_revised_by": "n/a",
    "id": 4419937,
    "is_correct": true,
    "label": 867,
    "label_data_type": "Date",
    "label_text": "Austellungsdatum",
    "label_threshold": 0.1,
    "normalized": "2018-05-22",
    "offset_string": "22.05.2018",
    "offset_string_original": "22.05.2018",
    "revised": false,
    "revised_by": null,
    "section": 78730,
    "section_label_id": 63,
    "section_label_text": "Lohnabrechnung",
    "selection_bbox": {
      "bottom": 44.369,
      "line_index": 1,
      "page_index": 0,
      "top": 35.369,
      "x0": 468.48,
      "x1": 527.04,
      "y0": 797.311,
      "y1": 806.311
    },
    "start_offset": 159,
    "translated_string": null
  },
...
]
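The sample Annotation stores each vertical position twice. In this sample, the numbers are consistent with "top" and "bottom" being measured from the top edge of the Page and "y0" and "y1" from the bottom edge, using the Page height from "original_size". This is an observation from the sample values, not a documented guarantee. The sketch below checks that relation and also compares the offset span with the length of "offset_string". The function name flip_vertical is hypothetical.

```python
# Values copied from the annotations.json5 and pages.json5 samples above.
PAGE_HEIGHT = 841.68  # second entry of "original_size" for the sample Page
annotation = {
    "bbox": {"top": 35.369, "bottom": 44.369, "y0": 797.311, "y1": 806.311},
    "start_offset": 159,
    "end_offset": 169,
    "offset_string": "22.05.2018",
}


def flip_vertical(top: float, bottom: float, page_height: float) -> tuple:
    """Convert top/bottom (from the page top) to y0/y1 (from the page bottom)."""
    return page_height - bottom, page_height - top


bbox = annotation["bbox"]
y0, y1 = flip_vertical(bbox["top"], bbox["bottom"], PAGE_HEIGHT)

# Number of characters covered by the Annotation in the Document text.
span_length = annotation["end_offset"] - annotation["start_offset"]
```

With the sample values, the computed y0 and y1 match the stored ones up to floating-point rounding, and the span length equals the number of characters in "offset_string".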

Upload Document

To upload a new file (see Supported File Types) in your Project using the SDK, you can use the Document.from_file method. You can choose to create the Document in a synchronous or asynchronous way. The synchronous way will wait for the Document to be processed and return a ready Document. The asynchronous way will only return an empty Document object which you can use to check the status of the processing. Simply call document.update() to check if the Document is ready and the OCR processing is done.

If you want to upload a file and start working with it as soon as the OCR processing step is done, we recommend using from_file with sync set to True, as it will wait for the Document to be processed and then return a ready Document. Beware: this may take from a few seconds to over a minute, depending on the size of the file.

from konfuzio_sdk.data import Document
document = Document.from_file(FILE_PATH, project=my_project, sync=True)

If, however, you are uploading a large number of files and don’t want to wait for them to be processed, you can use the asynchronous option, which returns an empty Document object. You can then use the update method to check if the Document is ready and the OCR processing is done.

document = Document.from_file(FILE_PATH, project=my_project, sync=False)

Once the OCR process is done, you can get the Document OCR results with:

document.update()
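For asynchronous uploads, the check can be repeated in a loop. The sketch below is one possible way to do that. It only relies on the update() call shown above. The source does not say which attribute signals that processing is finished, so the readiness check is passed in by the caller; wait_until_ready, is_ready, interval and timeout are hypothetical names.

```python
import time
from typing import Callable


def wait_until_ready(document, is_ready: Callable[[object], bool],
                     interval: float = 2.0, timeout: float = 120.0) -> bool:
    """Call document.update() until is_ready(document) is true or the timeout passes.

    `is_ready` is supplied by the caller (an assumption of this sketch).
    Returns True if the Document became ready, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        document.update()  # refresh the local Document from the server
        if is_ready(document):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Passing the readiness check as a function keeps the loop independent of any particular Document attribute, so it can be adapted once you know which field to inspect in your setup.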

Modify Document

If you would like to use the SDK to modify a Document’s metadata, such as the dataset status or the assignee, you can do it like this:

document.assignee = ASSIGNEE_ID
document.dataset_status = 2

document.save_meta_data()

Update Document

If there are changes in the Document in the Konfuzio Server, you can update the Document with:

document.update()

If a Document is part of the Training or Test set, you can also update it by updating the entire Project via project.get(update=True). However, for Projects with many Documents it can be faster to update only the relevant Documents.

Download PDFs

To get the PDFs of the Documents, you can use get_file().

for document in my_project.documents:
    document.get_file()

This will download the OCR version of the Document which contains the text, the Bounding Boxes information of the characters and the image of the Document.

In the Document folder, you will see a new file with the original name followed by “_ocr”.

If you want the original version of the Document (without OCR), you can use ocr_version=False.

for document in my_project.documents:
    document.get_file(ocr_version=False)

In the Document folder, you will see a new file with the original name.

Download pages as images

To get the Pages of the Document as png images, you can use get_images().

for document in my_project.documents:
    document.get_images()

You will get one png image named “page_number_of_page.png” for each Page in the Document.

Download bounding boxes of the characters

To get the Bounding Boxes information of the characters, you can use get_bbox().

for document in my_project.documents:
    document.get_bbox()

You will get a file named “bbox.json5”.

After downloading these files, the paths to them will also become available in the Project instance. For example, you can get the path to the file with the Document text with:

my_project.documents_folder

Delete Document

Delete Document Locally

To locally delete a Document, you can use:

document.delete()

The Document will be deleted from your local data folder, but it will remain in the Konfuzio Server. If you want to get it again you can update the Project.

Delete Document Online

If you would like to delete a Document on the remote server, you can use the Document.delete method with delete_online set to True. You can only delete Documents with a dataset status of None (0). Be careful! Once the Document is deleted online, there is no way to recover it.

document.dataset_status = 0

document.save_meta_data()
document.delete(delete_online=True)

If delete_online is set to False (the default), the Document will only be deleted on your local machine, and will be reloaded next time you load the Project, or if you run the Project.init_or_update_document method directly.
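Because online deletion cannot be undone, it may help to wrap the call in a guard that refuses to act unless the dataset status is None (0). The sketch below uses only the dataset_status attribute and the delete(delete_online=True) call shown above; the function name delete_online_if_allowed is hypothetical and not part of the SDK.

```python
def delete_online_if_allowed(document) -> bool:
    """Delete the Document on the server only if its dataset status is None (0).

    Returns True if the delete call was made, False if it was skipped.
    """
    if document.dataset_status != 0:
        # Skip Documents in Preparation, Training, Test or Excluded.
        return False
    document.delete(delete_online=True)  # irreversible on the server
    return True
```

Unlike the snippet above, this guard does not change the dataset status itself, so a Document that is still part of the Training or Test set is left untouched.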