Get Started¶
Install SDK¶
To test our SDK, you need an account on the Konfuzio Server and must initialize the package before using it. If you are using PyCharm, have a look at Quickstart with Pycharm.
1. Sign up in Konfuzio Server¶
Register for free in the Konfuzio Server.
2. Install konfuzio_sdk package¶
Install the Python package directly in your working directory with:
pip install konfuzio_sdk
Notes:
Supported Python versions are 3.8, 3.9, 3.10 and 3.11.
Please use Python 3.8 if you plan to upload your AIs to a self-hosted Konfuzio Server environment.
If you are not using a virtual environment, you may need to add the installation directory to your PATH.
3. Initialize the package¶
After the installation, initialize the package in your working directory with:
konfuzio_sdk init
This will require your credentials to access the Konfuzio Server.
At the end, one file will be created in your working directory: .env.
The .env file contains the credentials to access the app and should not become public.
4. Download the data¶
To download the data from your Konfuzio project you need to specify the Project ID. You can check your Project ID by selecting the project in the Projects tab in the Web App. The ID of the Project is shown in the URL. Suppose that your Project ID is 123:
konfuzio_sdk export_project 123
The data from the documents that you uploaded in your Konfuzio project will be downloaded to a folder called data_123.
Note: Only Documents in the Training and Test sets are downloaded.
Example Usage¶
Make sure to set up your Project (so that you can retrieve the Project ID) using our Konfuzio Guide.
Project¶
Retrieve all information available for your Project:
from konfuzio_sdk.data import Project
my_project = Project(id_=YOUR_PROJECT_ID)
The information will be stored in the data folder that you defined during the package initialization. A subfolder will be created for each Document in the Project.
Whenever the Project changes in the Konfuzio Server, the local Project can be updated this way:
my_project.get(update=True)
To make sure that your Project is loaded with all the latest data:
my_project = Project(id_=YOUR_PROJECT_ID, update=True)
Documents¶
To access the Documents in the Project you can use:
documents = my_project.documents
By default, it will get the Documents with training status (dataset_status = 2). The code for the status is:
None: 0
Preparation: 1
Training: 2
Test: 3
Excluded: 4
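The status codes above can be kept in one place as an enumeration. This is a minimal sketch: the DatasetStatus name is hypothetical and not part of the SDK; only the numeric values are taken from the list above.

```python
from enum import IntEnum


class DatasetStatus(IntEnum):
    """Dataset status codes as listed above (hypothetical helper, not an SDK class)."""

    NONE = 0
    PREPARATION = 1
    TRAINING = 2
    TEST = 3
    EXCLUDED = 4


# Translate between the numeric code and a readable name.
training_name = DatasetStatus(2).name
test_code = int(DatasetStatus.TEST)
print(training_name, test_code)
```

Because IntEnum members compare equal to plain integers, a value such as DatasetStatus.TRAINING can be used wherever the numeric code 2 is expected.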
The Test Documents can be accessed directly by:
test_documents = my_project.test_documents
For more details, you can check out the Project documentation.
By default, you get four files for each Document, containing its text, Pages, Annotation Sets and Annotations. You can see these files inside the Document folder.
document.txt - Contains the text of the Document. If OCR was used, it will correspond to the result from the OCR.
x02 328927/10103/00104
Abrechnung der Brutto/Netto-Bezüge für Dezember 2018 22.05.2018 Bat: 1
Personal-Nr. Geburtsdatum ski Faktor Ki,Frbtr.Konfession ‚Freibetragjährl.! |Freibetrag mt! |DBA iGleitzone 'St.-Tg. VJuUr. üb. |Url. Anspr. Url.Tg.gen. |Resturlaub
00104 150356 1 | ‚ev 30 400 3000 3400
SV-Nummer |Krankenkasse KK%®|PGRS Bars jum.SV-Tg. Anw. Tage |UrlaubTage Krankh. Tg. Fehlz. Tage
50150356B581 AOK Bayern Die Gesundheitskas 157 101 1111 1 30
Eintritt ‚Austritt Anw.Std. Urlaub Std. |Krankh. Std. |Fehlz. Std.
170299 L L l L l l
- + Steuer-ID IMrB? Zeitlohn Sta.|Überstd. |Bez. Sta.
Teststraße123
12345 Testort 12345678911 ı ı \
B/N
Pers.-Nr. 00104 x02
Abt.-Nr. A12 10103 HinweisezurAbrechnung
pages.json5 - Contains information of each Page of the Document (for example, their ids and sizes).
[
  {
    "id": 1923,
    "image": "/page/show/1923/",
    "number": 1,
    "original_size": [
      595.2,
      841.68
    ],
    "size": [
      1414,
      2000
    ]
  }
]
annotation_sets.json5 - Contains information of each section in the Document (for example, their ids and Label Sets).
[
  {
    "id": 78730,
    "position": 1,
    "section_label": 63
  },
  {
    "id": 292092,
    "position": 1,
    "section_label": 64
  }
]
annotations.json5 - Contains information of each Annotation in the Document (for example, their Labels and Bounding Boxes).
[
  {
    "accuracy": null,
    "bbox": {
      "bottom": 44.369,
      "line_index": 1,
      "page_index": 0,
      "top": 35.369,
      "x0": 468.48,
      "x1": 527.04,
      "y0": 797.311,
      "y1": 806.311
    },
    "bboxes": [
      {
        "bottom": 44.369,
        "end_offset": 169,
        "line_number": 2,
        "offset_string": "22.05.2018",
        "offset_string_original": "22.05.2018",
        "page_index": 0,
        "start_offset": 159,
        "top": 35.369,
        "x0": 468.48,
        "x1": 527.04,
        "y0": 797.311,
        "y1": 806.311
      }
    ],
    "created_by": 59,
    "custom_offset_string": false,
    "end_offset": 169,
    "get_created_by": "[email protected]",
    "get_revised_by": "n/a",
    "id": 4419937,
    "is_correct": true,
    "label": 867,
    "label_data_type": "Date",
    "label_text": "Austellungsdatum",
    "label_threshold": 0.1,
    "normalized": "2018-05-22",
    "offset_string": "22.05.2018",
    "offset_string_original": "22.05.2018",
    "revised": false,
    "revised_by": null,
    "section": 78730,
    "section_label_id": 63,
    "section_label_text": "Lohnabrechnung",
    "selection_bbox": {
      "bottom": 44.369,
      "line_index": 1,
      "page_index": 0,
      "top": 35.369,
      "x0": 468.48,
      "x1": 527.04,
      "y0": 797.311,
      "y1": 806.311
    },
    "start_offset": 159,
    "translated_string": null
  },
  ...
]
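The sample files above can be cross-checked against each other with the standard library alone. The sketch below copies a few fields from the excerpts and checks how offsets, Bounding Box coordinates and Annotation Set references relate. It assumes that the .json5 files hold plain JSON and that page_index is a zero-based index into the list of Pages. The relationships it prints are observations about this one sample, not guarantees stated by the SDK.

```python
import json

# Samples copied from the file excerpts above, reduced to the fields used here.
# Assumption: the .json5 files hold plain JSON, so the json module can parse them.
pages = json.loads(
    '[{"id": 1923, "number": 1, "original_size": [595.2, 841.68], "size": [1414, 2000]}]'
)
annotation_sets = json.loads(
    '[{"id": 78730, "position": 1, "section_label": 63},'
    ' {"id": 292092, "position": 1, "section_label": 64}]'
)
annotation = {
    "start_offset": 159,
    "end_offset": 169,
    "offset_string": "22.05.2018",
    "section": 78730,
    "section_label_id": 63,
    "bbox": {"page_index": 0, "top": 35.369, "bottom": 44.369, "y0": 797.311, "y1": 806.311},
}

# Assumption: page_index is a zero-based position in the pages list.
page = pages[annotation["bbox"]["page_index"]]
page_width, page_height = page["original_size"]
image_width, image_height = page["size"]

# Ratio between the "size" and "original_size" entries of the Page.
scale_x = image_width / page_width
scale_y = image_height / page_height

# 1. The offset span is as long as the annotated string.
span_length = annotation["end_offset"] - annotation["start_offset"]
offsets_match = span_length == len(annotation["offset_string"])

# 2. In this sample, top/bottom are distances from the top edge of the Page,
#    while y0/y1 are distances from the bottom edge.
bbox = annotation["bbox"]
top_matches = abs((page_height - bbox["y1"]) - bbox["top"]) < 1e-6
bottom_matches = abs((page_height - bbox["y0"]) - bbox["bottom"]) < 1e-6

# 3. "section" refers to an Annotation Set whose "section_label" equals "section_label_id".
sets_by_id = {entry["id"]: entry for entry in annotation_sets}
label_set_matches = (
    sets_by_id[annotation["section"]]["section_label"] == annotation["section_label_id"]
)

print(round(scale_x, 3), round(scale_y, 3))
print(offsets_match, top_matches, bottom_matches, label_set_matches)
```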
Upload Document¶
To upload a new file (see Supported File Types) to your Project using the SDK, you can use the Document.from_file method. You can create the Document synchronously or asynchronously. The synchronous way waits for the Document to be processed and returns a ready Document. The asynchronous way only returns an empty Document object, which you can use to check the status of the processing. Simply call document.update() to check if the Document is ready and the OCR processing is done.
If you want to upload a file and start working with it as soon as the OCR processing step is done, we recommend using from_file with sync set to True, as it will wait for the Document to be processed and then return a ready Document. Beware, this may take from a few seconds up to over a minute depending on the size of the file.
from konfuzio_sdk.data import Document
document = Document.from_file(FILE_PATH, project=my_project, sync=True)
If, however, you are uploading a large number of files and don’t want to wait for them to be processed, you can use the asynchronous option, which returns an empty Document object. You can then use the update method to check if the Document is ready and the OCR processing is done.
document = Document.from_file(FILE_PATH, project=my_project, sync=False)
Once the OCR process is done, you can get the Document OCR results with:
document.update()
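The asynchronous flow described above amounts to a polling loop around document.update(). The following is a minimal sketch of such a loop. The wait_until_ready helper and the is_ready check are hypothetical: the text above does not say which attribute signals that OCR has finished, so the check is left to the caller. FakeDocument is a stand-in used only to run the helper without a server.

```python
import time


def wait_until_ready(document, is_ready, interval=5.0, timeout=300.0):
    """Call document.update() repeatedly until is_ready(document) is true.

    is_ready is a caller-supplied check (hypothetical, not an SDK function).
    Returns True when the check passes, False if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while True:
        document.update()
        if is_ready(document):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeDocument:
    """Stand-in that becomes ready after a fixed number of update() calls."""

    def __init__(self, updates_needed):
        self.updates_needed = updates_needed
        self.text = None

    def update(self):
        self.updates_needed -= 1
        if self.updates_needed <= 0:
            self.text = "OCR result"


fake = FakeDocument(updates_needed=3)
ready = wait_until_ready(fake, lambda doc: doc.text is not None, interval=0.0, timeout=5.0)
print(ready)
```

With a real Document, a longer interval keeps the number of requests to the server low while a batch of uploads is being processed.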
Modify Document¶
If you would like to use the SDK to modify a Document’s metadata, such as the dataset status or the assignee, you can do it like this:
document.assignee = ASSIGNEE_ID
document.dataset_status = 2
document.save_meta_data()
Update Document¶
If there are changes in the Document in the Konfuzio Server, you can update the Document with:
document.update()
If a Document is part of the Training or Test set, you can also update it by updating the entire Project via project.get(update=True). However, for Projects with many Documents it can be faster to update only the relevant Documents.
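Updating only the relevant Documents can be sketched as a filter over the Project's Documents. The update_selected helper is hypothetical, and it assumes each Document exposes its ID as id_, mirroring the id_ argument of Project. StandInDocument exists only so that the sketch runs without a server.

```python
def update_selected(documents, wanted_ids):
    """Call update() only on Documents whose id_ is in wanted_ids.

    Assumption: each Document exposes its ID as `id_`.
    Returns the IDs that were updated, in iteration order.
    """
    wanted = set(wanted_ids)
    updated = []
    for document in documents:
        if document.id_ in wanted:
            document.update()
            updated.append(document.id_)
    return updated


class StandInDocument:
    """Minimal stand-in that records how often update() was called."""

    def __init__(self, id_):
        self.id_ = id_
        self.update_calls = 0

    def update(self):
        self.update_calls += 1


docs = [StandInDocument(1), StandInDocument(2), StandInDocument(3)]
updated_ids = update_selected(docs, wanted_ids=[1, 3])
print(updated_ids)
```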
Download PDFs¶
To get the PDFs of the Documents, you can use get_file().
for document in my_project.documents:
    document.get_file()
This will download the OCR version of the Document which contains the text, the Bounding Boxes information of the characters and the image of the Document.
In the Document folder, you will see a new file with the original name followed by “_ocr”.
If you want the original version of the Document (without OCR), you can use ocr_version=False.
for document in my_project.documents:
    document.get_file(ocr_version=False)
In the Document folder, you will see a new file with the original name.
Download pages as images¶
To get the Pages of the Document as png images, you can use get_images().
for document in my_project.documents:
    document.get_images()
You will get one png image named “page_number_of_page.png” for each Page in the Document.
Download bounding boxes of the characters¶
To get the Bounding Boxes information of the characters, you can use get_bbox().
for document in my_project.documents:
    document.get_bbox()
You will get a file named “bbox.json5”.
After downloading these files, the paths to them will also become available in the Project instance. For example, you can get the path to the file with the Document text with:
my_project.documents_folder
Delete Document¶
Delete Document Locally¶
To locally delete a Document, you can use:
document.delete()
The Document will be deleted from your local data folder, but it will remain in the Konfuzio Server. If you want to get it again you can update the Project.
Delete Document Online¶
If you would like to delete a Document on the remote server, you can use the Document.delete method with the delete_online setting set to True. You can only delete Documents with a dataset status of None (0). Be careful! Once the Document is deleted online, we will have no way of recovering it.
document.dataset_status = 0
document.save_meta_data()
document.delete(delete_online=True)
If delete_online is set to False (the default), the Document will only be deleted on your local machine, and will be reloaded next time you load the Project, or if you run the Project.init_or_update_document method directly.
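Since online deletion cannot be undone and is only allowed for a dataset status of None (0), it can help to wrap the call in a guard. The delete_online_if_allowed helper below is a hypothetical sketch. It relies only on the dataset_status attribute and the delete(delete_online=True) call described in this section. StandInDocument is used only to run it without a server.

```python
def delete_online_if_allowed(document):
    """Delete a Document on the server only if its dataset status is None (0).

    Hypothetical helper, not part of the SDK. Raises ValueError otherwise,
    so that a Document in another dataset is never removed by accident.
    """
    if document.dataset_status != 0:
        raise ValueError(
            f"Dataset status is {document.dataset_status}; set it to 0 and "
            "call save_meta_data() before deleting online."
        )
    document.delete(delete_online=True)


class StandInDocument:
    """Minimal stand-in that records whether an online delete was requested."""

    def __init__(self, dataset_status):
        self.dataset_status = dataset_status
        self.deleted_online = False

    def delete(self, delete_online=False):
        self.deleted_online = delete_online


doc = StandInDocument(dataset_status=0)
delete_online_if_allowed(doc)
print(doc.deleted_online)
```

Raising an error instead of silently changing the status keeps the decision to remove a Document from the Training or Test set explicit.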