Environment¶
You need to install the Konfuzio SDK before diving into the tutorial.
To get up and running quickly, you can use our Colab Quick Start notebook.
As an alternative you can follow the installation section to install and initialize the Konfuzio SDK locally or on an environment of your choice.
Example Usage¶
Make sure to set up your Project (so that you can retrieve the Project ID) using our Konfuzio Guide.
Project¶
Retrieve all information available for your Project:
from konfuzio_sdk.data import Project, Document
my_project = Project(id_=YOUR_PROJECT_ID)
The data is stored in the folder that you specified as the data location when you initialized the package. A subfolder is created for each Document in the Project.
Whenever the Project changes on the Konfuzio Server, you can update the local Project like this:
my_project.get(update=True)
Project 29335
To make sure that your Project is loaded with all the latest data:
my_project = Project(id_=YOUR_PROJECT_ID, update=True)
Documents¶
Every Document has a status that indicates its current processing stage. The status codes are:
Queuing for OCR: 0
Queuing for extraction: 1
Done: 2
Could not be processed: 111
OCR in progress: 10
Extraction in progress: 20
Queuing for categorization: 3
Categorization in progress: 30
Queuing for splitting: 4
Splitting in progress: 40
Waiting for splitting confirmation: 41
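The status codes above can be kept in a small lookup table so that log messages and checks stay readable. The following is a minimal sketch: the names `DOCUMENT_STATUS`, `describe_status` and `is_terminal` are hypothetical helpers and not part of the Konfuzio SDK, and treating only 2 and 111 as final states is an assumption made for this example.

```python
# Hypothetical lookup table built from the status codes listed above.
DOCUMENT_STATUS = {
    0: "Queuing for OCR",
    1: "Queuing for extraction",
    2: "Done",
    3: "Queuing for categorization",
    4: "Queuing for splitting",
    10: "OCR in progress",
    20: "Extraction in progress",
    30: "Categorization in progress",
    40: "Splitting in progress",
    41: "Waiting for splitting confirmation",
    111: "Could not be processed",
}

# Assumption: only "Done" and "Could not be processed" end the processing.
TERMINAL_STATUSES = {2, 111}


def describe_status(code: int) -> str:
    """Return a readable label for a status code, or a fallback for unknown codes."""
    return DOCUMENT_STATUS.get(code, f"Unknown status ({code})")


def is_terminal(code: int) -> bool:
    """Return True if no further processing is expected for this status code."""
    return code in TERMINAL_STATUSES
```

A check such as `is_terminal(code)` can then decide whether it is worth waiting any longer for a Document.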
To access the Documents in the Project you can use:
documents = my_project.documents
By default, this returns the Documents with the Training status (dataset_status = 2). The dataset status codes are:
None: 0
Preparation: 1
Training: 2
Test: 3
Excluded: 4
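If you need Documents with a different dataset status, you can filter on the dataset_status attribute yourself. The sketch below is illustrative only: `DatasetStatus` and `filter_by_dataset_status` are hypothetical names rather than SDK classes, and the stand-in objects merely mimic the dataset_status attribute of a Document.

```python
from enum import IntEnum
from types import SimpleNamespace


class DatasetStatus(IntEnum):
    """Illustrative names for the dataset status codes above (not an SDK class)."""

    NONE = 0
    PREPARATION = 1
    TRAINING = 2
    TEST = 3
    EXCLUDED = 4


def filter_by_dataset_status(documents, status):
    """Return the objects whose dataset_status equals the given status."""
    return [doc for doc in documents if doc.dataset_status == status]


# Stand-in objects; real Document instances would be passed in the same way.
sample_documents = [
    SimpleNamespace(id_=1, dataset_status=2),
    SimpleNamespace(id_=2, dataset_status=3),
    SimpleNamespace(id_=3, dataset_status=0),
]
training = filter_by_dataset_status(sample_documents, DatasetStatus.TRAINING)
```

Because `IntEnum` members compare equal to plain integers, the same function works whether the attribute holds `2` or `DatasetStatus.TRAINING`.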
The Test Documents can be accessed directly by:
test_documents = my_project.test_documents
For more details, you can check out the Project Documentation.
By default, you get four files for each Document, containing its text, Pages, Annotation Sets and Annotations. You can find these files inside the Document folder.
document.txt - Contains the text of the Document. If OCR was used, it corresponds to the OCR result.
x02 328927/10103/00104
Abrechnung der Brutto/Netto-Bezüge für Dezember 2018 22.05.2018 Bat: 1
Personal-Nr. Geburtsdatum ski Faktor Ki,Frbtr.Konfession ‚Freibetragjährl.! |Freibetrag mt! |DBA iGleitzone 'St.-Tg. VJuUr. üb. |Url. Anspr. Url.Tg.gen. |Resturlaub
00104 150356 1 | ‚ev 30 400 3000 3400
SV-Nummer |Krankenkasse KK%®|PGRS Bars jum.SV-Tg. Anw. Tage |UrlaubTage Krankh. Tg. Fehlz. Tage
50150356B581 AOK Bayern Die Gesundheitskas 157 101 1111 1 30
Eintritt ‚Austritt Anw.Std. Urlaub Std. |Krankh. Std. |Fehlz. Std.
170299 L L l L l l
- + Steuer-ID IMrB? Zeitlohn Sta.|Überstd. |Bez. Sta.
Teststraße123
12345 Testort 12345678911 \ ı ı \
B/N
Pers.-Nr. 00104 x02
Abt.-Nr. A12 10103 HinweisezurAbrechnung
pages.json5 - Contains information about each Page of the Document (for example, its ID and size).
[
{
"id": 1923,
"image": "/page/show/1923/",
"number": 1,
"original_size": [
595.2,
841.68
],
"size": [
1414,
2000
]
}
]
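The two size entries can be used to convert coordinates between the original page and its rendered image. The sketch below assumes that "original_size" is the page size in PDF points and "size" is the rendered image size in pixels; this interpretation is inferred from the sample values and is not confirmed here. The function name `page_scale_factors` is hypothetical.

```python
import json

# The sample entry from pages.json5 shown above, embedded as a plain JSON string.
PAGES_JSON = """
[
  {
    "id": 1923,
    "image": "/page/show/1923/",
    "number": 1,
    "original_size": [595.2, 841.68],
    "size": [1414, 2000]
  }
]
"""


def page_scale_factors(page):
    """Return (scale_x, scale_y) that map original page units to image pixels."""
    original_width, original_height = page["original_size"]
    image_width, image_height = page["size"]
    return image_width / original_width, image_height / original_height


pages = json.loads(PAGES_JSON)
scale_x, scale_y = page_scale_factors(pages[0])
print(f"Page {pages[0]['number']}: scale_x={scale_x:.4f}, scale_y={scale_y:.4f}")
```

Multiplying an x coordinate given in original page units by `scale_x` then yields the corresponding pixel column in the Page image, under the assumption stated above.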
annotation_sets.json5 - Contains information about each Annotation Set in the Document and the Annotations that constitute it.
{
"id": 78730,
"label_set": {
"api_name": "Lohnabrechnung",
"description": "",
"has_multiple_annotation_sets": false,
"id": 63,
"name": "Lohnabrechnung"
},
"labels": [
{
"annotations": [
{
"confidence": 0.93,
"created_by": "[email protected]",
"custom_offset_string": false,
"Document": 44823,
"id": 4420351,
"is_correct": true,
"normalized": 2189.07,
"offset_string": "2.189,07",
"offset_string_original": "2.189,07",
"origin": "api.v2",
"revised": false,
"revised_by": null,
"selection_bbox": {
"page_index": 0,
"x0": 516.48,
"x1": 562.8,
"y0": 76.829,
"y1": 87.829
},
"span": [
{
"end_offset": 3785,
"offset_string": "2.189,07",
"offset_string_original": "2.189,07",
"page_index": 0,
"start_offset": 3777,
"x0": 516.48,
"x1": 562.8,
"y0": 76.829,
"y1": 87.829
}
],
"translated_string": null
}
annotations.json5 - Contains information about each Annotation in the Document (for example, its Label and Bounding Boxes).
[
{
"accuracy": null,
"bbox": {
"bottom": 44.369,
"line_index": 1,
"page_index": 0,
"top": 35.369,
"x0": 468.48,
"x1": 527.04,
"y0": 797.311,
"y1": 806.311
},
"bboxes": [
{
"bottom": 44.369,
"end_offset": 169,
"line_number": 2,
"offset_string": "22.05.2018",
"offset_string_original": "22.05.2018",
"page_index": 0,
"start_offset": 159,
"top": 35.369,
"x0": 468.48,
"x1": 527.04,
"y0": 797.311,
"y1": 806.311
}
],
"created_by": 59,
"custom_offset_string": false,
"end_offset": 169,
"get_created_by": "[email protected]",
"get_revised_by": "n/a",
"id": 4419937,
"is_correct": true,
"label": 867,
"label_data_type": "Date",
"label_text": "Austellungsdatum",
"label_threshold": 0.1,--
"normalized": "2018-05-22",
"offset_string": "22.05.2018",
"offset_string_original": "22.05.2018",
"revised": false,
"revised_by": null,
"section": 78730,
"section_label_id": 63,
"section_label_text": "Lohnabrechnung",
"selection_bbox": {
"bottom": 44.369,
"line_index": 1,
"page_index": 0,
"top": 35.369,
"x0": 468.48,
"x1": 527.04,
"y0": 797.311,
"y1": 806.311
},
"start_offset": 159,
"translated_string": null
},
...
]
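In the sample above, each Bounding Box carries two vertical coordinate pairs: "y0"/"y1" and "top"/"bottom". The values suggest that "y0" and "y1" are measured from the bottom edge of the Page, while "top" and "bottom" are measured from the top edge. The sketch below checks this relationship, assuming a page height of 841.68 as in the pages.json5 sample; the relationship is inferred from the sample values only, and `to_top_left` is a hypothetical helper.

```python
def to_top_left(y0: float, y1: float, page_height: float) -> tuple:
    """Convert bottom-origin coordinates (y0, y1) to top-origin (top, bottom)."""
    top = page_height - y1
    bottom = page_height - y0
    return top, bottom


# Values taken from the "bbox" entry of the sample Annotation above.
PAGE_HEIGHT = 841.68  # assumption: same page size as in the pages.json5 sample
top, bottom = to_top_left(y0=797.311, y1=806.311, page_height=PAGE_HEIGHT)
print(f"top={top:.3f}, bottom={bottom:.3f}")
```

For the sample Annotation, the result agrees with the stored "top" and "bottom" values up to floating point rounding.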
When needed, calling Document.get_bbox() downloads an additional file, bbox.zip, to the Document folder. It contains the Bounding Box information of the characters of the Document. Because this file can be quite large, it is compressed in the Zip format. The decompressed file is a JSON file whose keys correspond to the indices of the characters in the Document text. The value associated with each key contains the Bounding Box information of that character. For example, for characters 1000 and 1002 we would have:
{
"1000": {
"adv": 2.58,
"bottom": 128.13,
"doctop": 118.13,
"fontname": "GlyphLessFont",
"height": 10.0,
"line_number": 14,
"object_type": "char",
"page_number": 1,
"size": 10.0,
"text": "n",
"top": 118.13,
"upright": 1,
"width": 2.58,
"x0": 481.74,
"x1": 484.32,
"y0": 713.55,
"y1": 723.55
},
"1002": {
"adv": 2.64,
"bottom": 128.13,
"doctop": 118.13,
"fontname": "GlyphLessFont",
"height": 10.0,
"line_number": 14,
"object_type": "char",
"page_number": 1,
"size": 10.0,
"text": "S",
"top": 118.13,
"upright": 1,
"width": 2.64,
"x0": 486.72,
"x1": 489.36,
"y0": 713.55,
"y1": 723.55
},
// ...
}
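A common use of the character Bounding Boxes is to compute the box that encloses a span of text. The following sketch builds a tiny in-memory archive from the two sample characters and merges their boxes. It assumes that the archive contains a single member named bbox.json5 whose content parses as plain JSON; the functions `load_char_bboxes` and `union_bbox` are hypothetical and not part of the SDK.

```python
import io
import json
import zipfile

# Reduced copies of the two sample characters shown above.
SAMPLE = {
    "1000": {"text": "n", "x0": 481.74, "x1": 484.32, "y0": 713.55, "y1": 723.55},
    "1002": {"text": "S", "x0": 486.72, "x1": 489.36, "y0": 713.55, "y1": 723.55},
}


def load_char_bboxes(zip_bytes: bytes, member: str = "bbox.json5") -> dict:
    """Read the character boxes from a zip archive and use integer indices as keys."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        raw = json.loads(archive.read(member).decode("utf-8"))
    return {int(index): box for index, box in raw.items()}


def union_bbox(char_bboxes: dict, indices) -> dict:
    """Return the smallest box enclosing all listed characters that have a box."""
    boxes = [char_bboxes[i] for i in indices if i in char_bboxes]
    if not boxes:
        raise ValueError("No Bounding Boxes found for the given character indices.")
    return {
        "x0": min(box["x0"] for box in boxes),
        "x1": max(box["x1"] for box in boxes),
        "y0": min(box["y0"] for box in boxes),
        "y1": max(box["y1"] for box in boxes),
    }


# Build an in-memory archive so that the example runs without any downloaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("bbox.json5", json.dumps(SAMPLE))

char_bboxes = load_char_bboxes(buffer.getvalue())
span_box = union_bbox(char_bboxes, range(1000, 1003))
```

Indices without an entry, such as 1001 in this sample, are skipped, so a span may include characters for which no box is stored.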
After downloading these files, their paths will become available in the Project instance.
You can get the path to the folder containing the Documents’ folders with:
my_project.documents_folder
'data_29335/documents'
And you can get the path to the file with the Document text with:
document.txt_file_path
'data_29335/documents/6665836/document.txt'
Upload Document¶
Before you can upload a new file to your Project using the Konfuzio SDK, you must have completed the following steps:
Register for a Konfuzio account
Create a Project on Konfuzio
Install the Konfuzio SDK
For detailed instructions on these preliminary steps, refer to the Get Started guide above.
After completing the above steps, you can proceed with uploading a new file to your Project using the Konfuzio SDK. The file must be of one of the types listed in Supported File Types.
Here, we focus on the Document.from_file method to create a Konfuzio Document.
A Konfuzio Document is an object representing the file you upload. Once the file has been processed by the Konfuzio Server, the Document contains its OCR (Optical Character Recognition) information.
Synchronous and Asynchronous Upload¶
You have two options for uploading your file: a synchronous method and an asynchronous method. The method is determined by the sync parameter of the from_file method.
Synchronous upload (sync=True): The file is uploaded to the Konfuzio servers, and the method waits for the file to be processed. Once done, it returns a Document object with the OCR information. This is useful if you want to start working with the Document immediately after the OCR processing is completed.
Here’s an example of how to use the from_file method with sync set to True:
document = Document.from_file(FILE_PATH, project=my_project, sync=True)
Asynchronous upload (sync=False): With this setting, the method immediately returns an empty Document object after initiating the upload. The OCR processing takes place in the background. This method is advantageous when uploading a large file or a large number of files, as it doesn’t require waiting for each file’s processing to complete.
Here is how to use the asynchronous method:
document = Document.from_file(FILE_PATH, project=my_project, sync=False)
After an asynchronous upload, you can check the status of the Document processing by calling the Document.update() method on the returned Document object. If the Document is ready, this method updates the Document object with the OCR information.
If the Document is not ready yet, you need to call Document.update() again at a later time. You can do this manually or by setting up a loop, depending on your application’s workflow.
To check if the Document is ready and update it with the OCR information, you can implement a custom polling strategy like this:
import time

# Poll the server up to two times, waiting 3 seconds after a failed first check.
for i in range(2):
    document.update()
    if document.ocr_ready is True:
        print(document.text)
        break
    time.sleep(i * 10 + 3)
For a more sophisticated approach that uses a callback function instead of polling for asynchronously uploaded Documents, you can check out our tutorial on how to use ngrok to receive callbacks from the Konfuzio Server.
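The polling loop shown earlier in this section can also be wrapped in a reusable helper that reports whether the Document became ready. This is a minimal sketch: it relies only on the update() method and the ocr_ready attribute used above, while the function name `wait_for_ocr`, its parameters and its default values are assumptions made for this example.

```python
import time


def wait_for_ocr(document, attempts=5, base_delay=3.0, step=10.0, sleep=time.sleep):
    """Poll document.update() until document.ocr_ready is true.

    Returns True if the Document became ready within the given number of
    attempts and False otherwise. The waiting time grows linearly with each
    attempt: base_delay, base_delay + step, base_delay + 2 * step, ...
    """
    for attempt in range(attempts):
        document.update()
        if document.ocr_ready:
            return True
        # Do not wait after the last attempt; there is nothing left to check.
        if attempt < attempts - 1:
            sleep(attempt * step + base_delay)
    return False
```

Passing the sleep function as a parameter keeps the helper easy to test, because a test can record the requested delays instead of actually waiting.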
Timeout Parameter¶
When making a server request, there’s a default timeout value of 2 minutes. This means that if the server doesn’t respond within 2 minutes, the operation will stop waiting for a response and return an error. If you’re uploading a larger file, it might take more time to process, and the default timeout value might not be sufficient. In such a case, you can increase the timeout by setting the timeout parameter to a higher value.
document = Document.from_file(FILE_PATH, project=my_project, timeout=300, sync=True)
Modify Document¶
If you would like to use the SDK to modify a Document’s metadata, such as the dataset status or the assignee, you can do it like this:
document.assignee = ASSIGNEE_ID
document.dataset_status = 2
document.save_meta_data()
Update Document¶
If there are changes in the Document in the Konfuzio Server, you can update your local version of the Document with:
document.update()
If a Document is part of the Training or Test set, you can also update it by updating the entire Project via Project.get(update=True). However, for Projects with many Documents it can be faster to update only the relevant Documents.
Download PDFs¶
To get the PDFs of the Documents, you can use get_file().
for document in my_project.documents:
    document.get_file()
This will download the OCR version of the Document which contains the text, the Bounding Boxes information of the characters and the image of the Document.
In the Document folder, you will see a new file with the original name followed by “_ocr”.
If you want the original version of the Document (without OCR), you can use ocr_version=False.
for document in my_project.documents:
    document.get_file(ocr_version=False)
In the Document folder, you will see a new file with the original name.
Download pages as images¶
To get the Pages of the Document as PNG images, you can use get_images().
for document in my_project.documents:
    document.get_images()
You will get one png image named “page_number_of_page.png” for each Page in the Document.
Download bounding boxes of the characters¶
To get the Bounding Boxes information of the characters, you can use get_bbox().
for document in my_project.documents:
    document.get_bbox()
You will get a file named “bbox.zip” in the Document folder. This file contains the “bbox.json5” file. You can find the path to the zip file in the Document instance with:
document.bbox_file_path
Delete Document¶
Delete Document Locally¶
To locally delete a Document, you can use:
document.delete()
The Document will be deleted from your local data folder, but it will remain in the Konfuzio Server. If you want to get it again you can update the Project.
Delete Document Online¶
If you would like to delete a Document on the remote server, use the Document.delete method with the delete_online parameter set to True. You can only delete Documents with a dataset status of None (0). Be careful! Once the Document is deleted online, we will have no way of recovering it.
document.delete(delete_online=True)
If delete_online is set to False (the default), the Document will only be deleted on your local machine, and it will be reloaded the next time you load the Project.
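Because an online deletion cannot be undone and is only possible for Documents with a dataset status of None (0), it can help to guard the call. The following sketch uses only the dataset_status attribute and the delete(delete_online=True) call described above; the function name `delete_online_if_allowed` is hypothetical and not part of the SDK.

```python
def delete_online_if_allowed(document) -> bool:
    """Delete the Document online only if its dataset status is None (0).

    Returns True if the deletion was requested and False if it was skipped.
    """
    if document.dataset_status != 0:
        # Documents with any other dataset status cannot be deleted online.
        return False
    document.delete(delete_online=True)
    return True
```

A caller can use the return value to report which Documents were skipped, for example because they are still part of the Training or Test set.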