Code Examples¶
Example Usage¶
Project¶
Retrieve all information available for your project:
from konfuzio_sdk.data import Project
my_project = Project(id_=YOUR_PROJECT_ID)
The information is stored in the data folder that you defined during package initialization. A subfolder is created for each document in the project.
Whenever the project changes on the Konfuzio Server, you can update the local project with:
my_project.update()
Documents¶
To access the documents in the project you can use:
documents = my_project.get_documents_by_status()
By default, this returns the documents without a dataset status (dataset_status = 0 (None)). You can request other dataset statuses with the argument ‘dataset_statuses’. The status codes are:
None: 0
Preparation: 1
Training: 2
Test: 3
Excluded: 4
For example, to get all documents in the project, you can do:
documents = my_project.get_documents_by_status(dataset_statuses=[0, 1, 2, 3, 4])
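The numeric codes are easy to mix up in calling code. A minimal sketch, assuming you want named constants on the client side: the DatasetStatus enum below is a hypothetical helper and not part of konfuzio_sdk; its values simply mirror the status codes listed above.

```python
from enum import IntEnum


class DatasetStatus(IntEnum):
    """Hypothetical helper: names for the dataset status codes listed above."""

    NONE = 0
    PREPARATION = 1
    TRAINING = 2
    TEST = 3
    EXCLUDED = 4


# All status codes, equivalent to [0, 1, 2, 3, 4].
all_statuses = [int(status) for status in DatasetStatus]

# Only the statuses used for training and testing.
train_and_test = [int(DatasetStatus.TRAINING), int(DatasetStatus.TEST)]
```

Either list could then be passed as the dataset_statuses argument of get_documents_by_status().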
The training and test documents can be accessed directly by:
training_documents = my_project.documents
test_documents = my_project.test_documents
By default, four files are downloaded for each document. They contain the text, pages, annotation sets and annotations, and you can find them inside the document folder.
document.txt - Contains the text of the document. If OCR was used, this is the OCR output.
x02 328927/10103/00104
Abrechnung der Brutto/Netto-Bezüge für Dezember 2018 22.05.2018 Bat: 1
Personal-Nr. Geburtsdatum ski Faktor Ki,Frbtr.Konfession ‚Freibetragjährl.! |Freibetrag mt! |DBA iGleitzone 'St.-Tg. VJuUr. üb. |Url. Anspr. Url.Tg.gen. |Resturlaub
00104 150356 1 | ‚ev 30 400 3000 3400
SV-Nummer |Krankenkasse KK%®|PGRS Bars jum.SV-Tg. Anw. Tage |UrlaubTage Krankh. Tg. Fehlz. Tage
50150356B581 AOK Bayern Die Gesundheitskas 157 101 1111 1 30
Eintritt ‚Austritt Anw.Std. Urlaub Std. |Krankh. Std. |Fehlz. Std.
170299 L L l L l l
- + Steuer-ID IMrB? Zeitlohn Sta.|Überstd. |Bez. Sta.
Teststraße123
12345 Testort 12345678911 ı ı \
B/N
Pers.-Nr. 00104 x02
Abt.-Nr. A12 10103 HinweisezurAbrechnung
pages.json5 - Contains information about each page of the document (for example, its ID and size).
[
{
"id": 1923,
"image": "/page/show/1923/",
"number": 1,
"original_size": [
595.2,
841.68
],
"size": [
1414,
2000
]
}
]
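The two size fields suggest that the page image is rendered at a higher resolution than the original page. A minimal sketch, assuming original_size and size are both [width, height] pairs (the sample does not state the order or units), that computes the scale factor from the sample entry above. The helper name page_scale is illustrative only.

```python
import json

# Sample content copied from the pages.json5 example above. It is plain JSON,
# so the standard json module can parse it; other json5 files may need a json5 parser.
PAGES_JSON = """
[
  {
    "id": 1923,
    "image": "/page/show/1923/",
    "number": 1,
    "original_size": [595.2, 841.68],
    "size": [1414, 2000]
  }
]
"""


def page_scale(page):
    """Return (x_scale, y_scale) from original_size to size.

    Assumes both fields are [width, height]; this is an assumption, not documented.
    """
    original_width, original_height = page["original_size"]
    width, height = page["size"]
    return width / original_width, height / original_height


pages = json.loads(PAGES_JSON)
x_scale, y_scale = page_scale(pages[0])
print(f"page {pages[0]['number']}: x_scale={x_scale:.4f}, y_scale={y_scale:.4f}")
```

Such a factor could be used to map coordinates given in original page units onto the page image, if the bounding boxes are indeed expressed in those units.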
annotation_sets.json5 - Contains information about each annotation set (section) in the document (for example, its ID and label set).
[
{
"id": 78730,
"position": 1,
"section_label": 63
},
{
"id": 292092,
"position": 1,
"section_label": 64
}
]
annotations.json5 - Contains information about each annotation in the document (for example, its label and bounding boxes).
[
{
"accuracy": null,
"bbox": {
"bottom": 44.369,
"line_index": 1,
"page_index": 0,
"top": 35.369,
"x0": 468.48,
"x1": 527.04,
"y0": 797.311,
"y1": 806.311
},
"bboxes": [
{
"bottom": 44.369,
"end_offset": 169,
"line_number": 2,
"offset_string": "22.05.2018",
"offset_string_original": "22.05.2018",
"page_index": 0,
"start_offset": 159,
"top": 35.369,
"x0": 468.48,
"x1": 527.04,
"y0": 797.311,
"y1": 806.311
}
],
"created_by": 59,
"custom_offset_string": false,
"end_offset": 169,
"get_created_by": "[email protected]",
"get_revised_by": "n/a",
"id": 4419937,
"is_correct": true,
"label": 867,
"label_data_type": "Date",
"label_text": "Austellungsdatum",
"label_threshold": 0.1,
"normalized": "2018-05-22",
"offset_string": "22.05.2018",
"offset_string_original": "22.05.2018",
"revised": false,
"revised_by": null,
"section": 78730,
"section_label_id": 63,
"section_label_text": "Lohnabrechnung",
"selection_bbox": {
"bottom": 44.369,
"line_index": 1,
"page_index": 0,
"top": 35.369,
"x0": 468.48,
"x1": 527.04,
"y0": 797.311,
"y1": 806.311
},
"start_offset": 159,
"translated_string": null
},
...
]
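The three JSON files reference each other. A minimal sketch, using only values copied from the samples above, that checks three relationships visible in that data: the annotation's section matches an annotation set id, the offsets span the annotated string, and top/bottom equal the page height minus y1/y0. These relationships are observations from this one sample, not a documented guarantee, and the helper name check_annotation is hypothetical.

```python
# Values copied from the sample files above (trimmed to the fields used here).
page = {"number": 1, "original_size": [595.2, 841.68]}
annotation_sets = [
    {"id": 78730, "position": 1, "section_label": 63},
    {"id": 292092, "position": 1, "section_label": 64},
]
annotation = {
    "id": 4419937,
    "section": 78730,
    "section_label_id": 63,
    "start_offset": 159,
    "end_offset": 169,
    "offset_string": "22.05.2018",
    "bbox": {"top": 35.369, "bottom": 44.369, "y0": 797.311, "y1": 806.311},
}


def check_annotation(annotation, annotation_sets, page, tolerance=1e-6):
    """Return a dict of named consistency checks for one annotation."""
    sets_by_id = {item["id"]: item for item in annotation_sets}
    annotation_set = sets_by_id.get(annotation["section"])
    page_height = page["original_size"][1]
    bbox = annotation["bbox"]
    span_length = annotation["end_offset"] - annotation["start_offset"]
    return {
        # The annotation points at an existing annotation set with the same label set.
        "section_matches": annotation_set is not None
        and annotation_set["section_label"] == annotation["section_label_id"],
        # The character span is as long as the annotated string.
        "span_matches": span_length == len(annotation["offset_string"]),
        # Assumption: y0/y1 are measured from the page bottom, top/bottom from the page top.
        "top_matches": abs(page_height - bbox["y1"] - bbox["top"]) < tolerance,
        "bottom_matches": abs(page_height - bbox["y0"] - bbox["bottom"]) < tolerance,
    }


checks = check_annotation(annotation, annotation_sets, page)
print(checks)
```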
Download PDFs¶
To get the PDFs of the documents, you can use get_file().
for document in my_project.documents:
    document.get_file()
This downloads the OCR version of the document, which contains the text, the bounding box information of the characters and the image of the document.
In the document folder, you will see a new file with the original name followed by “_ocr”.
If you want the original version of the document (without OCR), you can use ocr_version=False.
for document in my_project.documents:
    document.get_file(ocr_version=False)
In the document folder, you will see a new file with the original name.
Download pages as images¶
To get the pages of the document as PNG images, you can use get_images().
for document in my_project.documents:
    document.get_images()
You will get one PNG image named “page_number_of_page.png” for each page in the document.
Download bounding boxes of the characters¶
To get the bounding box information of the characters, you can use get_bbox().
for document in my_project.documents:
    document.get_bbox()
You will get a file named “bbox.json5”.
After downloading these files, their paths also become available in the project instance. For example, you can get the path to the file with the document text with:
my_project.txt_file_path
Update Document¶
If the document changes on the Konfuzio Server, you can update the local copy with:
document.update()
If a document is part of the training or test set, you can also update it by updating the entire project via project.update(). However, for projects with many documents it can be faster to update only the relevant documents.
Upload Document¶
You can upload a document via the SDK. Create a Document instance and save it. The document will be uploaded to the Konfuzio Server.
document = Document(file_path=<path_to_the_file>, project=my_project)
document.save()
By default, the document is uploaded with the dataset status “None”. If there is only one category in the project, the document will assume that category. If there is more than one category in the project, the document is uploaded without a category.
You can specify both of these parameters when you upload the document by passing the corresponding code for the dataset status (see the status codes listed under Documents) and the ID of the category.
document = Document(file_path=<path_to_the_file>, project=my_project,
dataset_status=<dataset_status_code>, category_template=<category_id>)
document.save()
Modify Document¶
The dataset status and the category of a document can be modified after the document is uploaded. To change the category, select the desired category from the project by its ID and assign it to the document. To change the dataset status, assign the corresponding status code.
category = my_project.get_category_by_id(<category_id>)
document.category = category
document.dataset_status = 2
document.save()
Delete Document¶
To locally delete a document, you can use:
document.delete()
The document will be deleted from your local data folder but it will remain in the Konfuzio Server. If you want to get it again you can update the project.
Create Regex-based Annotations¶
Let’s see a simple example of how we can use the konfuzio_sdk package to get information on a project and to post annotations.
You can follow the example below to post annotations of a certain word or expression in the first document uploaded.
import re
from konfuzio_sdk.data import Project, Annotation, Label
my_project = Project(id_=YOUR_PROJECT_ID)
# Word/expression to annotate in the document
# should match an existing one in your document
input_expression = "John Smith"
# Label for the annotation
label_name = "Name"
# Creation of the Label in the project default label set
my_label = Label(my_project, text=label_name)
# Saving it online
my_label.save()
# Label Set where label belongs
label_set_id = my_label.label_sets[0].id_
# First document in the project
document = my_project.documents[0]
# Matches of the word/expression in the document
matches_locations = [(m.start(0), m.end(0)) for m in re.finditer(input_expression, document.text)]
# List to save the links to the annotations created
new_annotations_links = []
# Create annotation for each match
for offsets in matches_locations:
    annotation_obj = Annotation(
        document=document,
        document_id=document.id_,
        start_offset=offsets[0],
        end_offset=offsets[1],
        label=my_label,
        label_set_id=label_set_id,
        accuracy=1.0,
    )
    new_annotation_added = annotation_obj.save()
    if new_annotation_added:
        new_annotations_links.append(annotation_obj.get_link())
print(new_annotations_links)
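The matching step above passes the expression straight to re.finditer, so characters such as "." are treated as regex syntax. A minimal sketch, independent of the SDK, that escapes the expression first so it is matched literally. The helper name find_offsets is hypothetical and the sample text is made up for illustration.

```python
import re


def find_offsets(expression, text):
    """Return (start, end) character offsets of every literal match of expression in text."""
    pattern = re.compile(re.escape(expression))
    return [(match.start(), match.end()) for match in pattern.finditer(text)]


sample_text = "John Smith signed on 22.05.2018. Witness: John Smith. Ref 22x05y2018."

name_offsets = find_offsets("John Smith", sample_text)
date_offsets = find_offsets("22.05.2018", sample_text)

# Without escaping, "." matches any character, so "22x05y2018" also matches.
unescaped_date_offsets = [
    (match.start(), match.end()) for match in re.finditer("22.05.2018", sample_text)
]
```

Each (start, end) pair can be used as start_offset and end_offset when creating an Annotation as in the example above.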
Retrain Flair NER-Ontonotes-Fast with Human Revised Annotations¶
The tutorial Retrain Flair NER-Ontonotes-Fast with Human Revised Annotations shows how to use the Konfuzio SDK package to include an easy feedback workflow in your training pipeline. It also gives an example of how you can take advantage of open-source models to speed up the annotation process and use the feedback workflow to adapt the domain knowledge of the model to your needs.
You can download it from here and try it yourself.
Count Relevant Expressions in Annual Reports¶
The tutorial Count Relevant Expressions in Annual Reports shows how to use the Konfuzio SDK package to retrieve structured and organized information that can be used for a deeper analysis and understanding of your data. It shows how to identify and count pre-specified expressions in documents and how to collect that information.
You can download it from here and try it yourself.