Code Examples

Example Usage

Project

Retrieve all information available for your project:

my_project = Project()

The information is stored in the data folder that you defined during the package initialization. A subfolder is created for each document in the project.

Whenever the project changes on the Konfuzio Server, you can update the local project with:

my_project.update()

Documents

To access the documents in the project you can use:

documents = my_project.get_documents_by_status()

By default, it returns the documents without a dataset status (dataset_status = 0, i.e. None). You can request other dataset statuses with the argument ‘dataset_statuses’. The status codes are:

None: 0
Preparation: 1
Training: 2
Test: 3
Low OCR Quality: 4

For example, to get all documents in the project, you can do:

documents = my_project.get_documents_by_status(dataset_statuses=[0, 1, 2, 3, 4])
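The numeric codes listed above can be easier to read when given names. The following is a minimal sketch using only the Python standard library; the class `DatasetStatus` is a hypothetical helper introduced here for illustration and is not part of konfuzio_sdk.

```python
from enum import IntEnum


class DatasetStatus(IntEnum):
    """Hypothetical helper (not part of konfuzio_sdk) naming the status codes listed above."""

    NONE = 0
    PREPARATION = 1
    TRAINING = 2
    TEST = 3
    LOW_OCR_QUALITY = 4


# Plain integers for every status, in the form expected by dataset_statuses.
all_statuses = [int(status) for status in DatasetStatus]

# Only the training and test statuses.
train_and_test = [int(DatasetStatus.TRAINING), int(DatasetStatus.TEST)]

print(all_statuses)
print(train_and_test)
```

A list built this way could then be passed as `dataset_statuses` to `get_documents_by_status()`, exactly like the literal list in the example above.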

The training and test documents can be accessed directly by:

training_documents = my_project.documents
test_documents = my_project.test_documents

By default, you get four files for each document, containing its text, pages, sections and annotations. You can find these files inside the document folder.

document.txt - Contains the text of the document. If OCR was used, it will correspond to the result from the OCR.

                                                            x02   328927/10103/00104
Abrechnung  der Brutto/Netto-Bezüge   für Dezember 2018                   22.05.2018 Bat:  1

Personal-Nr.  Geburtsdatum ski Faktor  Ki,Frbtr.Konfession  ‚Freibetragjährl.! |Freibetrag mt! |DBA  iGleitzone  'St.-Tg.  VJuUr. üb. |Url. Anspr. Url.Tg.gen.  |Resturlaub
00104 150356 1  |     ‚ev                              30     400  3000       3400

SV-Nummer       |Krankenkasse                       KK%®|PGRS Bars  jum.SV-Tg. Anw. Tage |UrlaubTage Krankh. Tg. Fehlz. Tage

50150356B581 AOK  Bayern Die Gesundheitskas 157 101 1111 1 30

                                             Eintritt   ‚Austritt     Anw.Std.  Urlaub Std.  |Krankh. Std. |Fehlz. Std.

                                             170299  L L       l     L     l     l
 -                                       +  Steuer-ID       IMrB?       Zeitlohn Sta.|Überstd.  |Bez. Sta.
  Teststraße123
   12345 Testort                                   12345678911           ı     ı     \
                               B/N
               Pers.-Nr.  00104        x02
               Abt.-Nr. A12         10103          HinweisezurAbrechnung

pages.json5 - Contains information of each page of the document (for example, their ids and sizes).

[
  {
    "id": 1923,
    "image": "/page/show/1923/",
    "number": 1,
    "original_size": [
      595.2,
      841.68
    ],
    "size": [
      1414,
      2000
    ]
  }
]
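As a sketch of how this page information might be used, the snippet below computes the scale factor between "original_size" and "size" for each page. It assumes that "original_size" is the size of the source page and "size" the size of the rendered page image; the source does not state this explicitly. The helper `page_scale` is hypothetical. The sample above happens to be plain JSON, so the standard `json` module can parse it; real .json5 files may need a JSON5 parser.

```python
import json

# The pages.json5 sample shown above, embedded as a string.
pages_json = """
[
  {
    "id": 1923,
    "image": "/page/show/1923/",
    "number": 1,
    "original_size": [595.2, 841.68],
    "size": [1414, 2000]
  }
]
"""

pages = json.loads(pages_json)


def page_scale(page):
    """Return (x_scale, y_scale) that maps original_size onto size."""
    original_width, original_height = page["original_size"]
    width, height = page["size"]
    return width / original_width, height / original_height


for page in pages:
    x_scale, y_scale = page_scale(page)
    print(f"page {page['number']}: x scale {x_scale:.3f}, y scale {y_scale:.3f}")
```

Both factors come out at roughly 2.376 for the sample page, so a coordinate given in the original page size could be multiplied by them to locate it on the image.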

sections.json5 - Contains information of each section in the document (for example, their ids and label sets).

[
  {
    "id": 78730,
    "position": 1,
    "section_label": 63
  },
  {
    "id": 292092,
    "position": 1,
    "section_label": 64
  }
]

annotations.json5 - Contains information of each annotation in the document (for example, their labels and bounding boxes).

[
  {
    "accuracy": null,
    "bbox": {
      "bottom": 44.369,
      "line_index": 1,
      "page_index": 0,
      "top": 35.369,
      "x0": 468.48,
      "x1": 527.04,
      "y0": 797.311,
      "y1": 806.311
    },
    "bboxes": [
      {
        "bottom": 44.369,
        "end_offset": 169,
        "line_number": 2,
        "offset_string": "22.05.2018",
        "offset_string_original": "22.05.2018",
        "page_index": 0,
        "start_offset": 159,
        "top": 35.369,
        "x0": 468.48,
        "x1": 527.04,
        "y0": 797.311,
        "y1": 806.311
      }
    ],
    "created_by": 59,
    "custom_offset_string": false,
    "end_offset": 169,
    "get_created_by": "[email protected]",
    "get_revised_by": "n/a",
    "id": 4419937,
    "is_correct": true,
    "label": 867,
    "label_data_type": "Date",
    "label_text": "Austellungsdatum",
    "label_threshold": 0.1,
    "normalized": "2018-05-22",
    "offset_string": "22.05.2018",
    "offset_string_original": "22.05.2018",
    "revised": false,
    "revised_by": null,
    "section": 78730,
    "section_label_id": 63,
    "section_label_text": "Lohnabrechnung",
    "selection_bbox": {
      "bottom": 44.369,
      "line_index": 1,
      "page_index": 0,
      "top": 35.369,
      "x0": 468.48,
      "x1": 527.04,
      "y0": 797.311,
      "y1": 806.311
    },
    "start_offset": 159,
    "translated_string": null
  },
...
]
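The sample values above suggest how the fields relate to each other. The sketch below checks two relationships that hold for this sample: the offsets span exactly the annotated string, and "top"/"bottom" equal the page height minus "y1"/"y0". That "top" and "bottom" are measured from the top edge of the page while "y0" and "y1" are measured from the bottom edge is an inference from these numbers, not a documented guarantee.

```python
import math

# Height of page 1, taken from "original_size" in the pages.json5 sample.
page_height = 841.68

# Fields copied from the annotation sample above.
annotation = {
    "start_offset": 159,
    "end_offset": 169,
    "offset_string": "22.05.2018",
    "bbox": {
        "top": 35.369,
        "bottom": 44.369,
        "x0": 468.48,
        "x1": 527.04,
        "y0": 797.311,
        "y1": 806.311,
    },
}

bbox = annotation["bbox"]

# The offsets cover exactly as many characters as the annotated string has.
span_length = annotation["end_offset"] - annotation["start_offset"]
offsets_match = span_length == len(annotation["offset_string"])

# top/bottom agree with y1/y0 once the page height is taken into account.
top_matches = math.isclose(page_height - bbox["y1"], bbox["top"], abs_tol=1e-6)
bottom_matches = math.isclose(page_height - bbox["y0"], bbox["bottom"], abs_tol=1e-6)

print(offsets_match, top_matches, bottom_matches)
```

Checks like these can serve as a quick sanity test after downloading a document, before the bounding boxes are used for cropping or drawing.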

Download PDFs

To get the PDFs of the documents, you can use get_file().

for document in my_project.documents:
    document.get_file()

This will download the OCR version of the document, which contains the text, the bounding box information of the characters and the image of the document.

In the document folder, you will see a new file with the original name followed by “_ocr”.

Download pages as images

To get the pages of the document as PNG images, you can use get_images().

for document in my_project.documents:
    document.get_images()

You will get one PNG image named “page_number_of_page.png” for each page in the document.

Download bounding boxes of the characters

To get the bounding box information of the characters, you can use get_bbox().

for document in my_project.documents:
    document.get_bbox()

You will get a file named “bbox.json5”.

After downloading these files, the paths to them will also become available in the project instance. For example, you can get the path to the file with the document text with:

my_project.txt_file_path

Update Document

If the document changes on the Konfuzio Server, you can update the local document with:

document.update()

You can also update a document by updating the entire project via project.update(). However, for projects with many documents it can be faster to update only the relevant documents.

Create Regex-based Annotations

Let’s see a simple example of how we can use the konfuzio_sdk package to get information on a project and to post annotations.

You can follow the example below to post annotations of a certain word or expression in the first document uploaded.

import re

from konfuzio_sdk.data import Project, Annotation, Label

my_project = Project()

# Word/expression to annotate in the document
# should match an existing one in your document
input_expression = "John Smith"

# Label for the annotation
label_name = "Name"
# Creation of the Label in the project default label set
my_label = Label(my_project, text=label_name)
# Saving it online
my_label.save()

# Label Set where label belongs
label_set_id = my_label.label_sets[0].id

# First document in the project
document = my_project.documents[0]

# Matches of the word/expression in the document
matches_locations = [(m.start(0), m.end(0)) for m in re.finditer(input_expression, document.text)]

# List to save the links to the annotations created
new_annotations_links = []

# Create annotation for each match
for offsets in matches_locations:
    annotation_obj = Annotation(
        document=document,
        document_id=document.id,
        start_offset=offsets[0],
        end_offset=offsets[1],
        label=my_label,
        label_set_id=label_set_id,
        accuracy=1.0,
    )
    new_annotation_added = annotation_obj.save()
    if new_annotation_added:
        new_annotations_links.append(annotation_obj.get_link())

print(new_annotations_links)
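The example above passes input_expression to re.finditer as a regular expression, so characters such as "." act as wildcards. If the expression is meant as a literal string, one option is to escape it first. The sketch below uses only the standard library and a made-up text to show the difference; the names `text`, `regex_matches` and `literal_matches` are introduced here for illustration.

```python
import re

# Made-up text standing in for document.text.
text = "A. Smith and AB Smith"
input_expression = "A. Smith"

# Interpreted as a regex: "." matches any character, so "AB Smith" matches too.
regex_matches = [(m.start(0), m.end(0)) for m in re.finditer(input_expression, text)]

# Interpreted literally: re.escape neutralises the regex metacharacters.
literal_pattern = re.compile(re.escape(input_expression))
literal_matches = [(m.start(0), m.end(0)) for m in literal_pattern.finditer(text)]

print(regex_matches)    # [(0, 8), (13, 21)]
print(literal_matches)  # [(0, 8)]
```

Which behaviour is preferable depends on whether the annotations should follow a pattern or an exact wording; the offsets in either list have the same (start, end) form used in the loop above.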

Retrain Flair NER-Ontonotes-Fast with Human Revised Annotations

The tutorial Retrain Flair NER-Ontonotes-Fast with Human Revised Annotations shows you how to use the Konfuzio SDK package to include an easy feedback workflow in your training pipeline. It also gives an example of how you can take advantage of open-source models to speed up the annotation process and use the feedback workflow to adapt the domain knowledge of the model to your aim.

You can open it in Colab or download it and try it yourself.

Count Relevant Expressions in Annual Reports

The tutorial Count Relevant Expressions in Annual Reports shows you how to use the Konfuzio SDK package to retrieve structured and organized information that can be used for a deeper analysis and understanding of your data. It shows you how to identify and count pre-specified expressions in documents and how to collect that information.

You can open it in Colab or download it and try it yourself.