Source Code Reference

API

Connect to the Konfuzio Server to receive or send data.

konfuzio_sdk.api.create_label(project_id: int, label_name: str, label_sets: list, session=<requests.sessions.Session object>, **kwargs) → List[dict]

Create a Label and associate it with label sets.

Parameters
  • project_id – Project ID where to create the label

  • label_name – Name for the label

  • label_sets – Label sets that use the label

  • session – Session to connect to the server

Returns

Label ID in the Konfuzio Server.

konfuzio_sdk.api.create_new_project(project_name, token=None)

Create a new project for the user.

Returns

Response object

konfuzio_sdk.api.delete_document_annotation(document_id: int, annotation_id: int, session=<requests.sessions.Session object>)

Delete a given annotation of the given document.

Parameters
  • document_id – ID of the document

  • annotation_id – ID of the annotation

  • session – Session to connect to the server.

Returns

Response status.

konfuzio_sdk.api.delete_file_konfuzio_api(document_id: int, session=<requests.sessions.Session object>)

Delete Document by ID via Konfuzio API.

Parameters
  • document_id – ID of the document

  • session – Session to connect to the server

Returns

File id in Konfuzio Server.

konfuzio_sdk.api.download_file_konfuzio_api(document_id: int, ocr: bool = True, session=<requests.sessions.Session object>)

Download file from the Konfuzio server using the document id.

Django authentication is form-based, whereas DRF uses BasicAuth.

Parameters
  • document_id – ID of the document

  • ocr – Bool to get the ocr version of the document

  • session – Session to connect to the server

Returns

The downloaded file.

konfuzio_sdk.api.download_images(urls: Optional[List[str]] = None)

Download images by a list of urls.

Parameters

urls – URLs of the images

Returns

Downloaded images.

konfuzio_sdk.api.get_auth_token(username, password)

Generate the authentication token for the user.

Returns

The newly generated token.

konfuzio_sdk.api.get_csrf(session)

Get a new CSRF token from the host.

Parameters

session – Working session

Returns

New CSRF token.

konfuzio_sdk.api.get_document_annotations(document_id, include_extractions=False, session=<requests.sessions.Session object>)

Use the text-extraction server to retrieve human revised annotations.

Parameters
  • document_id – ID of the file

  • include_extractions – Bool to include extractions

  • session – Session to connect to the server

Returns

Sorted annotations.

konfuzio_sdk.api.get_document_details(document_id, session=<requests.sessions.Session object>)

Use the text-extraction server to retrieve the data from a document.

Parameters
  • document_id – ID of the document

  • session – Session to connect to the server

Returns

Data of the document.

konfuzio_sdk.api.get_document_hocr(document_id, session=<requests.sessions.Session object>)

Use the text-extraction server to retrieve the hOCR data.

Parameters
  • document_id – ID of the file

  • session – Session to connect to the server

Returns

hOCR data of the document.

konfuzio_sdk.api.get_document_text(document_id, session=<requests.sessions.Session object>)

Use the text-extraction server to retrieve the text found in the document.

Parameters
  • document_id – ID of the file

  • session – Session to connect to the server

Returns

Document text.

konfuzio_sdk.api.get_meta_of_files(session=<requests.sessions.Session object>) → List[dict]

Get a dictionary of the names of documents previously uploaded to the Konfuzio API.

Dataset_status: NONE = 0, PREPARATION = 1, TRAINING = 2, TEST = 3, LOW_OCR_QUALITY = 4

Parameters

session – Session to connect to the server

Returns

Sorted document names in the format {id: ‘pdf_name’}.
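
The dataset status codes listed above can be mirrored locally so that calling code does not pass bare integers around. This is a minimal sketch: the DatasetStatus enum and the describe helper are hypothetical names for illustration and are not part of konfuzio_sdk.

```python
from enum import IntEnum


class DatasetStatus(IntEnum):
    """Hypothetical local mirror of the dataset status codes listed above."""

    NONE = 0
    PREPARATION = 1
    TRAINING = 2
    TEST = 3
    LOW_OCR_QUALITY = 4


def describe(status_code: int) -> str:
    """Return the symbolic name for a numeric dataset status."""
    return DatasetStatus(status_code).name
```

Because IntEnum members compare equal to plain integers, a value such as DatasetStatus.TRAINING can be used wherever a dataset_status integer is expected.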

konfuzio_sdk.api.get_project_label_sets(session=<requests.sessions.Session object>) → List[dict]

Get Label Sets available in project.

Parameters

session – Session to connect to the server

Returns

Sorted Label Sets.

konfuzio_sdk.api.get_project_labels(session=<requests.sessions.Session object>) → List[dict]

Get Labels available in project.

Parameters

session – Session to connect to the server

Returns

Sorted labels.

konfuzio_sdk.api.get_project_list(token)

Get the list of all projects for the user.

Returns

Response object

konfuzio_sdk.api.get_project_name_from_id(project_id: int) → str

Get the project name given the project_id.

Parameters

project_id – ID of the project

Returns

Name of the project in JSON format.

konfuzio_sdk.api.get_results_from_segmentation(doc_id: int, project_id: int) → List[dict]

Get bbox results from segmentation endpoint.

Parameters
  • doc_id – ID of the document

  • project_id – ID of the project.

konfuzio_sdk.api.is_url(url: str) → bool

Return true if the string is a valid URL.

Parameters

url – String URL

Returns

True if the string is a valid URL.
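
One common way to perform this kind of check uses only the standard library. The sketch below is not the implementation of is_url; looks_like_url is a hypothetical helper, and restricting the check to http and https is an assumption made for illustration.

```python
from urllib.parse import urlparse


def looks_like_url(candidate: str) -> bool:
    """Return True if the string has an http(s) scheme and a network location.

    Hypothetical helper: the rules used by konfuzio_sdk.api.is_url may differ.
    """
    parsed = urlparse(candidate)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```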

konfuzio_sdk.api.is_url_image(image_url)

Check if the URL will return an image.

Parameters

image_url – URL of the image

Returns

Whether the URL returns an image.

konfuzio_sdk.api.konfuzio_session(token=None)

Create a session, including base auth, to the KONFUZIO_HOST.

Returns

Request session.

konfuzio_sdk.api.post_document_annotation(document_id: int, start_offset: int, end_offset: int, label_id: int, label_set_id: int, accuracy: float, revised: bool = False, is_correct: bool = False, annotation_set=None, define_annotation_set=True, session=<requests.sessions.Session object>)

Add an annotation to an existing document.

For the annotation set definition, we can:
  • define the ID of the annotation set to which the annotation should belong (annotation_set=x (int), define_annotation_set=True)

  • pass it as None, in which case a new annotation set is created (annotation_set=None, define_annotation_set=True)

  • omit the annotation set field, in which case a new annotation set is created if none exists, or the annotation is added to the most recently created annotation set (define_annotation_set=False)

Parameters
  • document_id – ID of the file

  • start_offset – Start offset of the annotation

  • end_offset – End offset of the annotation

  • label_id – ID of the label.

  • label_set_id – ID of the label set where the annotation belongs

  • accuracy – Accuracy of the annotation

  • revised – If the annotation is revised or not (bool)

  • is_correct – If the annotation is correct or not (bool)

  • annotation_set – ID of the annotation set to which the annotation belongs, or None

  • define_annotation_set – Whether to define the annotation set (bool)

  • session – Session to connect to the server

Returns

Response status.
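
The three annotation set cases described for post_document_annotation can be sketched as a small helper that decides whether the field is sent at all. This is only an illustration: build_annotation_set_fields is a hypothetical function, and the "annotation_set" key in the request body is an assumption, not a confirmed detail of the Konfuzio API.

```python
from typing import Optional


def build_annotation_set_fields(
    annotation_set: Optional[int] = None, define_annotation_set: bool = True
) -> dict:
    """Return the annotation set part of a request body (assumed layout)."""
    if not define_annotation_set:
        # Third case: leave the field out and let the server pick or create a set.
        return {}
    # First case (an ID) and second case (None, meaning "create a new set")
    # both send the field explicitly.
    return {"annotation_set": annotation_set}
```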

konfuzio_sdk.api.post_document_bulk_annotation(document_id: int, annotation_list, session=<requests.sessions.Session object>)

Add a list of annotations to an existing document.

Parameters
  • document_id – ID of the file

  • annotation_list – List of annotations

  • session – Session to connect to the server

Returns

Response status.

konfuzio_sdk.api.retry_get(session, url)

Workaround to avoid exceptions in case the server does not respond.

Parameters
  • session – Working session

  • url – URL of the endpoint

Returns

Response.
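
The idea behind a retry workaround can be sketched independently of any HTTP library. The code below is an assumption-laden illustration, not the body of retry_get: call_with_retries, flaky_fetch, the number of attempts, and the choice of ConnectionError are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fetch: Callable[[], T], attempts: int = 3, wait_seconds: float = 0.0) -> T:
    """Call fetch() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            # Re-raise on the last attempt, otherwise wait and try again.
            if attempt == attempts - 1:
                raise
            time.sleep(wait_seconds)


# Hypothetical endpoint that fails twice before answering.
calls = {"count": 0}


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("server did not respond")
    return "ok"


result = call_with_retries(flaky_fetch, attempts=5)
```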

konfuzio_sdk.api.update_file_status_konfuzio_api(document_id: int, file_name: str, dataset_status: int = 0, session=<requests.sessions.Session object>, **kwargs)

Update the dataset status of an existing document via Konfuzio API.

Parameters
  • document_id – ID of the document

  • dataset_status – New dataset status

  • session – Session to connect to the server

Returns

Response status.

konfuzio_sdk.api.upload_file_konfuzio_api(filepath: str, project_id: int, session=<requests.sessions.Session object>, dataset_status: int = 0)

Upload file to Konfuzio API.

Parameters
  • filepath – Path to file to be uploaded

  • session – Session to connect to the server

  • project_id – Project ID where to upload the document

Returns

Response status.

CLI

Command Line interface to the konfuzio_sdk package.

konfuzio_sdk.cli.create_project(token=None)

Create a new project.

konfuzio_sdk.cli.data()

Download the data from the example project.

It must be run after the .env and settings.py files have been created.

konfuzio_sdk.cli.init(project_folder='./')

Add settings and .env files to the working directory.

Parameters

project_folder – Root folder of the project

konfuzio_sdk.cli.init_env(project_folder)

Add the .env file to the working directory.

Parameters

project_folder – Root folder of the project where to place the .env file

Returns

file content

konfuzio_sdk.cli.init_settings(project_folder)

Add settings file to the working directory.

Parameters

project_folder – root folder of the project where to place the settings file

konfuzio_sdk.cli.main()

CLI of Konfuzio SDK.

konfuzio_sdk.cli.verify_data_folder(project_folder, data_folder)

Verify if data folder is empty.

If it is not empty, ask the user for a new folder name or whether to use the same one.

Parameters
  • project_folder – Root folder of the project

  • data_folder – Name of the data folder

Returns

Final name for the data folder

Data

Handle data from the API.

Data Class

class konfuzio_sdk.data.Data

Collect general functionality to work with data from API.

Annotation Set Class

class konfuzio_sdk.data.AnnotationSet(id, document, label_set, annotations, **kwargs)

Represent an Annotation Set - group of annotations.

Label Set Class

class konfuzio_sdk.data.LabelSet(project, id: int, name: str, name_clean: str, labels: List[int], is_default=False, categories: List[konfuzio_sdk.data.Category] = [], has_multiple_annotation_sets=False, **kwargs)

A Label Set is a group of labels.

add_label(label)

Add label to Label Set, if it does not exist.

Parameters

label – Label ID to be added

Label Class

class konfuzio_sdk.data.Label(project, id: Optional[int] = None, text: Optional[str] = None, get_data_type_display: Optional[str] = None, text_clean: Optional[str] = None, description: Optional[str] = None, label_sets: List[konfuzio_sdk.data.LabelSet] = [], has_multiple_top_candidates: bool = False, *initial_data, **kwargs)

A label is the name of a group of individual pieces of information annotated in a type of document.

add_label_set(label_set: konfuzio_sdk.data.LabelSet)

Add label set to label, if it does not exist.

Parameters

label_set – Label set to add

property annotations

Add annotation to label.

Returns

Annotations

property correct_annotations

Return correct annotations.

property documents: List[konfuzio_sdk.data.Document]

Return all documents which contain annotations of this label.

property label_sets

Get the label sets in which this label is used.

save() → bool

Save Label online.

If no label sets are specified, the label is associated with the first default label set of the project.

Returns

True if the new label was created.

Annotation Class

class konfuzio_sdk.data.Annotation(start_offset: int, end_offset: int, label=None, is_correct: bool = False, revised: bool = False, id: Optional[int] = None, accuracy: Optional[float] = None, document=None, annotation_set=None, label_set_text=None, translated_string=None, label_set_id=None, *initial_data, **kwargs)

An annotation is a set of characters and/or bounding boxes to which a label has been assigned.

One annotation can span multiple characters, words, lines, or areas.

delete() → None

Delete Annotation online.

get_link()

Get link to the annotation in the SmartView.

property is_online: Optional[int]

Define if the Annotation is saved to the server.

property offset_string: str

View the string representation of the Annotation.

save(document_annotations: Optional[list] = None) → bool

Save Annotation online.

If there is already an annotation in the same place as the current one, we will not be able to save the current annotation.

In that case, we get the id of the original one to be able to track it. The verification of the duplicates is done by checking if the offsets and label match with any annotations online. To be sure that we are comparing with the information online, we need to have the document updated. The update can be done after the request (per annotation) or the updated annotations can be passed as input of the function (advisable when dealing with big documents or documents with many annotations).

Parameters

document_annotations – Annotations in the document (list)

Returns

True if new Annotation was created
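
The duplicate check described above (matching offsets and label against the annotations already online) can be sketched as follows. SimpleAnnotation and find_duplicate are hypothetical stand-ins for illustration; they are not the SDK's Annotation class or its save logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimpleAnnotation:
    """Hypothetical stand-in: offsets, label name, and an optional server ID."""

    start_offset: int
    end_offset: int
    label: str
    id: Optional[int] = None


def find_duplicate(
    candidate: SimpleAnnotation, online_annotations: List[SimpleAnnotation]
) -> Optional[SimpleAnnotation]:
    """Return the online annotation with the same offsets and label, if any."""
    key = (candidate.start_offset, candidate.end_offset, candidate.label)
    for existing in online_annotations:
        if (existing.start_offset, existing.end_offset, existing.label) == key:
            return existing
    return None


online = [SimpleAnnotation(0, 5, "Name", id=101)]
match = find_duplicate(SimpleAnnotation(0, 5, "Name"), online)
no_match = find_duplicate(SimpleAnnotation(6, 9, "Name"), online)
```

When a match is found, its ID can be copied onto the local annotation so that the original can be tracked, as the description above suggests.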

Document Class

class konfuzio_sdk.data.Document(id: Optional[int] = None, file_path: Optional[str] = None, file_url: Optional[str] = None, status=None, data_file_name: Optional[str] = None, project=None, is_dataset: Optional[bool] = None, dataset_status: Optional[int] = None, updated_at: Optional[datetime.tzinfo] = None, bbox: Optional[Dict] = None, number_of_pages: Optional[int] = None, *initial_data, **kwargs)

Access the information about one document, which is available online.

add_annotation(annotation, check_duplicate=True)

Add an annotation to a document.

If check_duplicate is True, we only add an annotation after checking that it doesn’t already exist in the document. If check_duplicate is False, we add the annotation without checking, which is considerably faster when the number of annotations in the document is large.

Parameters
  • annotation – Annotation to add in the document

  • check_duplicate – Whether to check if the annotation already exists in the document

Returns

Input annotation.

annotation_class

alias of konfuzio_sdk.data.Annotation

annotations(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = True, start_offset: Optional[int] = None, end_offset: Optional[int] = None) → List

Filter available annotations. Exclude custom_offset_string annotations.

You can specify an offset range of the document to filter the annotations by.

Parameters
  • label – Label for which to filter the annotations

  • use_correct – Whether to filter by correct annotations

  • start_offset – Starting of the offset string (int)

  • end_offset – Ending of the offset string (int)

Returns

Annotations in the document.
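
The filtering by label, correctness, and offset range can be pictured with plain tuples. This is a sketch under stated assumptions: filter_spans is a hypothetical helper, and requiring an annotation to lie fully inside the offset range is an assumption, since the exact rule used by Document.annotations is not stated here.

```python
from typing import List, Optional, Tuple

# (start_offset, end_offset, label_name, is_correct)
Span = Tuple[int, int, str, bool]


def filter_spans(
    spans: List[Span],
    label: Optional[str] = None,
    use_correct: bool = True,
    start_offset: Optional[int] = None,
    end_offset: Optional[int] = None,
) -> List[Span]:
    """Keep the spans that pass every filter that was actually supplied."""
    selected = []
    for span in spans:
        start, end, span_label, is_correct = span
        if label is not None and span_label != label:
            continue
        if use_correct and not is_correct:
            continue
        # Assumed rule: the span must lie completely inside the requested range.
        if start_offset is not None and start < start_offset:
            continue
        if end_offset is not None and end > end_offset:
            continue
        selected.append(span)
    return selected


spans = [(0, 5, "Name", True), (10, 20, "Date", True), (30, 35, "Name", False)]
correct_names = filter_spans(spans, label="Name")
in_range = filter_spans(spans, use_correct=False, start_offset=10, end_offset=35)
```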

delete()

Delete all local information for the document.

get_bbox()

Get bbox information per character of file.

There are two ways to access it:
  • If the bbox attribute is set when creating the Document, it is returned immediately.

  • Otherwise, we open the file at bbox_file_path and return its content.

In the second case, we do not store bbox as an attribute on Document because with many big documents this quickly fills the available memory. So it is first written to a file by get_document_details and then retrieved from that file when accessing it.

Returns

Bounding box information per character in the document.
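
The two access paths and the memory trade-off described for get_bbox can be sketched with a small class. CachedBBoxDocument is a hypothetical stand-in, and storing the cache as JSON is an assumption made for illustration; the SDK's actual file format is not stated here.

```python
import json
import os
import tempfile
from typing import Dict, Optional


class CachedBBoxDocument:
    """Hypothetical stand-in showing the two access paths described above."""

    def __init__(self, bbox_file_path: str, bbox: Optional[Dict] = None):
        self.bbox_file_path = bbox_file_path
        self.bbox = bbox

    def get_bbox(self) -> Dict:
        if self.bbox is not None:
            # First path: the attribute was set at creation time.
            return self.bbox
        # Second path: read from disk on every call and do not keep the result
        # on the instance, so many large documents do not fill the memory.
        with open(self.bbox_file_path, encoding="utf-8") as handle:
            return json.load(handle)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "bbox.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"0": {"x0": 1, "y0": 2}}, handle)
    from_file = CachedBBoxDocument(path).get_bbox()
    from_memory = CachedBBoxDocument(path, bbox={"0": {}}).get_bbox()
```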

get_document_details(update)

Get data from a document.

Parameters

update – Update the downloaded information even if it is already available

get_file(update: bool = False)

Get OCR version of the original file.

Parameters

update – Update the downloaded file even if it is already available

Returns

Path to OCR file.

get_images(update: bool = False)

Get document pages as PNG images.

Parameters

update – Update the downloaded images even if they are already available

Returns

Path to OCR file.

get_text_in_bio_scheme() → List[Tuple[str, str]]

Get the text of the document in the BIO scheme.

Returns

list of tuples with each word in the text and the respective label

property is_online: Optional[int]

Define if the Document is saved to the server.

property is_without_errors: bool

Check if the document can be used for training clf.

offset(start_offset: int, end_offset: int) → List[konfuzio_sdk.data.Annotation]

Convert an offset to a list of annotations.

Parameters
  • start_offset – Starting of the offset string (int)

  • end_offset – Ending of the offset string (int)

Returns

annotations

property root

Get the path to the folder where all the document information is cached locally.

save() → bool

Save or update Document online.

Returns

True if the new document was created or existing document was updated.

update()

Update document information.

Project Class

class konfuzio_sdk.data.Project(id: int = -1, offline=False, data_root=False, **kwargs)

Access the information of a project.

add_category(category: konfuzio_sdk.data.Category)

Add category to project, if it does not exist.

Parameters

category – Category to add in the project

add_document(document)

Add document to project, if it does not exist.

Parameters

document – Document to add in the project

add_label(label: konfuzio_sdk.data.Label)

Add label to project, if it does not exist.

Parameters

label – Label to add in the project

add_label_set(label_set: konfuzio_sdk.data.LabelSet)

Add label set to project, if it does not exist.

Parameters

label_set – Label Set to add in the project

annotation_class

alias of konfuzio_sdk.data.Annotation

annotation_set_class

alias of konfuzio_sdk.data.AnnotationSet

category_class

alias of konfuzio_sdk.data.Category

clean_documents(update)

Clean the documents by removing those that have been removed from the App.

This is done only if the project is being updated locally.

Parameters

update – Bool to update locally the documents in the project

clean_meta()

Clean the meta-information about the Project, Labels, and Label Sets.

document_class

alias of konfuzio_sdk.data.Document

get(update=False)

Access meta information of the project.

Parameters

update – Update the downloaded information even if it is already available

get_categories(update=False)

Get Categories in the project.

Parameters

update – Update the downloaded information even if it is already available

Returns

Categories in the project.

get_category_by_id(id: int) → konfuzio_sdk.data.Category

Return a Category by ID.

Parameters

id – ID of the Category to get.

get_documents(update=False)

Get all documents in a project which have been marked as available in the training dataset.

Dataset status: training = 2

Parameters

update – Bool to update the meta-information from the project

Returns

training documents

get_documents_by_status(dataset_statuses: List[int] = [0], document_list_cache: List[konfuzio_sdk.data.Document] = [], update: bool = False) → List[konfuzio_sdk.data.Document]

Get a list of documents with the specified dataset status from the project.

Besides returning a list, the documents are also initialized in the project. They become accessible from the attributes of the class: self.test_documents, self.none_documents,…

Parameters
  • dataset_statuses – List of status of the documents to get

  • document_list_cache – Cache with documents in the project

  • update – Bool to update the meta-information from the project

Returns

Documents with the specified dataset status
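
Selecting documents by dataset status amounts to a membership test on the status code. The sketch below uses plain dictionaries rather than Document objects; select_by_status and the sample records are hypothetical.

```python
from typing import Dict, List


def select_by_status(documents: List[Dict], dataset_statuses: List[int]) -> List[Dict]:
    """Return the documents whose dataset_status is one of the requested codes."""
    wanted = set(dataset_statuses)
    return [document for document in documents if document["dataset_status"] in wanted]


# Hypothetical metadata records: 2 = training, 3 = test, 0 = none.
documents = [
    {"id": 1, "dataset_status": 2},
    {"id": 2, "dataset_status": 3},
    {"id": 3, "dataset_status": 0},
]
training_and_test = select_by_status(documents, [2, 3])
```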

get_label_by_id(id: int) → konfuzio_sdk.data.Label

Return a label by ID.

Parameters

id – ID of the label to get.

get_label_set_by_id(id: int) → konfuzio_sdk.data.LabelSet

Return a Label Set by ID.

Parameters

id – ID of the Label Set to get.

get_label_sets(update=False)

Get Label Sets in the project.

Parameters

update – Update the downloaded information even if it is already available

Returns

Label Sets in the project.

get_labels(update=False)

Get ID and name of any label in the project.

Parameters

update – Update the downloaded information even if it is already available

Returns

Labels in the project.

get_meta(update=False)

Get the list of all documents in the project and their information.

Parameters

update – Update the downloaded information even if it is already available

Returns

Information of the documents in the project.

get_test_documents(update=False)

Get all documents in a project which have been marked as available in the test dataset.

Dataset status: test = 3

Parameters

update – Bool to update the meta-information from the project

Returns

test documents

label_class

alias of konfuzio_sdk.data.Label

label_set_class

alias of konfuzio_sdk.data.LabelSet

load_annotation_sets()

Load document annotation sets for all training and test documents.

load_categories()

Load categories for all label sets in the project.

make_paths()

Create paths needed to store the project.

update()

Update the project and all documents by downloading them from the host. Note: this does not work offline.

update_document(document)

Update document in the project.

The update can affect the dataset_status, name, or category. First, we need to find the document (which list it is in depends on its dataset_status). Then, if we are only updating the name or category, we can change the fields in place. If we are updating the document's dataset status, we need to move the document out of the project list it is currently in.

Parameters

document – Document to update in the project

URLs

Endpoints of the Konfuzio Host.

konfuzio_sdk.urls.create_new_project_url() → str

Generate URL to create a new project.

Returns

URL to create a new project.

konfuzio_sdk.urls.delete_project_api_document_annotations_url(document_id: int, annotation_id: int) → str

Delete the annotation of a document.

Parameters
  • document_id – ID of the document as integer

  • annotation_id – ID of the annotation as integer

Returns

URL to delete annotation of a document

konfuzio_sdk.urls.get_auth_token_url() → str

Generate URL that creates an authentication token for the user.

Returns

URL to generate the token.

konfuzio_sdk.urls.get_create_label_url() → str

Generate URL to create a label.

Returns

URL to create a label.

konfuzio_sdk.urls.get_document_api_details_url(document_id: int, include_extractions: bool = False, extra_fields='bbox') → str

Generate URL to access document details of one document in a project.

Parameters
  • document_id – ID of the document as integer

  • include_extractions – Bool to include extractions

  • extra_fields – Extra information to include in the response

Returns

URL to get document details

konfuzio_sdk.urls.get_document_ocr_file_url(document_id: int) → str

Generate URL to access OCR version of document.

Parameters

document_id – ID of the document as integer

Returns

URL to get OCR document file.

konfuzio_sdk.urls.get_document_original_file_url(document_id: int) → str

Generate URL to access original version of the document.

Parameters

document_id – ID of the document as integer

Returns

URL to get the original document

konfuzio_sdk.urls.get_document_result_v1(document_id: int) → str

Generate URL to access web interface for labeling of this project.

Parameters

document_id – ID of the document as integer

Returns

URL for labeling of the project.

konfuzio_sdk.urls.get_document_segmentation_details_url(document_id: int, project_id, action='segmentation') → str

Generate URL to get the segmentation results of a document.

Parameters
  • document_id – ID of the document as integer

  • project_id – ID of the project

  • action – Action from where to get the results

Returns

URL to access the segmentation results of a document

konfuzio_sdk.urls.get_documents_meta_url() → str

Generate URL to load meta information about documents.

Returns

URL to get all the documents details.

konfuzio_sdk.urls.get_project_list_url() → str

Generate URL to load all the projects available for the user.

Returns

URL to get all the projects for the user.

konfuzio_sdk.urls.get_project_url(project_id=None) → str

Generate URL to get project details.

Parameters

project_id – ID of the project

Returns

URL to get project details.

konfuzio_sdk.urls.get_upload_document_url() → str

Generate URL to upload a document.

Returns

URL to upload a document

konfuzio_sdk.urls.post_project_api_document_annotations_url(document_id: int) → str

Add new annotations to a document.

Parameters

document_id – ID of the document as integer

Returns

URL for adding annotations to a document

konfuzio_sdk.urls.update_document_url(document_id: int) → str

Generate URL to update a document.

Returns

URL to update a document

Utils

Utils for the konfuzio sdk package.

konfuzio_sdk.utils.convert_to_bio_scheme(text: str, annotations: List) → List[Tuple[str, str]]

Mark all the entities in the text as per the BIO scheme.

The text is split into a sequence of words, with some characters such as “.” treated as separate tokens.

    Hello O
    , O
    it O
    's O
    Konfuzio B-ORG
    . O

The start and end offsets are measured from the beginning of the input text. If only part of the text of the document is passed, the start and end offsets of the annotations must be adapted first.

Parameters
  • text – text to be annotated in the bio scheme

  • annotations – annotations in the document with start and end offset and label name

Returns

list of tuples with each word in the text and the respective label
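
BIO tagging itself can be illustrated with a short, self-contained sketch. It is not the implementation of convert_to_bio_scheme: to_bio is a hypothetical function, the regular-expression tokenizer is an assumption (it splits words and single punctuation marks), and annotations are represented as (start_offset, end_offset, label) tuples with an exclusive end offset.

```python
import re
from typing import List, Tuple


def to_bio(text: str, annotations: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Tag each token with B-<label>, I-<label>, or O."""
    tagged = []
    previous = None  # annotation that covered the previous token, if any
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        current = None
        for annotation in annotations:
            start, end, label = annotation
            # A token belongs to an annotation if it lies fully inside it.
            if start <= match.start() and match.end() <= end:
                current = annotation
                prefix = "I" if annotation == previous else "B"
                tag = f"{prefix}-{label}"
                break
        previous = current
        tagged.append((match.group(), tag))
    return tagged


# Offsets are counted from the beginning of the text: "Konfuzio GmbH" is 15..28.
bio = to_bio("Hello, this is Konfuzio GmbH.", [(15, 28, "ORG")])
```

The first token inside an annotation receives the B- prefix and every following token of the same annotation receives I-, which is what distinguishes BIO from a plain per-token label.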

konfuzio_sdk.utils.does_not_raise()

Serve as a complement to raise: a no-op context manager.

docs.pytest.org/en/latest/example/parametrize.html#parametrizing-conditional-raising
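
The recipe referenced above is a context manager that does nothing, so that a parametrized test can use the same with statement for cases that should raise and cases that should not. The sketch below is a local re-creation using contextlib; safe_divide is a hypothetical function used only to have something to call.

```python
from contextlib import contextmanager


@contextmanager
def does_not_raise():
    """No-op context manager: the body runs and any exception propagates as usual."""
    yield


def safe_divide(numerator: float, denominator: float) -> float:
    """Hypothetical function under test."""
    return numerator / denominator


# In a parametrized test, this slot would hold either does_not_raise()
# or an expectation such as pytest.raises(ZeroDivisionError).
with does_not_raise():
    quotient = safe_divide(6, 3)
```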

konfuzio_sdk.utils.get_file_type(input_file: Optional[Union[str, _io.BytesIO, bytes]] = None) → str

Get the type of a file via the filetype library, which checks the magic bytes to see the internal format.

Parameters

input_file – Path to the file or file in bytes format

Returns

Name of file type
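
Checking magic bytes means comparing the first bytes of a file with known signatures instead of trusting the file extension. The sketch below shows the principle with three well-known signatures; sniff_file_type is a hypothetical helper and does not use or reproduce the filetype library.

```python
def sniff_file_type(data: bytes) -> str:
    """Return a coarse file type name based on the leading bytes."""
    signatures = {
        b"%PDF-": "pdf",
        b"\x89PNG\r\n\x1a\n": "png",
        b"\xff\xd8\xff": "jpeg",
    }
    for magic, name in signatures.items():
        if data.startswith(magic):
            return name
    return "unknown"
```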

konfuzio_sdk.utils.get_id(a_string, include_time: bool = False) → int

Generate a unique ID.

Parameters
  • a_string – String used to generate the unique ID

  • include_time – Bool to include the time in the unique ID

Returns

Unique ID

konfuzio_sdk.utils.get_timestamp(format='%Y-%m-%d-%H-%M-%S') → str

Return formatted timestamp.

Parameters

format – Format of the timestamp (e.g. year-month-day-hour-min-sec)

Returns

Timestamp
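
The default format string follows the standard strftime directives. As a small illustration, the sketch below applies the same format to a fixed datetime; format_timestamp is a hypothetical helper that takes the moment as an argument so the result is reproducible.

```python
from datetime import datetime


def format_timestamp(moment: datetime, format: str = "%Y-%m-%d-%H-%M-%S") -> str:
    """Format a datetime as year-month-day-hour-minute-second by default."""
    return moment.strftime(format)


example = format_timestamp(datetime(2021, 3, 5, 14, 7, 9))
```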

konfuzio_sdk.utils.is_file(file_path, raise_exception=True, maximum_size=100000000, allow_empty=False) → bool

Check if file is available or raise error if it does not exist.

Parameters
  • file_path – Path to the file to be checked

  • raise_exception – Will raise an exception if file is not available

  • maximum_size – Maximum size of the expected file; by default, less than 100 MB

  • allow_empty – Bool to allow empty files

Returns

True or false depending on the existence of the file
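
The checks implied by these parameters (existence, emptiness, maximum size) can be sketched with the standard library. This is not the implementation of is_file: check_file is a hypothetical helper, the size is assumed to be measured in bytes, and the exception types are assumptions.

```python
import os
import tempfile


def check_file(file_path, raise_exception=True, maximum_size=100_000_000, allow_empty=False) -> bool:
    """Return True if the file exists and passes the size checks."""
    problem = None
    if not os.path.isfile(file_path):
        problem = FileNotFoundError(f"File does not exist: {file_path}")
    else:
        size = os.path.getsize(file_path)
        if size == 0 and not allow_empty:
            problem = OSError(f"File is empty: {file_path}")
        elif size > maximum_size:
            problem = OSError(f"File is larger than {maximum_size} bytes: {file_path}")
    if problem is None:
        return True
    if raise_exception:
        raise problem
    return False


with tempfile.TemporaryDirectory() as folder:
    empty_path = os.path.join(folder, "empty.txt")
    open(empty_path, "w").close()
    empty_rejected = check_file(empty_path, raise_exception=False)
    empty_allowed = check_file(empty_path, allow_empty=True)
    missing = check_file(os.path.join(folder, "missing.txt"), raise_exception=False)
```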

konfuzio_sdk.utils.load_image(input_file: Union[str, _io.BytesIO])

Load an image by path or via io.BytesIO, e.g. via download by URL.

Parameters

input_file – Path to image or image in bytes format

Returns

Loaded image