Data

Handle data from the API.

Span

class konfuzio_sdk.data.Span(start_offset: int, end_offset: int, annotation=None)

A Span is a sequence of characters or whitespaces without a line break.

bbox() konfuzio_sdk.data.Bbox

Calculate the bounding box of a text sequence.

eval_dict()

Return any information needed to evaluate the Span.

property line_index: int

Return index of the line of the Span.

property normalized

Normalize the offset string.

property offset_string: Optional[str]

Calculate the offset string of a Span.

property page: konfuzio_sdk.data.Page

Return Page of Span.

regex()

Suggest a Regex for the offset string.
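
Example (a plain-Python sketch, not the SDK class itself, of how a Span's start_offset and end_offset map into a Document's text; the sample text is hypothetical):

```python
# A Span covers text[start_offset:end_offset] within a single line.
text = "Name: Alice\nDate: 12.12.1980"

def offset_string(text: str, start: int, end: int) -> str:
    """Characters the Span covers: text[start:end]."""
    return text[start:end]

def line_index(text: str, start: int) -> int:
    """Zero-based index of the line the Span starts on."""
    return text.count("\n", 0, start)

offset_string(text, 6, 11)  # -> "Alice"
line_index(text, 18)        # -> 1 (the second line)
```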

Annotation

class konfuzio_sdk.data.Annotation(document: konfuzio_sdk.data.Document, annotation_set_id: Optional[int] = None, annotation_set: Optional[konfuzio_sdk.data.AnnotationSet] = None, label: Optional[Union[int, konfuzio_sdk.data.Label]] = None, label_set_id: Optional[int] = None, label_set: Union[None, konfuzio_sdk.data.LabelSet] = None, is_correct: bool = False, revised: bool = False, normalized=None, id_: Optional[int] = None, spans=None, accuracy: Optional[float] = None, confidence: Optional[float] = None, created_by: Optional[int] = None, revised_by: Optional[int] = None, translated_string: Optional[str] = None, custom_offset_string: bool = False, offset_string: str = False, *args, **kwargs)

Hold the information that a Label, a Label Set and an Annotation Set have been assigned to one or more Spans.

add_span(span: konfuzio_sdk.data.Span)

Add a Span to an Annotation, including a duplicate check per Annotation.

delete() None

Delete Annotation online.

property end_offset: int

Legacy: One Annotation can have multiple end offsets.

property eval_dict: List[dict]

Calculate the Span information to evaluate the Annotation.

get_link()

Get link to the Annotation in the SmartView.

property is_multiline: int

Calculate if Annotation spans multiple lines of text.

property normalize: str

Provide one normalized offset string (legacy behavior).

property offset_string: List[str]

View the string representation of the Annotation.

regex()

Return regex of this annotation.

regex_annotation_generator(regex_list) List[konfuzio_sdk.data.Span]

Build Spans without Labels by regexes.

Returns

Sorted list of Spans by start_offset.

save(document_annotations: Optional[list] = None) bool

Save Annotation online.

If there is already an Annotation in the same place as the current one, we will not be able to save the current annotation.

In that case, we get the id_ of the original one to be able to track it. The verification of the duplicates is done by checking if the offsets and Label match with any Annotations online. To be sure that we are comparing with the information online, we need to have the Document updated. The update can be done after the request (per annotation) or the updated Annotations can be passed as input of the function (advisable when dealing with big Documents or Documents with many Annotations).

Parameters

document_annotations – Annotations in the Document (list)

Returns

True if a new Annotation was created

property spans: List[konfuzio_sdk.data.Span]

Return default entry to get all Spans of the Annotation.

property start_offset: int

Legacy: One Annotation can have multiple start offsets.

token_append(new_regex, regex_quality: int)

Append token if it is not a duplicate.

tokens() List[str]

Create a list of potential tokens based on Spans of this Annotation.
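
Example (a sketch of creating and saving a single-Span Annotation from the classes above; the helper name and the assumption that the Label's first Label Set applies are illustrative, not part of the SDK, and the import is deferred because the SDK requires configured credentials):

```python
def annotate(document, label, start: int, end: int) -> bool:
    """Create one single-Span Annotation on a Document and save it online."""
    from konfuzio_sdk.data import Annotation, Span  # deferred: needs credentials

    span = Span(start_offset=start, end_offset=end)
    annotation = Annotation(
        document=document,
        label=label,
        label_set=label.label_sets[0],  # assumption: use the Label's first Label Set
        spans=[span],
        is_correct=True,
    )
    return annotation.save()  # True if a new Annotation was created
```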

Annotation Set

class konfuzio_sdk.data.AnnotationSet(document, label_set: konfuzio_sdk.data.LabelSet, id_: Optional[int] = None, **kwargs)

An Annotation Set is a group of Annotations. The Labels of those Annotations refer to the same Label Set.

property annotations

All Annotations currently in this Annotation Set.

property end_offset

Calculate the end based on all Annotations currently in this Annotation Set.

lose_weight()

Delete data of the instance.

property start_line_index

Calculate starting line of this Annotation Set.

property start_offset

Calculate the earliest start based on all Annotations currently in this Annotation Set.

Label

class konfuzio_sdk.data.Label(project, id_: Optional[int] = None, text: Optional[str] = None, get_data_type_display: str = 'Text', text_clean: Optional[str] = None, description: Optional[str] = None, label_sets=None, has_multiple_top_candidates: bool = False, threshold: float = 0.0, *initial_data, **kwargs)

Group Annotations across Label Sets.

add_label_set(label_set: konfuzio_sdk.data.LabelSet)

Add Label Set to label, if it does not exist.

Parameters

label_set – Label set to add

annotations(categories: List[konfuzio_sdk.data.Category], use_correct=True)

Return related Annotations. Consider that one Label can be used across Label Sets in multiple Categories.

combined_tokens(categories: List[konfuzio_sdk.data.Category])

Create one OR Regex for all relevant Annotations tokens.

evaluate_regex(regex, category: konfuzio_sdk.data.Category, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None, filtered_group=None, regex_quality=0)

Evaluate a regex on Categories.

Type of regex allows you to group regex by generality

Example:

Three Annotations about a birthdate in two Documents, and one regex to be evaluated:

  • 1.doc: “I was born at the 12th of December 1980, you could also say 12.12.1980.” (2 Annotations)

  • 2.doc: “I was born at 12.06.1997.” (1 Annotation)

  • regex: dd.dd.dddd (without escaped characters, for easier reading)

stats:

  • total_correct_findings: 2

  • correct_label_annotations: 3

  • total_findings: 2 –> precision 100 %

  • num_docs_matched: 2

  • Project.documents: 2 –> Document recall 100 %
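
The arithmetic behind these stats, restated in plain Python (the per-Annotation recall of 2/3 is an extra derived figure, not one of the stats above):

```python
# Counts taken from the example itself.
total_correct_findings = 2     # matches that coincide with a correct Annotation
correct_label_annotations = 3  # correct Annotations carrying the Label
total_findings = 2             # all matches of the regex
num_docs_matched = 2           # Documents with at least one match
num_documents = 2              # Project.documents in the example

precision = total_correct_findings / total_findings                     # 1.0 -> 100 %
document_recall = num_docs_matched / num_documents                      # 1.0 -> 100 %
annotation_recall = total_correct_findings / correct_label_annotations  # 2/3
```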

find_regex(category: konfuzio_sdk.data.Category) List[str]

Find the best combination of regex in the list of all regex proposed by Annotations.

find_tokens(category: konfuzio_sdk.data.Category) List

Calculate the regex token of a label, which matches all offset_strings of all correct Annotations.

regex(categories: List[konfuzio_sdk.data.Category], update=False) List

Calculate regex to be used in the LabelExtractionModel.

tokens(categories: List[konfuzio_sdk.data.Category], update=False) dict

Calculate tokens to be used in the regex of the Label.

Label Set

class konfuzio_sdk.data.LabelSet(project, labels=None, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, is_default=False, categories=None, has_multiple_annotation_sets=False, **kwargs)

A Label Set is a group of Labels.

add_category(category: konfuzio_sdk.data.Category)

Add Category to Project, if it does not exist.

Parameters

category – Category to add in the Project

add_label(label)

Add Label to Label Set, if it does not exist.

Parameters

label – Label to be added

Category

class konfuzio_sdk.data.Category(project, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, *args, **kwargs)

Group Documents in a Project.

add_label_set(label_set)

Add Label Set to Category.

documents()

Filter for Documents of this Category.

property labels

Return the Labels that belong to the Category and its Label Sets.

test_documents()

Filter for test Documents of this Category.

Document

class konfuzio_sdk.data.Document(project: konfuzio_sdk.data.Project, id_: Optional[int] = None, file_url: Optional[str] = None, status=None, data_file_name: Optional[str] = None, is_dataset: Optional[bool] = None, dataset_status: Optional[int] = None, updated_at: Optional[str] = None, assignee: Optional[int] = None, category_template: Optional[int] = None, category: Optional[konfuzio_sdk.data.Category] = None, text: Optional[str] = None, bbox: Optional[dict] = None, pages: Optional[list] = None, update: Optional[bool] = None, copy_of_id: Optional[int] = None, *args, **kwargs)

Access the information about one document, which is available online.

add_annotation(annotation: konfuzio_sdk.data.Annotation)

Add an annotation to a document.

Parameters

annotation – Annotation to add in the document

Returns

Input annotation.

add_annotation_set(annotation_set: konfuzio_sdk.data.AnnotationSet)

Add an Annotation Set to the Document.

add_page(page: konfuzio_sdk.data.Page)

Add a Page to a Document.

annotation_sets()

Return the Annotation Sets of the Document.

annotations(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = True, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Annotation]

Filter available annotations.

Parameters
  • label – Label for which to filter the Annotations.

  • use_correct – If to filter by correct annotations.

Returns

Annotations in the document.

property bboxes: Dict[int, konfuzio_sdk.data.Bbox]

Use the cached bbox version.

check_annotations(update_document: bool = False) bool

Check if Annotations are valid - no duplicates and correct Category.

check_bbox() bool

See get_bbox() of the Document.

delete()

Delete all local information for the document.

property document_folder

Get the path to the folder where all the Document information is cached locally.

download_document_details()

Retrieve data from a Document online once the Document has finished processing.

eval_dict(use_correct=False) List[dict]

Use this dict to evaluate Documents; one entry is created for every Span of an Annotation.

evaluate_regex(regex, label: konfuzio_sdk.data.Label, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None, filtered_group=None)

Evaluate a regex based on the Document.

property file_path

Return path to file.

get_annotation_set_by_id(id_: int) konfuzio_sdk.data.AnnotationSet

Return an Annotation Set by ID.

Parameters

id_ – ID of the Annotation Set to get.

get_annotations() List[konfuzio_sdk.data.Annotation]

Get Annotations of the Document.

get_bbox() Dict

Get bbox information per character of file. We don’t store bbox as an attribute to save memory.

Returns

Bounding box information per character in the document.

get_file(ocr_version: bool = True, update: bool = False)

Get OCR version of the original file.

Parameters
  • ocr_version – Whether to get the OCR version of the original file

  • update – Update the downloaded file even if it is already available

Returns

Path to the selected file.

get_images(update: bool = False)

Get Document Pages as PNG images.

Parameters

update – Update the downloaded images even if they are already available

Returns

Path to PNG files.

get_page_by_index(page_index: int)

Return the Page by index.

get_text_in_bio_scheme(update=False) List[Tuple[str, str]]

Get the text of the Document in the BIO scheme.

Parameters

update – Update the BIO annotations even if they are already available

Returns

list of tuples with each word in the text and the respective label
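
Example (a plain-Python illustration of the BIO scheme: “B-<Label>” begins an Annotation, “I-<Label>” continues it, and “O” marks unlabeled words; the words and the “Name” Label here are hypothetical, not taken from a real Document):

```python
# A possible return value of get_text_in_bio_scheme() for a short text.
bio = [("Born:", "O"), ("Alice", "B-Name"), ("Smith", "I-Name"), ("1980", "O")]

# Words covered by an Annotation carry a non-"O" tag.
labeled_words = [word for word, tag in bio if tag != "O"]  # ["Alice", "Smith"]
```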

property hocr

Get HOCR of Document. Once loaded stored in memory.

property no_label_annotation_set: konfuzio_sdk.data.AnnotationSet

Return the Annotation Set for project.no_label Annotations.

We need to load the Annotation Sets from the Server first (call self.annotation_sets()). If we created the no_label_annotation_set first, the data from the Server would no longer be loaded, because _annotation_sets would no longer be None.

property number_of_lines: int

Calculate the number of lines.

property number_of_pages: int

Calculate the number of pages.

property ocr_file_path

Return path to OCR PDF file.

pages() List[konfuzio_sdk.data.Page]

Get Pages of Document.

regex(start_offset: int, end_offset: int, search=None, max_findings_per_page=100) List[str]

Suggest a list of regex which can be used to get the Span of a document.

spans(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Span]

Return all Spans of the Document.

property text

Get Document text. Once loaded stored in memory.

update()

Update document information.

view_annotations() List[konfuzio_sdk.data.Annotation]

Get the best Annotations, where the Spans are not overlapping.
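
Example (a sketch combining the Document methods above to download a file and read the cached text; the helper name and the control flow are illustrative assumptions, not SDK code):

```python
def fetch_document(project, document_id: int):
    """Download a Document's OCR file and return its path plus the text."""
    document = project.get_document_by_id(document_id)
    path = document.get_file()   # OCR version; pass ocr_version=False for the original
    return path, document.text  # text is kept in memory once loaded
```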

Project

class konfuzio_sdk.data.Project(id_: Optional[int], project_folder=None, update=False, **kwargs)

Access the information of a Project.

add_category(category: konfuzio_sdk.data.Category)

Add Category to Project, if it does not exist.

Parameters

category – Category to add in the Project

add_document(document: konfuzio_sdk.data.Document)

Add Document to Project, if it does not exist.

add_label(label: konfuzio_sdk.data.Label)

Add Label to Project, if it does not exist.

Parameters

label – Label to add in the Project

add_label_set(label_set: konfuzio_sdk.data.LabelSet)

Add Label Set to Project, if it does not exist.

Parameters

label_set – Label Set to add in the Project

delete()

Delete the Project folder.

property documents

Return Documents with status training.

property documents_folder: str

Calculate the documents folder of the Project.

property excluded_documents

Return Documents which have been excluded.

get(update=False)

Access meta information of the Project.

Parameters

update – Update the downloaded information even if it is already available

get_categories()

Load Categories for all Label Sets in the Project.

get_category_by_id(id_: int) konfuzio_sdk.data.Category

Return a Category by ID.

Parameters

id_ – ID of the Category to get.

get_document_by_id(document_id: int) konfuzio_sdk.data.Document

Return Document by its ID.

get_label_by_id(id_: int) konfuzio_sdk.data.Label

Return a Label by ID.

Parameters

id_ – ID of the Label to get.

get_label_by_name(name: str) konfuzio_sdk.data.Label

Return Label by its name.

get_label_set_by_id(id_: int) konfuzio_sdk.data.LabelSet

Return a Label Set by ID.

Parameters

id_ – ID of the Label Set to get.

get_label_set_by_name(name: str) konfuzio_sdk.data.LabelSet

Return a Label Set by its name.

get_label_sets()

Get Label Sets in the Project.

get_labels() konfuzio_sdk.data.Label

Get ID and name of any Label in the Project.

get_meta()

Get the list of all Documents in the Project and their information.

Returns

Information of the Documents in the Project.

init_or_update_document()

Initialize Document to then decide about full, incremental or no update.

lose_weight()

Delete data of the instance.

property model_folder: str

Calculate the model folder of the Project.

property no_status_documents

Return Documents without any dataset status.

property preparation_documents

Return Documents with status preparation.

property project_folder: str

Calculate the data folder of the Project.

property regex_folder: str

Calculate the regex folder of the Project.

property test_documents

Return Documents with status test.

property virtual_documents

Return Documents created virtually.

write_project_files()

Overwrite files with Project, Label, Label Set information.
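
Example (a minimal usage sketch of the Project class; assumes konfuzio_sdk is installed and credentials are configured, which is why the import is deferred, and the helper name is illustrative):

```python
def load_training_documents(project_id: int):
    """Initialize a Project and list its training Documents."""
    from konfuzio_sdk.data import Project  # deferred: needs configured credentials

    project = Project(id_=project_id)       # downloads or reads cached meta information
    for document in project.documents:      # Documents with status training
        print(document.id_, document.category)
    return project.documents

# documents = load_training_documents(46)  # the Project ID is hypothetical
```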

Tokenizer

Abstract Tokenizer

Generic tokenizer.

class konfuzio_sdk.tokenizer.base.AbstractTokenizer

Abstract definition of a Tokenizer.

evaluate(document: konfuzio_sdk.data.Document) pandas.core.frame.DataFrame

Compare a Document with its tokenized version.

Parameters

document – Document to evaluate

Returns

Evaluation DataFrame and Processing time DataFrame.

abstract fit(category: konfuzio_sdk.data.Category)

Fit the tokenizer to the Documents of the Category.

get_runtime_info() pandas.core.frame.DataFrame

Get the processing runtime information as DataFrame.

Returns

Processing time DataFrame containing the duration of all steps of the tokenization.

lose_weight()

Delete processing steps.

missing_spans(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Apply the Tokenizer on a Document and remove all Spans that it can find.

Use this approach to sequentially work on the remaining Spans after a Tokenizer ran on a Document.

Parameters

document – Document to process

Returns

A Document containing all Spans that the Tokenizer could not find.

abstract tokenize(document: konfuzio_sdk.data.Document)

Create Annotations with 1 Span based on the result of the Tokenizer.

class konfuzio_sdk.tokenizer.base.ListTokenizer(tokenizers: List[konfuzio_sdk.tokenizer.base.AbstractTokenizer])

Use multiple tokenizers.

fit(category: konfuzio_sdk.data.Category)

Call fit on all tokenizers.

lose_weight()

Delete processing steps.

tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Run tokenize in the given order on a Document.
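
Example (a sketch of chaining regex tokenizers with ListTokenizer and checking coverage with evaluate(); the helper name and the choice of tokenizers are illustrative, and the imports are deferred because the SDK requires configured credentials):

```python
def tokenize_with_fallback(document):
    """Run a colon-preceded tokenizer first, then a whitespace fallback."""
    from konfuzio_sdk.tokenizer.base import ListTokenizer
    from konfuzio_sdk.tokenizer.regex import (
        ColonPrecededTokenizer,
        WhitespaceTokenizer,
    )

    tokenizer = ListTokenizer(tokenizers=[
        ColonPrecededTokenizer(),
        WhitespaceTokenizer(),
    ])
    tokenizer.tokenize(document)         # adds 1-Span Annotations, in order
    return tokenizer.evaluate(document)  # DataFrame comparing Document vs. tokens
```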

class konfuzio_sdk.tokenizer.base.ProcessingStep(tokenizer: str, document: konfuzio_sdk.data.Document, runtime: float)

Track runtime of Tokenizer functions.

eval_dict()

Return any information needed to evaluate the ProcessingStep.

Rule Based Tokenizer

Regex tokenizers.

class konfuzio_sdk.tokenizer.regex.CapitalizedTextTokenizer

Tokenizer based on capitalized text.

Example:

“Company is Company A&B GmbH now” -> “Company A&B GmbH”

class konfuzio_sdk.tokenizer.regex.ColonOrWhitespacePrecededTokenizer

Tokenizer based on text preceded by a colon or a whitespace.

Example:

“write to: name” -> “name”

class konfuzio_sdk.tokenizer.regex.ColonPrecededTokenizer

Tokenizer based on text preceded by colon.

Example:

“write to: name” -> “name”

class konfuzio_sdk.tokenizer.regex.ConnectedTextTokenizer

Tokenizer based on text connected by 1 whitespace.

Example:

r”This is\na description. Occupies a paragraph.” -> “This is”, “a description. Occupies a paragraph.”

class konfuzio_sdk.tokenizer.regex.LineUntilCommaTokenizer

Tokenizer that matches a line of text until the first comma.

Example:

“\n Company und A&B GmbH,\n” -> “Company und A&B GmbH”

class konfuzio_sdk.tokenizer.regex.NonTextTokenizer

Tokenizer based on non-text content: numbers and separators.

Example:

“date 01. 01. 2022” -> “01. 01. 2022”

class konfuzio_sdk.tokenizer.regex.NumbersTokenizer

Tokenizer based on numbers.

Example:

“N. 1242022 123 ” -> “1242022 123”

class konfuzio_sdk.tokenizer.regex.RegexTokenizer(regex: str)

Tokenizer based on a single regex.

fit(category: konfuzio_sdk.data.Category)

Fit the tokenizer to the Documents of the Category.

tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Create Annotations with 1 Span based on the result of the Tokenizer.

Parameters

document – Document to tokenize, can have been tokenized before

Returns

Document with Spans created by the Tokenizer.

class konfuzio_sdk.tokenizer.regex.WhitespaceNoPunctuationTokenizer

Tokenizer based on whitespaces without punctuation.

Example:

“street Name 1-2b,” -> “street”, “Name”, “1-2b”

class konfuzio_sdk.tokenizer.regex.WhitespaceTokenizer

Tokenizer based on whitespaces.

Example:

“street Name 1-2b,” -> “street”, “Name”, “1-2b,”
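
Example (a plain-Python demonstration of the idea behind this tokenizer using the standard re module; the exact pattern is an assumption and the SDK's internal regex may differ):

```python
import re

# Each run of non-whitespace characters becomes a Span candidate,
# with start/end offsets into the Document text.
pattern = re.compile(r"\S+")
text = "street Name 1-2b,"
spans = [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]
# -> [(0, 6, "street"), (7, 11, "Name"), (12, 17, "1-2b,")]
```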