API Reference

Reference guides are technical descriptions of the machinery and how to operate it. Reference material is information-oriented.

Data


Handle data from the API.

Span

class konfuzio_sdk.data.Span(start_offset: int, end_offset: int, annotation=None, strict_validation: bool = True)

A Span is a sequence of characters or whitespaces without a line break.

bbox() konfuzio_sdk.data.Bbox

Calculate the bounding box of a text sequence.

bbox_dict() Dict

Return Span Bbox info as a serializable Dict format for external integration with the Konfuzio Server.

eval_dict()

Return any information needed to evaluate the Span.

property line_index: int

Return index of the line of the Span.

property normalized

Normalize the offset string.

property offset_string: Optional[str]

Calculate the offset string of a Span.

property page: konfuzio_sdk.data.Page

Return Page of Span.

regex()

Suggest a Regex for the offset string.
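Example (a minimal usage sketch; the Project ID and the assumption that the first training Document has at least one Annotation are placeholders):

    from konfuzio_sdk.data import Project

    project = Project(id_=46)          # assumed Project ID
    document = project.documents[0]    # a training Document
    annotation = document.annotations()[0]
    span = annotation.spans[0]
    print(span.offset_string)          # text covered by the Span
    print(span.line_index)             # index of the line containing the Span
    print(span.bbox_dict())            # serializable Bbox info for the Konfuzio Server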

Bbox

class konfuzio_sdk.data.Bbox(x0: int, x1: int, y0: int, y1: int, page: konfuzio_sdk.data.Page, validation=BboxValidationTypes.ALLOW_ZERO_SIZE)

A bounding box relates to an area of a Document Page.

What constitutes a valid Bbox depends on the value of the validation parameter. If ALLOW_ZERO_SIZE (default), bounding boxes may have zero width or height. This option is available for compatibility reasons, since some OCR engines can return character-level bboxes with zero width or height. If STRICT, zero-size bboxes are not allowed. If DISABLED, bboxes may have negative size or coordinates beyond the Page bounds. For the default behaviour see https://dev.konfuzio.com/sdk/tutorials.html#data-validation-rules.

Parameters

validation – One of ALLOW_ZERO_SIZE (default), STRICT, or DISABLED.

property area

Return area covered by the Bbox.
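Example (a hedged sketch of constructing a Bbox directly; the Project ID and coordinates are placeholders, and importing BboxValidationTypes from konfuzio_sdk.data is an assumption based on the signature above):

    from konfuzio_sdk.data import Bbox, BboxValidationTypes, Project

    project = Project(id_=46)                  # assumed Project ID
    page = project.documents[0].pages()[0]     # first Page of a training Document
    bbox = Bbox(x0=10, x1=60, y0=100, y1=120, page=page,
                validation=BboxValidationTypes.ALLOW_ZERO_SIZE)
    print(bbox.area)                           # area covered by the Bbox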

Annotation

class konfuzio_sdk.data.Annotation(document: konfuzio_sdk.data.Document, annotation_set_id: Optional[int] = None, annotation_set: Optional[konfuzio_sdk.data.AnnotationSet] = None, label: Optional[Union[int, konfuzio_sdk.data.Label]] = None, label_set_id: Optional[int] = None, label_set: Union[None, konfuzio_sdk.data.LabelSet] = None, is_correct: bool = False, revised: bool = False, normalized=None, id_: Optional[int] = None, spans=None, accuracy: Optional[float] = None, confidence: Optional[float] = None, created_by: Optional[int] = None, revised_by: Optional[int] = None, translated_string: Optional[str] = None, custom_offset_string: bool = False, offset_string: str = False, *args, **kwargs)

Hold the information that a Label, Label Set and Annotation Set have been assigned to one or more Spans, and combine those Spans.

add_span(span: konfuzio_sdk.data.Span)

Add a Span to an Annotation incl. a duplicate check per Annotation.

property bboxes: List[Dict]

Return the Bbox information for all Spans in serialized format.

This is useful for external integration (e.g. Konfuzio Server).

delete(delete_online: bool = True) None

Delete Annotation.

Parameters

delete_online – Whether the Annotation is deleted online or only locally.

property end_offset: int

Legacy: One Annotation can have multiple end offsets.

property eval_dict: List[dict]

Calculate the Span information to evaluate the Annotation.

get_link()

Get link to the Annotation in the SmartView.

property is_multiline: int

Calculate if Annotation spans multiple lines of text.

lose_weight()

Delete data of the instance.

property normalize: str

Provide one normalized offset string (legacy behavior).

property offset_string: List[str]

View the string representation of the Annotation.

regex()

Return regex of this Annotation.

regex_annotation_generator(regex_list) List[konfuzio_sdk.data.Span]

Build Spans without Labels by regexes.

Returns

Return sorted list of Spans by start_offset

save(document_annotations: Optional[list] = None) bool

Save Annotation online.

If there is already an Annotation in the same place as the current one, we will not be able to save the current Annotation.

In that case, we get the id_ of the original one to be able to track it. The verification of duplicates is done by checking if the offsets and Label match any Annotations online. To be sure that we are comparing with the information online, the Document needs to be updated. The update can be done after the request (per Annotation), or the updated Annotations can be passed as input to the function (advisable when dealing with large Documents or Documents with many Annotations).

Parameters

document_annotations – Annotations in the Document (list)

Returns

True if new Annotation was created

property spans: List[konfuzio_sdk.data.Span]

Return default entry to get all Spans of the Annotation.

property start_offset: int

Legacy: One Annotation can have multiple start offsets.

token_append(new_regex, regex_quality: int)

Append token if it is not a duplicate.

tokens() List[str]

Create a list of potential tokens based on Spans of this Annotation.
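Example (a hedged sketch of creating and saving an Annotation; the Label name 'Date' and the offsets are placeholders, and the exact set of required constructor arguments may differ):

    from konfuzio_sdk.data import Annotation, Project, Span

    project = Project(id_=46)                     # assumed Project ID
    document = project.documents[0]
    label = project.get_label_by_name('Date')     # assumed Label name
    label_set = label.label_sets[0]               # assumed: Label knows its Label Sets

    span = Span(start_offset=5, end_offset=15)
    annotation = Annotation(
        document=document,
        label=label,
        label_set=label_set,
        spans=[span],
        is_correct=True,
    )
    created = annotation.save()                   # True if a new Annotation was created online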

Annotation Set

class konfuzio_sdk.data.AnnotationSet(document, label_set: konfuzio_sdk.data.LabelSet, id_: Optional[int] = None, **kwargs)

An Annotation Set is a group of Annotations. The Labels of those Annotations refer to the same Label Set.

annotations(use_correct: bool = True, ignore_below_threshold: bool = False)

All Annotations currently in this Annotation Set.

property end_line_index: Optional[int]

Calculate ending line of this Annotation Set.

property end_offset: Optional[int]

Calculate the end based on all Annotations above detection threshold currently in this AnnotationSet.

property start_line_index: Optional[int]

Calculate starting line of this Annotation Set.

property start_offset: Optional[int]

Calculate the earliest start based on all Annotations above detection threshold in this AnnotationSet.

Label

class konfuzio_sdk.data.Label(project, id_: Optional[int] = None, text: Optional[str] = None, get_data_type_display: str = 'Text', text_clean: Optional[str] = None, description: Optional[str] = None, label_sets=None, has_multiple_top_candidates: bool = False, threshold: float = 0.1, *initial_data, **kwargs)

Group Annotations across Label Sets.

add_label_set(label_set: konfuzio_sdk.data.LabelSet)

Add Label Set to label, if it does not exist.

Parameters

label_set – Label Set to add

annotations(categories: List[konfuzio_sdk.data.Category], use_correct=True, ignore_below_threshold=False) List[konfuzio_sdk.data.Annotation]

Return related Annotations. Consider that one Label can be used across Label Sets in multiple Categories.

base_regex(category: konfuzio_sdk.data.Category, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None) str

Find the best combination of regex in the list of all regex proposed by Annotations.

evaluate_regex(regex, category: konfuzio_sdk.data.Category, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None, regex_quality=0)

Evaluate a regex on Categories.

The regex_quality parameter allows you to group regexes by generality.

Example:

Three Annotations about the birthdate in two Documents, and one regex to be evaluated:

1.doc: “I was born on the 12th of December 1980, you could also say 12.12.1980.” (2 Annotations)

2.doc: “I was born on 12.06.1997.” (1 Annotation)

regex: dd.dd.dddd (without escaped characters for easier reading)

stats:

total_correct_findings: 2
correct_label_annotations: 3
total_findings: 2 –> precision 100%
num_docs_matched: 2
Project.documents: 2 –> Document recall 100%

find_regex(category: konfuzio_sdk.data.Category, max_findings_per_page=100) List[str]

Find the best combination of regex for Label with before and after context.

has_multiline_annotations(categories: Optional[List[konfuzio_sdk.data.Category]] = None) bool

Return if any Label annotations are multi-line.

lose_weight()

Delete data of the instance.

regex(categories: List[konfuzio_sdk.data.Category], update=False) List

Calculate regex to be used in the Extraction AI.

spans_not_found_by_tokenizer(tokenizer, categories: List[konfuzio_sdk.data.Category], use_correct=False) List[konfuzio_sdk.data.Span]

Find Label Spans that are not found by a tokenizer.
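Example (a sketch of checking tokenizer coverage for a Label; the Project ID, Category ID and Label name are placeholders):

    from konfuzio_sdk.data import Project
    from konfuzio_sdk.tokenizer.regex import WhitespaceTokenizer

    project = Project(id_=46)                   # assumed Project ID
    category = project.get_category_by_id(63)   # assumed Category ID
    label = project.get_label_by_name('Date')   # assumed Label name

    tokenizer = WhitespaceTokenizer()
    missed = label.spans_not_found_by_tokenizer(tokenizer, categories=[category])
    print(f'{len(missed)} Label Spans are not found by the WhitespaceTokenizer')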

Label Set

class konfuzio_sdk.data.LabelSet(project, labels=None, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, is_default=False, categories=None, has_multiple_annotation_sets=False, **kwargs)

A Label Set is a group of Labels.

add_category(category: konfuzio_sdk.data.Category)

Add Category to Project, if it does not exist.

Parameters

category – Category to add in the Project

add_label(label)

Add Label to Label Set, if it does not exist.

Parameters

label – Label ID to be added

get_target_names(use_separate_labels: bool)

Get target string name for Annotation Label classification.

Category

class konfuzio_sdk.data.Category(project, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, *args, **kwargs)

Group Documents in a Project.

add_label_set(label_set)

Add Label Set to Category.

documents()

Filter for Documents of this Category.

exclusive_first_page_strings(tokenizer) set

Return a set of strings exclusive for first Pages of Documents within the Category.

Parameters

tokenizer – A tokenizer to process Documents before gathering strings.

property fallback_name: str

Turn the Category name to lowercase, remove parentheses along with their contents, and trim spaces.

property labels

Return the Labels that belong to the Category and its Label Sets.

test_documents()

Filter for test Documents of this Category.
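Example (a sketch of working with a Category; the Project ID and Category ID are placeholders):

    from konfuzio_sdk.data import Project

    project = Project(id_=46)                   # assumed Project ID
    category = project.get_category_by_id(63)   # assumed Category ID
    training_documents = category.documents()   # Documents of this Category
    test_documents = category.test_documents()  # test Documents of this Category
    print(category.fallback_name)               # lowercase name, parentheses removed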

Category Annotation

class konfuzio_sdk.data.CategoryAnnotation(category: konfuzio_sdk.data.Category, confidence: Optional[float] = None, page: Optional[konfuzio_sdk.data.Page] = None, document: Optional[konfuzio_sdk.data.Document] = None, id_: Optional[int] = None)

Annotate the Category of a Page.

property confidence: float

Get the confidence of this Category Annotation.

If the confidence was not set, it means it was never predicted by an AI. Thus, the returned value will be 0, unless it was set by a human, in which case it defaults to 1.

Returns

Confidence between 0.0 and 1.0 included.

set_revised() None

Set this Category Annotation as revised by human, and thus the correct one for the linked Page.

Document

class konfuzio_sdk.data.Document(project: konfuzio_sdk.data.Project, id_: Optional[int] = None, file_url: Optional[str] = None, status: Optional[List[Union[int, str]]] = None, data_file_name: Optional[str] = None, is_dataset: Optional[bool] = None, dataset_status: Optional[int] = None, updated_at: Optional[str] = None, assignee: Optional[int] = None, category_template: Optional[int] = None, category: Optional[konfuzio_sdk.data.Category] = None, category_confidence: Optional[float] = None, category_is_revised: bool = False, text: Optional[str] = None, bbox: Optional[dict] = None, bbox_validation_type=None, pages: Optional[list] = None, update: Optional[bool] = None, copy_of_id: Optional[int] = None, *args, **kwargs)

Access the information about one Document, which is available online.

add_annotation(annotation: konfuzio_sdk.data.Annotation)

Add an Annotation to a Document.

The Annotation is only added to the Document if the data validation tests are passing for this Annotation. See https://dev.konfuzio.com/sdk/tutorials.html#data-validation-rules.

Parameters

annotation – Annotation to add in the Document

Returns

Input Annotation.

add_annotation_set(annotation_set: konfuzio_sdk.data.AnnotationSet)

Add the Annotation Sets to the Document.

add_page(page: konfuzio_sdk.data.Page)

Add a Page to a Document.

annotation_sets()

Return the Annotation Sets of the Document.

annotations(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = True, ignore_below_threshold: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Annotation]

Filter available annotations.

Parameters
  • label – Label for which to filter the Annotations.

  • use_correct – If to filter by correct Annotations.

  • ignore_below_threshold – To filter out Annotations with confidence below Label prediction threshold.

Returns

Annotations in the document.
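Example (a sketch of filtering Annotations; the Project ID and Label name are placeholders):

    from konfuzio_sdk.data import Project

    project = Project(id_=46)                   # assumed Project ID
    document = project.documents[0]
    label = project.get_label_by_name('Date')   # assumed Label name
    correct_dates = document.annotations(label=label, use_correct=True)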

property bboxes: Dict[int, konfuzio_sdk.data.Bbox]

Use the cached bbox version.

property category: Optional[konfuzio_sdk.data.Category]

Return the Category of the Document.

The Category of a Document is only defined as long as all Pages have the same Category. Otherwise, the Document should probably be split into multiple Documents with a consistent Category assignment within their Pages, or the Category for each Page should be manually revised.

property category_annotations: List[konfuzio_sdk.data.CategoryAnnotation]

Collect Category Annotations and average confidence across all Pages.

Returns

List of Category Annotations, one for each Category.

check_annotations(update_document: bool = False) bool

Check if Annotations are valid - no duplicates and correct Category.

check_bbox() None

Run validation checks on the Document text and bboxes.

This is run when the Document is initialized, and it usually doesn't need to be run again, because a Document's text and bboxes are not expected to change within the Konfuzio Server.

You can run this manually if your pipeline allows changing the text or the bboxes during the lifetime of a Document. It raises a ValueError if the bboxes don't match the text of the Document, or if bboxes have invalid coordinates (outside Page borders) or invalid size (negative width or height).

This check is usually slow, and it can be made faster by calling Document.set_text_bbox_hashes() right after initializing the Document, which will enable running a hash comparison during this check.

create_subdocument_from_page_range(start_page: konfuzio_sdk.data.Page, end_page: konfuzio_sdk.data.Page, include=False)

Create a shorter Document from a Page range of an initial Document.

Parameters
  • start_page (Page) – A Page that the new sub-Document starts with.

  • end_page (Page) – A Page that the new sub-Document ends with, if include is True.

  • include (bool) – Whether end_page is included into the new sub-Document.

Returns

A new sub-Document.

delete(delete_online: bool = False)

Delete Document.

delete_document_details()

Delete all local content information for the Document.

property document_folder

Get the path to the folder where all the Document information is cached locally.

download_document_details()

Retrieve data from a Document online once the Document has finished processing.

eval_dict(use_view_annotations=False, use_correct=False, ignore_below_threshold=False) List[dict]

Use this dict to evaluate Documents. Note: one entry is created for every Span of an Annotation.

evaluate_regex(regex, label: konfuzio_sdk.data.Label, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None)

Evaluate a regex based on the Document.

property file_path

Return path to file.

classmethod from_file_async(path: str, project: konfuzio_sdk.data.Project, dataset_status: int = 0, category_id: Optional[int] = None, callback_url: str = '', timeout: Optional[int] = None) int

Initialize Document from file with an asynchronous API call.

This class method asynchronously uploads a file to the Konfuzio API and returns the ID of the newly created Document. Use this method if you want to create a new Document but don't want to wait for the server to process it. It requires updating the Project at a later point to be able to work with the new Document.

Parameters
  • path – Path to file to be uploaded

  • project – Project the Document is uploaded to

  • dataset_status – Dataset status of the Document (None: 0, Preparation: 1, Training: 2, Test: 3, Excluded: 4)

  • category_id – Category the Document belongs to (if unset, it will be assigned one by the server)

  • callback_url – Callback URL receiving a POST call once extraction is done

  • timeout – Number of seconds to wait for response from the server

Returns

ID of new Document

classmethod from_file_sync(path: str, project: konfuzio_sdk.data.Project, dataset_status: int = 0, category_id: Optional[int] = None, callback_url: str = '', timeout: Optional[int] = None) konfuzio_sdk.data.Document

Initialize Document from file with synchronous API call.

This class method will wait for the document to be processed by the server and then return the new Document. This may take a bit of time. When uploading many documents, it is advised to use the Document.from_file_async method.

Parameters
  • path – Path to file to be uploaded

  • project – Project the Document is uploaded to

  • dataset_status – Dataset status of the Document (None: 0, Preparation: 1, Training: 2, Test: 3, Excluded: 4)

  • category_id – Category the Document belongs to (if unset, it will be assigned one by the server)

  • callback_url – Callback URL receiving POST call once extraction is done

  • timeout – Number of seconds to wait for response from the server

Returns

New Document
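Example (a hedged sketch of both upload flows; the Project ID and file path are placeholders):

    from konfuzio_sdk.data import Document, Project

    project = Project(id_=46)  # assumed Project ID

    # Synchronous: blocks until the server has processed the file.
    document = Document.from_file_sync('invoice.pdf', project=project)

    # Asynchronous: returns the new Document ID immediately.
    document_id = Document.from_file_async('invoice.pdf', project=project)
    project.init_or_update_document(from_online=True)  # reload to access the new Document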

get_annotation_by_id(annotation_id: int) konfuzio_sdk.data.Annotation

Return an Annotation by ID, searching within the Document.

Parameters

annotation_id – ID of the Annotation to get.

get_annotation_set_by_id(id_: int) konfuzio_sdk.data.AnnotationSet

Return an Annotation Set by ID.

Parameters

id – ID of the Annotation Set to get.

get_annotations() List[konfuzio_sdk.data.Annotation]

Get Annotations of the Document.

get_bbox() Dict

Get bbox information per character of file. We don’t store bbox as an attribute to save memory.

Returns

Bounding box information per character in the Document.

get_document_classifier_examples(text_vocab, category_vocab, max_len, use_image, use_text)

Get the per-document examples for the document classifier.

get_file(ocr_version: bool = True, update: bool = False)

Get OCR version of the original file.

Parameters
  • ocr_version – Whether to get the OCR version of the original file

  • update – Update the downloaded file even if it is already available

Returns

Path to the selected file.

get_images(update: bool = False)

Get Document Pages as PNG images.

Parameters

update – Update the downloaded images even if they are already available

Returns

Path to PNG files.

get_page_by_index(page_index: int)

Return the Page by index.

get_text_in_bio_scheme(update=False) List[Tuple[str, str]]

Get the text of the Document in the BIO scheme.

Parameters

update – Update the BIO annotations even if they are already available

Returns

list of tuples with each word in the text and the respective label

property hocr

Get the hOCR of the Document. Once loaded, it is stored in memory.

lose_weight()

Remove NO_LABEL, wrong and below threshold Annotations.

property maximum_confidence_category: Optional[konfuzio_sdk.data.Category]

Get the human revised Category of this Document, or the highest confidence one if not revised.

Returns

The found Category, or None if not present.

property maximum_confidence_category_annotation: Optional[konfuzio_sdk.data.CategoryAnnotation]

Get the human revised Category Annotation of this Document, or the highest confidence one if not revised.

Returns

The found Category Annotation, or None if not present.

property no_label_annotation_set: konfuzio_sdk.data.AnnotationSet

Return the Annotation Set for project.no_label Annotations.

We need to load the Annotation Sets from the Server first (call self.annotation_sets()). If we created the no_label_annotation_set first, the data from the Server would no longer be loaded, because _annotation_sets would no longer be None.

property number_of_lines: int

Calculate the number of lines.

property number_of_pages: int

Calculate the number of Pages.

property ocr_file_path

Return path to OCR PDF file.

pages() List[konfuzio_sdk.data.Page]

Get Pages of Document.

propose_splitting(splitting_ai) List

Propose splitting for a multi-file Document.

Parameters

splitting_ai – An initialized SplittingAI instance

save()

Save all local changes to the Document to the server.

save_meta_data()

Save local changes to the Document's metadata to the server.

set_bboxes(characters: Dict[int, konfuzio_sdk.data.Bbox])

Set character Bbox dictionary.

set_category(category: konfuzio_sdk.data.Category) None

Set the Category of the Document and the Category of all of its Pages as revised.

set_text_bbox_hashes() None

Update hashes of Document text and bboxes. Can be used for checking later on if any changes happened.

spans(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Span]

Return all Spans of the Document.

property text

Get Document text. Once loaded stored in memory.

update()

Update Document information.

update_meta_data(assignee: Optional[int] = None, category_template: Optional[int] = None, category: Optional[konfuzio_sdk.data.Category] = None, data_file_name: Optional[str] = None, dataset_status: Optional[int] = None, status: Optional[List[Union[int, str]]] = None, **kwargs)

Update document metadata information.

view_annotations(start_offset: int = 0, end_offset: Optional[int] = None) List[konfuzio_sdk.data.Annotation]

Get the best Annotations, where the Spans are not overlapping.

Page

class konfuzio_sdk.data.Page(id_: Optional[int], document: konfuzio_sdk.data.Document, number: int, original_size: Tuple[float, float], start_offset: Optional[int] = None, end_offset: Optional[int] = None, category: Optional[konfuzio_sdk.data.Category] = None, copy_of_id: Optional[int] = None)

Access the information about one Page of a Document.

add_category_annotation(category_annotation: konfuzio_sdk.data.CategoryAnnotation)

Annotate a Page with a Category and confidence information.

annotations(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = True, ignore_below_threshold: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Annotation]

Get Page Annotations.

property category: Optional[konfuzio_sdk.data.Category]

Get the Category of the Page, based on human revised Category Annotation, or on highest confidence.

get_annotations_image(image: PIL.Image = None)

Get Document Page as PNG with Annotations shown.

get_bbox()

Get bbox information per character of Page.

get_category_annotation(category, add_if_not_present: bool = False) konfuzio_sdk.data.CategoryAnnotation

Get the Category Annotation corresponding to a Category in this Page.

If no Category Annotation is found with the provided Category, one is created. See the add_if_not_present argument.

Parameters
  • category – The Category to filter for.

  • add_if_not_present – Adds the Category Annotation to the current Page if not present. Otherwise it creates a dummy Category Annotation, not linked to any Document or Page.

Returns

The found or created Category Annotation.

get_image(update: bool = False)

Get Document Page as PNG.

property maximum_confidence_category_annotation: Optional[konfuzio_sdk.data.CategoryAnnotation]

Get the human revised Category Annotation of this Page, or the highest confidence one if not revised.

Returns

The found Category Annotation, or None if not present.

property number_of_lines: int

Calculate the number of lines in Page.

set_category(category: konfuzio_sdk.data.Category) None

Set the Category of the Page.

Parameters

category – The Category to set for the Page.

spans(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Span]

Return all Spans of the Page.

property text

Get Document text corresponding to the Page.

view_annotations() List[konfuzio_sdk.data.Annotation]

Get the best Annotations, where the Spans are not overlapping in Page.
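Example (a sketch of Page access; the Project ID is a placeholder):

    from konfuzio_sdk.data import Project

    project = Project(id_=46)                 # assumed Project ID
    document = project.documents[0]
    page = document.get_page_by_index(0)      # first Page
    print(page.number_of_lines)               # number of lines on the Page
    page_png = page.get_image(update=False)   # PNG of the Page (cached locally)
    best = page.view_annotations()            # best non-overlapping Annotations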

Project

class konfuzio_sdk.data.Project(id_: Optional[int], project_folder=None, update=False, max_ram=None, strict_data_validation: bool = True, **kwargs)

Access the information of a Project.
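Example (a minimal sketch; Project ID 46 is a placeholder):

    from konfuzio_sdk.data import Project

    project = Project(id_=46)            # assumed Project ID
    print(len(project.documents))        # Documents with training status
    print(len(project.test_documents))   # Documents with test status
    project.get(update=True)             # refresh meta information from the server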

add_category(category: konfuzio_sdk.data.Category)

Add Category to Project, if it does not exist.

Parameters

category – Category to add in the Project

add_document(document: konfuzio_sdk.data.Document)

Add Document to Project, if it does not exist.

add_label(label: konfuzio_sdk.data.Label)

Add Label to Project, if it does not exist.

Parameters

label – Label to add in the Project

add_label_set(label_set: konfuzio_sdk.data.LabelSet)

Add Label Set to Project, if it does not exist.

Parameters

label_set – Label Set to add in the Project

del_document_by_id(document_id: int, delete_online: bool = False) konfuzio_sdk.data.Document

Delete Document by its ID.

delete()

Delete the Project folder.

property documents

Return Documents with status training.

property documents_folder: str

Calculate the documents folder of the Project.

property excluded_documents

Return Documents which have been excluded.

get(update=False)

Access meta information of the Project.

Parameters

update – Update the downloaded information even if it is already available

get_categories()

Load Categories for all Label Sets in the Project.

get_category_by_id(id_: int) konfuzio_sdk.data.Category

Return a Category by ID.

Parameters

id – ID of the Category to get.

get_document_by_id(document_id: int) konfuzio_sdk.data.Document

Return Document by its ID.

get_label_by_id(id_: int) konfuzio_sdk.data.Label

Return a Label by ID.

Parameters

id – ID of the Label to get.

get_label_by_name(name: str) konfuzio_sdk.data.Label

Return Label by its name.

get_label_set_by_id(id_: int) konfuzio_sdk.data.LabelSet

Return a Label Set by ID.

Parameters

id – ID of the Label Set to get.

get_label_set_by_name(name: str) konfuzio_sdk.data.LabelSet

Return a Label Set by its name.

get_label_sets(reload=False)

Get LabelSets in the Project.

get_labels(reload=False) konfuzio_sdk.data.Label

Get ID and name of any Label in the Project.

get_meta(reload=False)

Get the list of all Documents in the Project and their information.

Returns

Information of the Documents in the Project.

init_or_update_document(from_online=False)

Initialize or update Documents from local files to then decide about full, incremental or no update.

Parameters

from_online – If True, all Document metadata info is first reloaded with latest changes in the server

property label_sets

Return Project LabelSets.

property labels

Return Project Labels.

lose_weight()

Delete data of the instance.

property max_ram

Return maximum memory used by AI models.

property meta_data

Return Project meta data.

property model_folder: str

Calculate the model folder of the Project.

property no_status_documents

Return Documents with no status.

property online_documents_dict: Dict

Return a dictionary of online documents using their id as key.

property preparation_documents

Return Documents with status preparation.

property project_folder: str

Calculate the data folder of the Project.

property regex_folder: str

Calculate the regex folder of the Project.

property test_documents

Return Documents with status test.

property virtual_documents

Return Documents created virtually.

write_meta_of_files()

Overwrite meta-data of Documents in Project.

write_project_files()

Overwrite files with Project, Label, Label Set information.

Tokenizers


Generic tokenizer.

Abstract Tokenizer

class konfuzio_sdk.tokenizer.base.AbstractTokenizer

Abstract definition of a Tokenizer.

evaluate(document: konfuzio_sdk.data.Document) pandas.core.frame.DataFrame

Compare a Document with its tokenized version.

Parameters

document – Document to evaluate

Returns

Evaluation DataFrame

evaluate_dataset(dataset_documents: List[konfuzio_sdk.data.Document]) konfuzio_sdk.evaluate.ExtractionEvaluation

Evaluate the tokenizer on a dataset of documents.

Parameters

dataset_documents – Documents to evaluate

Returns

ExtractionEvaluation instance

abstract fit(category: konfuzio_sdk.data.Category)

Fit the tokenizer on the Documents of the Category.

abstract found_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]

Find all Spans in a Document that can be found by a Tokenizer.

get_runtime_info() pandas.core.frame.DataFrame

Get the processing runtime information as DataFrame.

Returns

processing time Dataframe containing the processing duration of all steps of the tokenization.

lose_weight()

Delete processing steps.

missing_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]

Apply a Tokenizer on a Document and find all Spans that cannot be found.

Use this approach to sequentially work on remaining Spans after a Tokenizer ran on a List of Documents.

Parameters

document – A Document

Returns

A list containing all missing Spans.

abstract tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Create Annotations with 1 Span based on the result of the Tokenizer.

Parameters

document – Document to tokenize, can have been tokenized before

Returns

Document with Spans created by the Tokenizer.

List Tokenizer

class konfuzio_sdk.tokenizer.base.ListTokenizer(tokenizers: List[konfuzio_sdk.tokenizer.base.AbstractTokenizer])

Use multiple tokenizers.

fit(category: konfuzio_sdk.data.Category)

Call fit on all tokenizers.

found_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]

Run found_spans in the given order on a Document.

lose_weight()

Delete processing steps.

span_match(span: konfuzio_sdk.data.Span) bool

Run span_match in the given order.

tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Run tokenize in the given order on a Document.
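Example (a sketch combining two of the regex tokenizers documented below; the Project ID is a placeholder, and tokenizing a deep copy to keep the original Document intact is an assumption):

    from copy import deepcopy

    from konfuzio_sdk.data import Project
    from konfuzio_sdk.tokenizer.base import ListTokenizer
    from konfuzio_sdk.tokenizer.regex import ColonPrecededTokenizer, NumbersTokenizer

    project = Project(id_=46)  # assumed Project ID
    tokenizer = ListTokenizer(tokenizers=[ColonPrecededTokenizer(), NumbersTokenizer()])
    virtual_document = tokenizer.tokenize(deepcopy(project.documents[0]))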

Rule Based Tokenizer

Regex tokenizers.

Regex Tokenizer

class konfuzio_sdk.tokenizer.regex.RegexTokenizer(regex: str)

Tokenizer based on a single regex.

fit(category: konfuzio_sdk.data.Category)

Fit the tokenizer on the Documents of the Category.

found_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]

Find Spans found by the Tokenizer and add Tokenizer info to Span.

Parameters

document – Document with Annotation to find.

Returns

List of Spans found by the Tokenizer.

span_match(span: konfuzio_sdk.data.Span) bool

Check if Span is detected by Tokenizer.

tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Create Annotations with 1 Span based on the result of the Tokenizer.

Parameters

document – Document to tokenize, can have been tokenized before

Returns

Document with Spans created by the Tokenizer.
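Example (a sketch with a date-like pattern; the Project ID and regex are placeholders, and tokenizing a deep copy to keep the original Document intact is an assumption):

    from copy import deepcopy

    from konfuzio_sdk.data import Project
    from konfuzio_sdk.tokenizer.regex import RegexTokenizer

    project = Project(id_=46)   # assumed Project ID
    document = project.documents[0]

    tokenizer = RegexTokenizer(regex=r'\d{2}\.\d{2}\.\d{4}')  # placeholder date pattern
    virtual_document = tokenizer.tokenize(deepcopy(document))
    spans = tokenizer.found_spans(document)   # Spans the Tokenizer can find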


class konfuzio_sdk.tokenizer.regex.CapitalizedTextTokenizer

Tokenizer based on capitalized text.

Example:

“Company is Company A&B GmbH now” -> “Company A&B GmbH”

class konfuzio_sdk.tokenizer.regex.ColonOrWhitespacePrecededTokenizer

Tokenizer based on text preceded by a colon or a whitespace.

Example:

“write to: name” -> “name”

class konfuzio_sdk.tokenizer.regex.ColonPrecededTokenizer

Tokenizer based on text preceded by colon.

Example:

“write to: name” -> “name”

class konfuzio_sdk.tokenizer.regex.ConnectedTextTokenizer

Tokenizer based on text connected by 1 whitespace.

Example:

r”This is \na description. Occupies a paragraph.” -> “This is”, “a description. Occupies a paragraph.”

class konfuzio_sdk.tokenizer.regex.LineUntilCommaTokenizer

Tokenizer matching text from the start of a line until a comma.

Example:

“\n Company und A&B GmbH,\n” -> “Company und A&B GmbH”

class konfuzio_sdk.tokenizer.regex.NonTextTokenizer

Tokenizer based on non-text content (numbers and separators).

Example:

“date 01. 01. 2022” -> “01. 01. 2022”

class konfuzio_sdk.tokenizer.regex.NumbersTokenizer

Tokenizer based on numbers.

Example:

“N. 1242022 123 ” -> “1242022 123”


class konfuzio_sdk.tokenizer.regex.WhitespaceNoPunctuationTokenizer

Tokenizer based on whitespaces without punctuation.

Example:

“street Name 1-2b,” -> “street”, “Name”, “1-2b”

class konfuzio_sdk.tokenizer.regex.WhitespaceTokenizer

Tokenizer based on whitespaces.

Example:

“street Name 1-2b,” -> “street”, “Name”, “1-2b,”

Extraction AI


Extract information from Documents.

Conventional template matching based approaches fail to generalize well to document images of unseen templates, and are not robust against text recognition errors.

We follow the approach proposed by Sun et al. (2021): detected text regions are encoded as nodes of a graph using both their visual and textual features, while the edges represent the spatial relations between neighboring text regions. Their experiments validate that all information, including visual features, textual features and spatial relations, can benefit key information extraction.

We reduce the hardware requirements from one NVIDIA Titan X GPU with 12 GB memory to one CPU and 16 GB memory by splitting the end-to-end pipeline into two parts.

Sun, H., Kuang, Z., Yue, X., Lin, C., & Zhang, W. (2021). Spatial Dual-Modality Graph Reasoning for Key Information Extraction. arXiv. https://doi.org/10.48550/ARXIV.2103.14470

Base Model

class konfuzio_sdk.trainer.information_extraction.BaseModel

Base model to define common methods for all AIs.

abstract check_is_ready()

Check if the Model is ready for inference.

ensure_model_memory_usage_within_limit(max_ram: Optional[str] = None)

Ensure that a model is not exceeding allowed max_ram.

Parameters

max_ram (str) – Specify maximum memory usage condition to save model.

abstract static has_compatible_interface(other)

Validate that an instance of an AI implements the same interface defined by this AI class.

Parameters

other – An instance of an AI to compare with.

property name

Model class name.

name_lower()

Convert class name to machine-readable name.

abstract property pkl_file_path

Generate a path for a resulting pickle file.

reduce_model_weight()

Remove all non-strictly necessary parameters before saving.

save(output_dir: Optional[str] = None, include_konfuzio=True, reduce_weight=True, keep_documents=False, max_ram=None)

Save the label model as bz2 compressed pickle object to the release directory.

Saving is done by: getting the serialized pickle object (via cloudpickle), “optimizing” the serialized object with the built-in pickletools.optimize function (see https://docs.python.org/3/library/pickletools.html), and saving the optimized serialized object.

We then compress the pickle file with bz2 using shutil.copyfileobj, which writes in chunks to avoid loading the entire pickle file in memory.

Finally, we delete the cloudpickle file and are left with the bz2 file which has a .pkl extension.

Parameters
  • output_dir – Folder to save AI model in. If None, the default Project folder is used.

  • include_konfuzio – Enables pickle serialization as a value, not as a reference (for more info, read https://github.com/cloudpipe/cloudpickle#overriding-pickles-serialization-mechanism-for-importable-constructs).

  • reduce_weight – Remove all non-strictly necessary parameters before saving.

  • max_ram – Specify maximum memory usage condition to save model.

Raises

MemoryError – When the size of the model in memory is greater than the maximum value.

Returns

Path of the saved model file.

abstract property temp_pkl_file_path

Generate a path for temporary pickle file.

Trainer

class konfuzio_sdk.trainer.information_extraction.Trainer(category: konfuzio_sdk.data.Category, *args, **kwargs)

Parent class for all Extraction AIs, to extract information from unstructured human readable text.

static add_extractions_as_annotations(extractions: pandas.core.frame.DataFrame, document: konfuzio_sdk.data.Document, label: konfuzio_sdk.data.Label, label_set: konfuzio_sdk.data.LabelSet, annotation_set: konfuzio_sdk.data.AnnotationSet) None

Add the extraction of a model to the document.

evaluate()

Use as a placeholder function.

extract()

Use as a placeholder function.

extraction_result_to_document(document: konfuzio_sdk.data.Document, extraction_result: dict) konfuzio_sdk.data.Document

Return a virtual Document annotated with AI Model output.

fit()

Use as a placeholder function.

static flush_buffer(buffer: List[pandas.core.series.Series], doc_text: str) Dict

Merge a buffer of entities into a dictionary (which will eventually be turned into a DataFrame).

A buffer is a list of pandas.Series objects.

static has_compatible_interface(other) bool

Validate that an instance of an Extraction AI implements the same interface as Trainer.

An Extraction AI should implement methods with the same signature as:
  • Trainer.__init__

  • Trainer.fit

  • Trainer.extract

  • Trainer.check_is_ready

Parameters

other – An instance of an Extraction AI to compare with.

static is_valid_horizontal_merge(row: pandas.core.series.Series, buffer: List[pandas.core.series.Series], doc_text: str, max_offset_distance: int = 5) bool

Verify if the merging that we are trying to do is valid.

A merging is valid only if:
  • All spans have the same predicted Label

  • Confidence of predicted Label is above the Label threshold

  • All spans are on the same line

  • No extraneous characters in between spans

  • A maximum of 5 spaces in between spans

  • The Label type is not one of the following: ‘Number’, ‘Positive Number’, ‘Percentage’, ‘Date’; OR the resulting merge creates a Span normalizable to the same type

Parameters
  • row – Row candidate to be merged to what is already in the buffer.

  • buffer – Previous information.

  • doc_text – Text of the document.

  • max_offset_distance – Maximum distance between two entities that can be merged.

Returns

If the merge is valid or not.

classmethod merge_horizontal(res_dict: Dict, doc_text: str) Dict

Merge contiguous spans with same predicted label.

See more details at https://dev.konfuzio.com/sdk/explanations.html#horizontal-merge

Random Forest Extraction AI

class konfuzio_sdk.trainer.information_extraction.RFExtractionAI(n_nearest: int = 2, first_word: bool = True, n_estimators: int = 100, max_depth: int = 100, no_label_limit: Optional[Union[int, float]] = None, n_nearest_across_lines: bool = False, use_separate_labels: bool = True, category: Optional[konfuzio_sdk.data.Category] = None, tokenizer=None, *args, **kwargs)

Encode visual and textual features to extract text regions.

Fit an extraction pipeline to extract linked Annotations.

Both the Label and Label Set classifiers use a RandomForestClassifier from scikit-learn to run in a low-memory and single-CPU environment. A random forest classifier is a group of decision tree classifiers, see: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

The parameters of this class allow you to select the Tokenizer, to configure the Label and Label Set classifiers, and to select the type of features used by the Label and Label Set classifiers.

They are divided into:
  • tokenizer selection

  • parametrization of the Label classifier

  • parametrization of the Label Set classifier

  • features for the Label classifier

  • features for the Label Set classifier

By default, the text of the Documents is split into smaller chunks based on whitespaces (‘WhitespaceTokenizer’), which means that all words present in the text are shown to the AI. It is possible to split the text into smaller chunks based on regexes learned from the Spans of the Annotations of the Category (‘tokenizer_regex’), or to use a model from the spaCy library for German (‘tokenizer_spacy’). Another option is to use a pre-defined list of tokenizers based on regexes (‘tokenizer_regex_list’) and, on top of the pre-defined list, to create tokenizers that match what is missed by those (‘tokenizer_regex_combination’).

Some parameters of the scikit-learn RandomForestClassifier used for the Label and/or Label Set classifier can be set directly in Konfuzio Server (‘label_n_estimators’, ‘label_max_depth’, ‘label_class_weight’, ‘label_random_state’, ‘label_set_n_estimators’, ‘label_set_max_depth’).

Features are measurable pieces of data of the Annotation. By default, a combination of features is used that includes features built from the text of the Annotation (‘string_features’), features built from the position of the Annotation in the Document (‘spatial_features’) and features from the Spans created by a WhitespaceTokenizer on the left or on the right of the Annotation (‘n_nearest_left’, ‘n_nearest_right’, ‘n_nearest_across_lines’). It is possible to exclude any of them (‘spatial_features’, ‘string_features’, ‘n_nearest_left’, ‘n_nearest_right’) or to specify the number of Spans created by a WhitespaceTokenizer to consider (‘n_nearest_left’, ‘n_nearest_right’).

While extracting, the Label Set classifier takes the predictions from the Label classifier as input. The Label Set classifier groups them into Annotation sets.
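Example (a hedged training sketch; the Project and Category IDs are placeholders, and the attribute names used to pass training data — documents, df_train, label_feature_list — are assumptions based on the feature_function return documented below):

    from konfuzio_sdk.data import Project
    from konfuzio_sdk.tokenizer.regex import WhitespaceTokenizer
    from konfuzio_sdk.trainer.information_extraction import RFExtractionAI

    project = Project(id_=46)                   # assumed Project ID
    category = project.get_category_by_id(63)   # assumed Category ID

    pipeline = RFExtractionAI(category=category, tokenizer=WhitespaceTokenizer())
    pipeline.documents = category.documents()   # assumed attribute for training Documents
    pipeline.df_train, pipeline.label_feature_list = pipeline.feature_function(
        documents=pipeline.documents            # feature_function is documented below
    )
    pipeline.fit()
    pipeline.check_is_ready()
    extracted = pipeline.extract(document=category.test_documents()[0])
    model_path = pipeline.save()                # bz2-compressed pickle, path returned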

check_is_ready()

Check if Tokenizer is set and the classifiers set and trained.

evaluate_clf(use_training_docs: bool = False) konfuzio_sdk.evaluate.ExtractionEvaluation

Evaluate the Label classifier.

evaluate_full(strict: bool = True, use_training_docs: bool = False, use_view_annotations: bool = True) konfuzio_sdk.evaluate.ExtractionEvaluation

Evaluate the full pipeline on the pipeline’s Test Documents.

Parameters
  • strict – Whether to run the evaluation in strict mode.

  • use_training_docs – Bool for whether to evaluate on the training documents instead of testing documents.

Returns

Evaluation object.

evaluate_label_set_clf(use_training_docs: bool = False) konfuzio_sdk.evaluate.ExtractionEvaluation

Evaluate the LabelSet classifier.

evaluate_tokenizer(use_training_docs: bool = False) konfuzio_sdk.evaluate.ExtractionEvaluation

Evaluate the tokenizer.

extract(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Infer information from a given Document.

Parameters

document – Document object

Returns

Document with predicted labels

Raises

  • AttributeError – When no Tokenizer is set.

  • NotFittedError – When the classifier is not fitted.

extract_from_df(df: pandas.core.frame.DataFrame, inference_document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Predict Labels from features.

feature_function(documents: List[konfuzio_sdk.data.Document], no_label_limit: Optional[Union[int, float]] = None, retokenize: Optional[bool] = None, require_revised_annotations: bool = False) Tuple[List[pandas.core.frame.DataFrame], list]

Calculate features per Span of Annotations.

Parameters
  • documents – List of documents to extract features from.

  • no_label_limit – Int or Float to limit number of new annotations to create during tokenization.

  • retokenize – Bool for whether to recreate annotations from scratch or use already existing annotations.

Returns

Dataframe of features and list of feature names.

features(document: konfuzio_sdk.data.Document)

Calculate features using the best working default values that can be overwritten with self values.

filter_dataframe(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame

Filter dataframe rows accordingly with the confidence value.

Rows (extractions) where the confidence value is below the threshold defined for the Label are removed.

Parameters

df – Dataframe with extraction results

Returns

Filtered dataframe

filter_low_confidence_extractions(result: Dict) Dict

Remove extractions with confidence below the threshold defined for the respective label.

The input is a dictionary where the values can be:
  • a DataFrame

  • a dictionary where the values are DataFrames

  • a list of dictionaries where the values are DataFrames

Parameters

result – Extraction results

Returns

Filtered dictionary.

fit() sklearn.ensemble._forest.RandomForestClassifier

Given training data and the feature list, this function returns the trained classifier.

label_train_document(virtual_document: konfuzio_sdk.data.Document, original_document: konfuzio_sdk.data.Document)

Assign labels to Annotations in newly tokenized virtual training document.

merge_vertical(document: konfuzio_sdk.data.Document, only_multiline_labels=True)

Merge Annotations with the same Label.

See more details at https://dev.konfuzio.com/sdk/explanations.html#vertical-merge

Parameters
  • document – Document whose Annotations should be merged vertically

  • only_multiline_labels – Only merge if a multiline Label Annotation is in the Category Training set

property pkl_file_path: str

Generate a path for a resulting pickle file.

property project

Get RFExtractionAI Project.

reduce_model_weight()

Remove all non-strictly necessary parameters before saving.

remove_empty_dataframes_from_extraction(result: Dict) Dict

Remove empty dataframes from the result of an Extraction AI.

The input is a dictionary where the values can be:
  • a DataFrame

  • a dictionary where the values are DataFrames

  • a list of dictionaries where the values are DataFrames

separate_labels(res_dict: Dict) Dict

Undo the renaming of the labels.

In this way, the output of the extraction is in the correct format.

property temp_pkl_file_path: str

Generate a path for temporary pickle file.

Load Saved AI Model

konfuzio_sdk.trainer.information_extraction.load_model(pickle_path: str, max_ram: Optional[str] = None)

Load a pkl file.

Parameters

pickle_path (str) – Path to the pickled model.

Raises
  • FileNotFoundError – If the path is invalid.

  • OSError – When the data is corrupted or invalid and cannot be loaded.

  • TypeError – When the loaded pickle isn’t recognized as a Konfuzio AI model.

Returns

Extraction AI model.
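Example (a usage sketch; the Project ID and pickle path are placeholders):

    from konfuzio_sdk.data import Project
    from konfuzio_sdk.trainer.information_extraction import load_model

    project = Project(id_=46)                  # assumed Project ID
    pipeline = load_model('extraction_ai.pkl')  # placeholder path to a saved model
    result = pipeline.extract(document=project.documents[0])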

Categorization AI


Implements a Categorization Model.

Abstract Categorization AI

class konfuzio_sdk.trainer.document_categorization.AbstractCategorizationAI(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)

Abstract definition of a CategorizationAI.

categorize(document: konfuzio_sdk.data.Document, recategorize: bool = False, inplace: bool = False) konfuzio_sdk.data.Document

Run categorization on a Document.

Parameters
  • document – Input Document

  • recategorize – If the input Document is already categorized, the already present Category is used unless this flag is True.

  • inplace – Option to categorize the provided Document in place, which would assign the Category attribute.

Returns

Copy of the input Document with added CategoryAnnotation information

check_is_ready()

Check if Categorization AI instance is ready for inference.

evaluate(use_training_docs: bool = False) konfuzio_sdk.evaluate.CategorizationEvaluation

Evaluate the full Categorization pipeline on the pipeline’s Test Documents.

Parameters

use_training_docs – Bool for whether to evaluate on the Training Documents instead of Test Documents.

Returns

Evaluation object.

abstract fit() None

Train the Categorization AI.

static has_compatible_interface(other)

Validate that an instance of a Categorization AI implements the same interface as AbstractCategorizationAI.

A Categorization AI should implement methods with the same signature as:
  • AbstractCategorizationAI.__init__

  • AbstractCategorizationAI.fit

  • AbstractCategorizationAI._categorize_page

  • AbstractCategorizationAI.check_is_ready

Parameters

other – An instance of a Categorization AI to compare with.

name_lower()

Convert class name to machine-readable name.

abstract save(output_dir: str, include_konfuzio=True)

Save the model to disk.

Name-based Categorization AI

class konfuzio_sdk.trainer.document_categorization.NameBasedCategorizationAI(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)

A simple, non-trainable model that predicts a Category for a given Document based on a predefined rule.

It checks whether the name of the Category is present in the input Document (case insensitive; also see Category.fallback_name). This can be an effective fallback logic to categorize Documents when no Categorization AI is available.

fit() None

Use as a placeholder function.

save(output_dir: str, include_konfuzio=True)

Use as a placeholder function.
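Example (a usage sketch; the Project ID is a placeholder, and the project.categories attribute is an assumption not documented in this section):

    from konfuzio_sdk.data import Project
    from konfuzio_sdk.trainer.document_categorization import NameBasedCategorizationAI

    project = Project(id_=46)                                # assumed Project ID
    categorization = NameBasedCategorizationAI(categories=project.categories)  # assumed attribute

    document = project.documents[0]
    result = categorization.categorize(document=document)    # returns a copy by default
    print(result.category)                                   # Category found via the name-based rule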

Model-based Categorization AI

class konfuzio_sdk.trainer.document_categorization.CategorizationAI(categories: List[konfuzio_sdk.data.Category], use_cuda: bool = False, *args, **kwargs)

A trainable AI that predicts a Category for each Page of a given Document.

build_document_classifier_iterator(documents, transforms: torchvision.transforms, use_image: bool, use_text: bool, shuffle: bool, batch_size: int, max_len: int, device: torch.device = 'cpu') torch.utils.data.dataloader.DataLoader

Prepare the data necessary for the document classifier, and build the iterators for the data list.

For each document, we split it into pages, and from each page we take:
  • the path to an image of the page

  • the tokenized and numericalized text on the page

  • the label (category) of the page

  • the id of the document

  • the page number

build_preprocessing_pipeline(use_image: bool, image_augmentation=None, image_preprocessing=None) None

Set up the pre-processing and data augmentation when necessary.

build_template_category_vocab() konfuzio_sdk.tokenizer.base.Vocab

Build a vocabulary over the Categories.

build_text_vocab(min_freq: int = 1, max_size: Optional[int] = None) konfuzio_sdk.tokenizer.base.Vocab

Build a vocabulary over the document text.

fit(document_training_config=None, **kwargs) Dict[str, List[float]]

Fit the CategorizationAI classifier.

save(path: Union[None, str] = None) str

Save only the necessary parts of the model for extraction/inference.

Saves:
  • tokenizer (needed to ensure we tokenize inference examples in the same way that they are trained)

  • transforms (to ensure we transform/pre-process images in the same way as in training)

  • vocabs (to ensure the tokens/labels are mapped to the same integers as in training)

  • configs (to ensure we load the same models used in training)

  • state_dicts (the classifier parameters achieved through training)

Build a Model-based Categorization AI

konfuzio_sdk.trainer.document_categorization.build_categorization_ai_pipeline(categories: List[konfuzio_sdk.data.Category], documents: List[konfuzio_sdk.data.Document], test_documents: List[konfuzio_sdk.data.Document], tokenizer: Optional[konfuzio_sdk.tokenizer.base.AbstractTokenizer] = None, image_model: Optional[konfuzio_sdk.trainer.document_categorization.ImageModel] = None, text_model: Optional[konfuzio_sdk.trainer.document_categorization.TextModel] = None, optimizer: konfuzio_sdk.trainer.document_categorization.Optimizer = Optimizer.Adam) konfuzio_sdk.trainer.document_categorization.CategorizationAI

Build a Categorization AI neural network by choosing an ImageModel and a TextModel.

See an in-depth tutorial at https://dev.konfuzio.com/sdk/tutorials.html#model-based-categorization-ai
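Example (a hedged sketch; the Project ID, the project.categories attribute and the ImageModel/TextModel enum members are assumptions):

    from konfuzio_sdk.data import Project
    from konfuzio_sdk.trainer.document_categorization import (
        ImageModel,
        TextModel,
        build_categorization_ai_pipeline,
    )

    project = Project(id_=46)                       # assumed Project ID
    categorization_ai = build_categorization_ai_pipeline(
        categories=project.categories,              # assumed Project attribute
        documents=project.documents,
        test_documents=project.test_documents,
        image_model=ImageModel.EfficientNetB0,      # assumed enum member
        text_model=TextModel.NBOWSelfAttention,     # assumed enum member
    )
    categorization_ai.fit()
    evaluation = categorization_ai.evaluate()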

NBOW Model

class konfuzio_sdk.trainer.document_categorization.NBOW(input_dim: int, emb_dim: int = 64, dropout_rate: float = 0.0, **kwargs)

The neural bag-of-words (NBOW) model is the simplest of models: it passes each token through an embedding layer.

As shown in the fastText paper (https://arxiv.org/abs/1607.01759) this model is still able to achieve comparable performance to some deep learning models whilst being considerably faster.

One downside of this model is that tokens are embedded without regard to the surrounding context in which they appear; e.g. the embedding for “May” in the two sentences “May I speak to you?” and “I am leaving on the 1st of May” is identical, even though the two occurrences have different semantics.

Parameters
  • emb_dim – The dimensions of the embedding vector.

  • dropout_rate – The amount of dropout applied to the embedding vectors.

NBOW Self Attention Model

class konfuzio_sdk.trainer.document_categorization.NBOWSelfAttention(input_dim: int, emb_dim: int = 64, n_heads: int = 8, dropout_rate: float = 0.0, **kwargs)

This is an NBOW model with a multi-headed self-attention layer, which is added after the embedding layer.

See details at https://arxiv.org/abs/1706.03762. The self-attention layer effectively contextualizes the output as now each hidden state is calculated from the embedding vector of a token and the embedding vector of all other tokens within the sequence.

Parameters
  • emb_dim – The dimensions of the embedding vector.

  • dropout_rate – The amount of dropout applied to the embedding vectors.

  • n_heads – The number of attention heads to use in the multi-headed self-attention layer. Note that n_heads must be a factor of emb_dim, i.e. emb_dim % n_heads == 0.

LSTM Model

class konfuzio_sdk.trainer.document_categorization.LSTM(input_dim: int, emb_dim: int = 64, hid_dim: int = 256, n_layers: int = 2, bidirectional: bool = True, dropout_rate: float = 0.0, **kwargs)

The LSTM (long short-term memory) is a variant of an RNN (recurrent neural network).

It feeds the input tokens through an embedding layer and then processes them sequentially with the LSTM, outputting a hidden state for each token. If the LSTM is bi-directional then it trains a forward and backward LSTM per layer and concatenates the forward and backward hidden states for each token.

Parameters
  • emb_dim – The dimensions of the embedding vector.

  • hid_dim – The dimensions of the hidden states.

  • n_layers – How many LSTM layers to use.

  • bidirectional – If the LSTM should be bidirectional.

  • dropout_rate – The amount of dropout applied to the embedding vectors and between LSTM layers if n_layers > 1.

BERT Model

class konfuzio_sdk.trainer.document_categorization.BERT(input_dim: int, name: str = 'bert-base-german-cased', freeze: bool = True, **kwargs)

Wraps around pre-trained BERT-type models from the HuggingFace library.

BERT (bi-directional encoder representations from Transformers) is a family of large Transformer models. The available BERT variants are all pre-trained models provided by the transformers library. It is usually infeasible to train a BERT model from scratch due to the significant amount of computation required. However, the pre-trained models can be easily fine-tuned on desired data.

The BERT variants, i.e. name arguments, that are covered by internal tests are:
  • bert-base-german-cased

  • bert-base-german-dbmdz-cased

  • bert-base-german-dbmdz-uncased

  • distilbert-base-german-cased

In theory, all variants beginning with bert-base-* and distilbert-* should work out of the box. Other BERT variants come with no guarantees.

Parameters
  • name – The name of the pre-trained BERT variant to use.

  • freeze – Should the BERT model be frozen, i.e. the pre-trained parameters are not updated.

get_max_length()

Get the maximum length of a sequence that can be passed to the BERT module.

VGG Model

class konfuzio_sdk.trainer.document_categorization.VGG(name: str = 'vgg11', pretrained: bool = True, freeze: bool = True, **kwargs)

The VGG family of models are image classification models designed for the ImageNet dataset.

They are usually used as a baseline in image classification tasks, however they are considerably larger, in terms of the number of parameters, than modern architectures.

Available variants are: vgg11, vgg13, vgg16, vgg19, vgg11_bn, vgg13_bn, vgg16_bn, vgg19_bn. The number generally indicates the number of layers in the model; higher does not always mean better. The _bn suffix means that the VGG model uses Batch Normalization layers, which generally leads to better results.

The pre-trained weights are taken from the torchvision library (https://github.com/pytorch/vision) and are weights from a model that has been trained as an image classifier on ImageNet. Ideally, this means the images should be 3-channel color images that are at least 224x224 pixels and should be normalized.

Parameters
  • name – The name of the VGG variant to use.

  • pretrained – If pre-trained weights for the VGG variant should be used.

  • freeze – If the parameters of the VGG variant should be frozen.
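
A minimal instantiation sketch; when pretrained=True, the torchvision weights are downloaded on first use.

>>> from konfuzio_sdk.trainer.document_categorization import VGG
>>> image_module = VGG(name='vgg11_bn', pretrained=True, freeze=True)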

EfficientNet Model

class konfuzio_sdk.trainer.document_categorization.EfficientNet(name: str = 'efficientnet_b0', pretrained: bool = True, freeze: bool = True, **kwargs)

EfficientNet is a family of convolutional neural network based models that are designed to be more efficient.

The efficiency comes in terms of the number of parameters and FLOPS compared to previous computer vision models, whilst maintaining equivalent image classification performance.

Available variants are: efficientnet_b0, efficientnet_b1, …, efficientnet_b7, with b0 having the fewest parameters and b7 having the most.

The pre-trained weights are taken from the timm library and have been trained on ImageNet, so the same tips, e.g. normalization, that apply to the VGG models also apply here.

Parameters
  • name – The name of the EfficientNet variant to use.

  • pretrained – If pre-trained weights for the EfficientNet variant should be used.

  • freeze – If the parameters of the EfficientNet variant should be frozen.

get_n_features() int

Calculate the number of output features for the given model.
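
A minimal instantiation sketch; when pretrained=True, the weights are downloaded from the timm library on first use.

>>> from konfuzio_sdk.trainer.document_categorization import EfficientNet
>>> image_module = EfficientNet(name='efficientnet_b0', pretrained=True, freeze=True)
>>> n_features = image_module.get_n_features()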

Multimodal Concatenation

class konfuzio_sdk.trainer.document_categorization.MultimodalConcatenate(n_image_features: int, n_text_features: int, hid_dim: int = 256, output_dim: Optional[int] = None, **kwargs)

Defines how the image and text features are combined in order to yield a categorization prediction.
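
A wiring sketch: the feature dimensions below are assumptions for illustration; in practice they would come from the chosen image and text modules (e.g. via get_n_features() for the image module).

>>> from konfuzio_sdk.trainer.document_categorization import MultimodalConcatenate
>>> fusion = MultimodalConcatenate(n_image_features=512, n_text_features=768, hid_dim=256)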

File Splitting AI

[source]

Process Documents that consist of several files and propose splitting them into Sub-Documents accordingly.

A Context Aware File Splitting Model uses a simple rule-based logic: it scans a Category’s Documents and collects strings that occur exclusively on the first Pages of all Documents within that Category. To predict whether a Page is a potential splitting point (i.e. whether it is a first Page), the Page’s contents are compared to these exclusive first-page strings; if at least one such string occurs, the Page is marked as first (and thus as a splitting point). An instance of the Context Aware File Splitting Model can be used to initially build a File Splitting pipeline and can later be replaced with more complex solutions.
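
The matching rule itself can be illustrated in plain Python; the strings below are invented for illustration, whereas in practice the exclusive strings are gathered by fit().

>>> exclusive_first_page_strings = {'Invoice No.', 'Customer ID'}
>>> page_text = 'Invoice No. 0042 from 2023-01-01'
>>> any(string in page_text for string in exclusive_first_page_strings)
True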

A Context Aware File Splitting Model instance can be used through the interface provided by the Splitting AI: this class accepts a whole Document instead of a single Page and either proposes splitting points or splits the original Document.

A Multimodal File Splitting Model takes both the visual and the textual parts of the Pages and processes them independently via a simplified VGG19 architecture and LegalBERT, passing the resulting outputs together to a Multi-Layer Perceptron. The model’s output is likewise a prediction of a Page being first or non-first.

For developing a custom File Splitting approach, we propose an abstract class.

Abstract File Splitting Model

class konfuzio_sdk.trainer.file_splitting.AbstractFileSplittingModel(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)

Abstract class for the File Splitting model.

abstract fit(*args, **kwargs)

Fit the custom model on the training Documents.

static has_compatible_interface(other) bool

Validate that an instance of a File Splitting Model implements the same interface as AbstractFileSplittingModel.

A File Splitting Model should implement methods with the same signatures as:
  • AbstractFileSplittingModel.__init__

  • AbstractFileSplittingModel.predict

  • AbstractFileSplittingModel.fit

  • AbstractFileSplittingModel.check_is_ready

Parameters

other – An instance of a File Splitting Model to compare with.

property pkl_file_path: str

Generate a path for a resulting pickle file.

Returns

A string with the path.

abstract predict(page: konfuzio_sdk.data.Page) konfuzio_sdk.data.Page

Take a Page as an input and reassign is_first_page attribute’s value if necessary.

Parameters

page (Page) – A Page to label first or non-first.

Returns

Page.

property temp_pkl_file_path: str

Generate a path for a temporary pickle file.

Returns

A string with the path.

Context Aware File Splitting Model

class konfuzio_sdk.trainer.file_splitting.ContextAwareFileSplittingModel(categories: List[konfuzio_sdk.data.Category], tokenizer, *args, **kwargs)

A File Splitting Model that uses a context-aware logic.

Context-aware logic implies a rule-based approach that looks for common strings between the first Pages of all Category’s Documents.

check_is_ready()

Check File Splitting Model is ready for inference.

Raises
  • AttributeError – When no Tokenizer or no Categories were passed.

  • ValueError – When no Categories have _exclusive_first_page_strings.

fit(allow_empty_categories: bool = False, *args, **kwargs)

Gather the strings exclusive for first Pages in a given stream of Documents.

Exclusive means that each of these strings appear only on first Pages of Documents within a Category.

Parameters

allow_empty_categories (bool) – Allow returning an empty list for a Category if no exclusive first-page strings were found during fitting (which means prediction would be impossible for that Category).

Raises

ValueError – When allow_empty_categories is False and no exclusive first-page strings were found for at least one Category.

>>> from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer
>>> from konfuzio_sdk.data import Project
>>> project = Project(id_=46)
>>> tokenizer = ConnectedTextTokenizer()
>>> model = ContextAwareFileSplittingModel(categories=project.categories, tokenizer=tokenizer)
>>> model.fit()

predict(page: konfuzio_sdk.data.Page) konfuzio_sdk.data.Page

Predict a Page as first or non-first.

Parameters

page (Page) – A Page to receive first or non-first label.

Returns

A Page with a newly predicted is_first_page attribute.

>>> from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer
>>> from konfuzio_sdk.data import Project
>>> project = Project(id_=46)
>>> tokenizer = ConnectedTextTokenizer()
>>> test_document = project.get_document_by_id(44865)
>>> model = ContextAwareFileSplittingModel(categories=project.categories, tokenizer=tokenizer)
>>> model.fit()
>>> model.check_is_ready()
>>> model.predict(model.tokenizer.tokenize(test_document).pages()[0]).is_first_page
True

Multimodal File Splitting Model

class konfuzio_sdk.trainer.file_splitting.MultimodalFileSplittingModel(categories: List[konfuzio_sdk.data.Category], text_processing_model: str = 'nlpaueb/legal-bert-base-uncased', scale: int = 2, *args, **kwargs)

Split a multi-Document file into a list of shorter Documents based on the model’s prediction.

We use an approach suggested by Guha et al. (2022) that incorporates steps for accepting separate visual and textual inputs, processing them independently via the VGG19 architecture and the LegalBERT model (essentially a BERT-type architecture trained on domain-specific data), and passing the resulting outputs together to a Multi-Layer Perceptron.

Guha, A., Alahmadi, A., Samanta, D., Khan, M. Z., & Alahmadi, A. H. (2022). A Multi-Modal Approach to Digital Document Stream Segmentation for Title Insurance Domain. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9684474

check_is_ready()

Check if Multimodal File Splitting Model instance is ready for inference.

fit(epochs: int = 10, use_gpu: bool = False, *args, **kwargs)

Process the train and test data, initialize and fit the model.

Parameters
  • epochs (int) – A number of epochs to train a model on.

  • use_gpu (bool) – Run training on GPU if available.

predict(page: konfuzio_sdk.data.Page, use_gpu: bool = False) konfuzio_sdk.data.Page

Run prediction with the trained model.

Parameters
  • page (Page) – A Page to be predicted as first or non-first.

  • use_gpu (bool) – Run prediction on GPU if available.

Returns

A Page with possible changes in is_first_page attribute value.

reduce_model_weight()

Remove all non-strictly necessary parameters before saving.
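
A hedged end-to-end sketch, assuming a Project whose Categories contain training Documents; the Project ID below is reused from the earlier examples as a placeholder.

>>> from konfuzio_sdk.data import Project
>>> from konfuzio_sdk.trainer.file_splitting import MultimodalFileSplittingModel
>>> project = Project(id_=46)
>>> model = MultimodalFileSplittingModel(categories=project.categories)
>>> model.fit(epochs=10, use_gpu=False)
>>> model.check_is_ready()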

Splitting AI

class konfuzio_sdk.trainer.file_splitting.SplittingAI(model)

Split a given Document and return a list of resulting shorter Documents.

evaluate_full(use_training_docs: bool = False) konfuzio_sdk.evaluate.FileSplittingEvaluation

Evaluate the Splitting AI’s performance.

Parameters

use_training_docs (bool) – If enabled, runs evaluation on the training data to define its quality; if disabled, runs evaluation on the test data.

Returns

Evaluation information for the model.

propose_split_documents(document: konfuzio_sdk.data.Document, return_pages: bool = False, inplace: bool = False) List[konfuzio_sdk.data.Document]

Propose a set of resulting Documents from a single Document.

Parameters
  • document (Document) – An input Document to be split.

  • inplace (bool) – Whether changes are applied to the input Document, changing it, or to a deepcopy of it.

  • return_pages (bool) – A flag to enable returning a copy of the old Document with Pages marked .is_first_page on splitting points, instead of a set of Sub-Documents.

Returns

A list of suggested new Sub-Documents built from the original Document, or a list with a Document with Pages marked .is_first_page on splitting points.
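
A usage sketch, assuming project and a fitted model from the Context Aware File Splitting Model examples above; the Document ID is reused from those examples as a placeholder.

>>> from konfuzio_sdk.trainer.file_splitting import SplittingAI
>>> splitting_ai = SplittingAI(model)
>>> document = project.get_document_by_id(44865)
>>> proposed = splitting_ai.propose_split_documents(document, return_pages=False)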

AI Evaluation

[source]

Extraction AI Evaluation

Categorization AI Evaluation

class konfuzio_sdk.evaluate.CategorizationEvaluation(categories: List[konfuzio_sdk.data.Category], documents: List[Tuple[konfuzio_sdk.data.Document, konfuzio_sdk.data.Document]])

Calculated evaluation measures for the classification task of Document categorization.
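
A construction sketch: ground_truth_documents and predicted_documents are illustrative names for two parallel lists of Documents, human-revised and model-categorized respectively, and project is assumed from the earlier examples.

>>> from konfuzio_sdk.evaluate import CategorizationEvaluation
>>> pairs = list(zip(ground_truth_documents, predicted_documents))
>>> evaluation = CategorizationEvaluation(categories=project.categories, documents=pairs)
>>> global_f1 = evaluation.f1(None)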

property actual_classes: List[int]

List of ground truth Category IDs.

calculate()

Calculate and update the data stored within this Evaluation.

property category_ids: List[int]

List of Category IDs as class labels.

property category_names: List[str]

List of Category names as class names.

confusion_matrix() pandas.core.frame.DataFrame

Confusion matrix.

f1(category: Optional[konfuzio_sdk.data.Category]) Optional[float]

Calculate the global F1 Score or filter it by one Category.

fn(category: Optional[konfuzio_sdk.data.Category] = None) int

Return the False Negatives of all Documents.

fp(category: Optional[konfuzio_sdk.data.Category] = None) int

Return the False Positives of all Documents.

get_evaluation_data(search: Optional[konfuzio_sdk.data.Category] = None, allow_zero: bool = True) konfuzio_sdk.evaluate.EvaluationCalculator

Get precision, recall, f1, based on TP, TN, FP, FN.

Parameters
  • search (Category) – A Category to filter for, or None for getting global evaluation results.

  • allow_zero (bool) – If true, will calculate None for precision and recall when the straightforward application of the formula would otherwise result in 0/0; raises ZeroDivisionError otherwise.
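
A hedged usage sketch, continuing the CategorizationEvaluation example above; it assumes the returned EvaluationCalculator exposes the computed measures as precision, recall and f1 attributes.

>>> calculator = evaluation.get_evaluation_data(search=None)
>>> results = (calculator.precision, calculator.recall, calculator.f1)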

precision(category: Optional[konfuzio_sdk.data.Category]) Optional[float]

Calculate the global Precision or filter it by one Category.

property predicted_classes: List[int]

List of predicted Category IDs.

recall(category: Optional[konfuzio_sdk.data.Category]) Optional[float]

Calculate the global Recall or filter it by one Category.

tn(category: Optional[konfuzio_sdk.data.Category] = None) int

Return the True Negatives of all Documents.

tp(category: Optional[konfuzio_sdk.data.Category] = None) int

Return the True Positives of all Documents.

File Splitting AI Evaluation

class konfuzio_sdk.evaluate.FileSplittingEvaluation(ground_truth_documents: List[konfuzio_sdk.data.Document], prediction_documents: List[konfuzio_sdk.data.Document])

Evaluate the quality of the file splitting logic.
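
A construction sketch: ground_truth_docs and predicted_docs are illustrative names for two parallel lists of Documents, the former carrying the correct is_first_page values and the latter carrying the model’s predictions.

>>> from konfuzio_sdk.evaluate import FileSplittingEvaluation
>>> evaluation = FileSplittingEvaluation(ground_truth_documents=ground_truth_docs, prediction_documents=predicted_docs)
>>> global_f1 = evaluation.f1()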

calculate()

Calculate metrics for the file splitting logic.

calculate_metrics_by_category()

Calculate metrics by Category independently.

f1(search: Optional[konfuzio_sdk.data.Category] = None) float

Return F1-measure.

Parameters

search (Category) – display F1 measure within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.

fn(search: Optional[konfuzio_sdk.data.Category] = None) int

Return first Pages incorrectly predicted as non-first.

Parameters

search (Category) – display false negatives within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.

fp(search: Optional[konfuzio_sdk.data.Category] = None) int

Return non-first Pages incorrectly predicted as first.

Parameters

search (Category) – display false positives within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.

get_evaluation_data(search: Optional[konfuzio_sdk.data.Category] = None, allow_zero: bool = True) konfuzio_sdk.evaluate.EvaluationCalculator

Get precision, recall, f1, based on TP, TN, FP, FN.

Parameters
  • search (Category) – A Category to filter for, or None for getting global evaluation results.

  • allow_zero (bool) – If true, will calculate None for precision and recall when the straightforward application of the formula would otherwise result in 0/0; raises ZeroDivisionError otherwise.

precision(search: Optional[konfuzio_sdk.data.Category] = None) float

Return precision.

Parameters

search (Category) – display precision within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.

recall(search: Optional[konfuzio_sdk.data.Category] = None) float

Return recall.

Parameters

search (Category) – display recall within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.

tn(search: Optional[konfuzio_sdk.data.Category] = None) int

Return non-first Pages predicted as non-first.

Parameters

search (Category) – display true negatives within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.

tp(search: Optional[konfuzio_sdk.data.Category] = None) int

Return correctly predicted first Pages.

Parameters

search (Category) – display true positives within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.