API Reference¶
Reference guides are technical descriptions of the machinery and how to operate it. Reference material is information-oriented.
Data¶
Handle data from the API.
Span¶
- class konfuzio_sdk.data.Span(start_offset: int, end_offset: int, annotation: Optional[konfuzio_sdk.data.Annotation] = None, document: Optional[konfuzio_sdk.data.Document] = None, strict_validation: bool = True)
A Span is a sequence of characters or whitespaces without line break.
For more details see https://dev.konfuzio.com/sdk/explanations.html#span-concept
- bbox() konfuzio_sdk.data.Bbox
Calculate the bounding box of a text sequence.
- bbox_dict() Dict
Return Span Bbox info as a serializable Dict format for external integration with the Konfuzio Server.
- eval_dict()
Return any information needed to evaluate the Span.
- static get_sentence_from_spans(spans: Iterable[konfuzio_sdk.data.Span], punctuation=None) List[List[konfuzio_sdk.data.Span]]
Return a list of Spans corresponding to Sentences separated by Punctuation.
- property line_index: int
Return index of the line of the Span.
- property normalized
Normalize the offset string.
- property offset_string: Optional[str]
Calculate the offset string of a Span.
- property page: konfuzio_sdk.data.Page
Return Page of Span.
- regex()
Suggest a Regex for the offset string.
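The sentence grouping performed by get_sentence_from_spans can be illustrated with a minimal plain-Python sketch. This is not the SDK implementation; the helper name and the (start, end) tuple representation of a Span are assumptions for illustration only.

```python
# Illustrative sketch (not the SDK source): group (start, end) offset pairs
# into sentences, closing a sentence after any span whose text ends with
# sentence punctuation.
def group_spans_into_sentences(text, spans, punctuation=(".", "!", "?")):
    sentences, current = [], []
    for start, end in spans:
        current.append((start, end))
        if text[start:end].endswith(punctuation):
            sentences.append(current)
            current = []
    if current:  # trailing spans without closing punctuation
        sentences.append(current)
    return sentences

text = "Hello world. How are you?"
spans = [(0, 5), (6, 12), (13, 16), (17, 20), (21, 25)]
print(group_spans_into_sentences(text, spans))
# → [[(0, 5), (6, 12)], [(13, 16), (17, 20), (21, 25)]]
```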
Bbox¶
- class konfuzio_sdk.data.Bbox(x0: int, x1: int, y0: int, y1: int, page: konfuzio_sdk.data.Page, validation=BboxValidationTypes.ALLOW_ZERO_SIZE)
A bounding box relates to an area of a Document Page.
For more details see https://dev.konfuzio.com/sdk/explanations.html#bbox-concept
What constitutes a valid Bbox changes depending on the value of the validation param. If ALLOW_ZERO_SIZE (default), it allows bounding boxes to have zero width or height. This option is available for compatibility reasons, since some OCR engines can return character-level bboxes with zero width or height. If STRICT, it doesn’t allow zero-size bboxes. If DISABLED, it allows bboxes that have negative size, or coordinates beyond the Page bounds. For the default behaviour see https://dev.konfuzio.com/sdk/tutorials/data_validation/index.html
- Parameters
validation – One of ALLOW_ZERO_SIZE (default), STRICT, or DISABLED.
- property area
Return area covered by the Bbox.
- property bottom
Calculate the distance to the bottom of the Page.
- check_overlap(bbox: Union[konfuzio_sdk.data.Bbox, Dict]) bool
Verify if there’s overlap between two Bboxes.
- property dict_format: Dict
Obtain Bbox data as a dictionary.
- property document: konfuzio_sdk.data.Document
Get the Document the Bbox belongs to.
- classmethod from_image_size(x0, x1, y0, y1, page: konfuzio_sdk.data.Page) konfuzio_sdk.data.Bbox
Create a Bbox from image dimensions, based on the scaling of character Bboxes within the Document.
This method computes the coordinates of the bottom-left and top-right corners in a coordinate system where the y-axis is oriented from bottom to top, the x-axis is from left to right, and the scale is based on the page.
- Parameters
x0 – The x-coordinate of the top-left corner in an image-scaled system.
x1 – The x-coordinate of the bottom-right corner in an image-scaled system.
y0 – The y-coordinate of the top-left corner in an image-scaled system.
y1 – The y-coordinate of the bottom-right corner in an image-scaled system.
page – The Page object for reference in scaling.
- Returns
A Bbox object with rescaled dimensions.
- property top
Calculate the distance to the top of the Page.
- property x0_image
Get the x0 coordinate in the context of the Page image.
- property x1_image
Get the x1 coordinate in the context of the Page image.
- property y0_image
Get the y0 coordinate in the context of the Page image, in a top-down coordinate system.
- property y1_image
Get the y1 coordinate in the context of the Page image, in a top-down coordinate system.
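The check_overlap and from_image_size entries above involve two coordinate systems: per the from_image_size description, character Bboxes use a bottom-up y-axis scaled to the Page, while the *_image properties use a top-down y-axis scaled to the Page image. The sketch below illustrates both ideas in plain Python; the function names and tuple layout are illustrative assumptions, not the SDK source.

```python
# Illustrative sketch (not the SDK implementation) of Bbox overlap and the
# bottom-up Page <-> top-down image coordinate conversion.
def overlaps(a, b):
    """Axis-aligned overlap check for (x0, y0, x1, y1) boxes."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def to_image_coords(x0, y0, x1, y1, page_size, image_size):
    """Map a bottom-up Page bbox to a top-down image bbox."""
    page_w, page_h = page_size
    img_w, img_h = image_size
    sx, sy = img_w / page_w, img_h / page_h
    # Flip the y-axis: the Page origin is bottom-left, the image's is top-left.
    return (x0 * sx, (page_h - y1) * sy, x1 * sx, (page_h - y0) * sy)

print(overlaps((0, 0, 10, 10), (5, 5, 15, 15)))  # → True
print(to_image_coords(10, 20, 30, 40, (100, 200), (200, 400)))
```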
Annotation¶
- class konfuzio_sdk.data.Annotation(document: konfuzio_sdk.data.Document, annotation_set_id: Optional[int] = None, annotation_set: Optional[konfuzio_sdk.data.AnnotationSet] = None, label: Optional[Union[int, konfuzio_sdk.data.Label]] = None, label_set_id: Optional[int] = None, label_set: Union[None, konfuzio_sdk.data.LabelSet] = None, is_correct: bool = False, revised: bool = False, normalized=None, id_: Optional[int] = None, spans=None, accuracy: Optional[float] = None, confidence: Optional[float] = None, created_by: Optional[int] = None, revised_by: Optional[int] = None, translated_string: Optional[str] = None, custom_offset_string: bool = False, offset_string: Optional[str] = None, *args, **kwargs)
Combine one or more Spans and hold the information about the Label, Label Set and Annotation Set assigned to them.
For more details see https://dev.konfuzio.com/sdk/explanations.html#annotation-concept
- add_span(span: konfuzio_sdk.data.Span)
Add a Span to an Annotation incl. a duplicate check per Annotation.
- bbox() konfuzio_sdk.data.Bbox
Get Bbox encompassing all Annotation Spans.
- property bboxes: List[Dict]
Return the Bbox information for all Spans in serialized format.
This is useful for external integration (e.g. Konfuzio Server).
- delete(delete_online: bool = True) None
Delete Annotation.
- Parameters
delete_online – Whether the Annotation is deleted online or only locally.
- property end_offset: int
Legacy: One Annotation can have multiple end offsets.
- property eval_dict: List[dict]
Calculate the Span information to evaluate the Annotation.
- get_link()
Get link to the Annotation in the SmartView.
- property is_multiline: int
Calculate if Annotation spans multiple lines of text.
- property label_set: konfuzio_sdk.data.LabelSet
Return Label Set of Annotation.
- lose_weight()
Delete data of the instance.
- property normalize: str
Provide one normalized offset string due to legacy.
- property offset_string: List[str]
View the string representation of the Annotation.
- property page: konfuzio_sdk.data.Page
Return Page of Annotation.
- regex()
Return regex of this Annotation.
- regex_annotation_generator(regex_list) List[konfuzio_sdk.data.Span]
Build Spans without Labels by regexes.
- Returns
Return sorted list of Spans by start_offset
- save(label_set_id=None, annotation_set_id=None, document_annotations: Optional[list] = None) bool
Save Annotation online.
If there is already an Annotation in the same place as the current one, the current Annotation cannot be saved.
In that case, we get the id_ of the original one to be able to track it. Duplicates are detected by checking whether the offsets and Label match any Annotation online. To be sure that we are comparing with the information online, the Document needs to be up to date. The update can be done after the request (per Annotation), or the updated Annotations can be passed as input to the function (advisable when dealing with big Documents or Documents with many Annotations).
Specify label_set_id if you want to create an Annotation belonging to a new Annotation Set. Specify annotation_set_id if you want to add an Annotation to an existing Annotation Set. Do not specify both of them.
- Parameters
document_annotations – Annotations in the Document (list)
- Returns
True if new Annotation was created
- property spans: List[konfuzio_sdk.data.Span]
Return default entry to get all Spans of the Annotation.
- property start_offset: int
Legacy: One Annotation can have multiple start offsets.
- token_append(new_regex, regex_quality: int)
Append token if it is not a duplicate.
- tokens() List[str]
Create a list of potential tokens based on Spans of this Annotation.
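The multi-Span nature of an Annotation (see offset_string, is_multiline and spans above) can be illustrated with a small plain-Python analogue. The helper names and the (start, end) tuple representation are hypothetical, for illustration only.

```python
# Illustrative analogue (not the SDK class): an Annotation groups several
# (start, end) Spans; its offset string is one string per Span, and it is
# multi-line when its Spans sit on more than one text line.
def offset_strings(text, spans):
    return [text[s:e] for s, e in spans]

def is_multiline(text, spans):
    line_of = lambda offset: text.count("\n", 0, offset)  # line index of an offset
    lines = {line_of(s) for s, e in spans}
    return len(lines) > 1

text = "Name: Jane\nSurname: Doe"
spans = [(6, 10), (20, 23)]
print(offset_strings(text, spans))  # → ['Jane', 'Doe']
print(is_multiline(text, spans))    # → True
```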
Annotation Set¶
- class konfuzio_sdk.data.AnnotationSet(document, label_set: konfuzio_sdk.data.LabelSet, id_: Optional[int] = None, **kwargs)
An Annotation Set is a group of Annotations. The Labels of those Annotations refer to the same Label Set.
For more details see https://dev.konfuzio.com/sdk/explanations.html#annotation-set-concept
- annotations(use_correct: bool = True, ignore_below_threshold: bool = False)
All Annotations currently in this Annotation Set.
- property end_line_index: Optional[int]
Calculate ending line of this Annotation Set.
- property end_offset: Optional[int]
Calculate the end based on all Annotations above detection threshold currently in this AnnotationSet.
- property is_default: bool
Check if AnnotationSet is the default AnnotationSet of the Document.
- property start_line_index: Optional[int]
Calculate starting line of this Annotation Set.
- property start_offset: Optional[int]
Calculate the earliest start based on all Annotations above detection threshold in this AnnotationSet.
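The start_offset and end_offset properties above reduce to a min/max over the Annotations in the set that pass their detection threshold. A minimal sketch, assuming a simple dict representation of Annotations (not the SDK's):

```python
# Sketch of the Annotation Set offset logic described above: take the
# earliest start and the latest end over Annotations above threshold.
def set_offsets(annotations):
    """annotations: list of dicts with 'start', 'end', 'confidence', 'threshold'."""
    kept = [a for a in annotations if a["confidence"] >= a["threshold"]]
    if not kept:
        return None, None
    return min(a["start"] for a in kept), max(a["end"] for a in kept)

anns = [
    {"start": 5, "end": 9, "confidence": 0.9, "threshold": 0.1},
    {"start": 20, "end": 30, "confidence": 0.05, "threshold": 0.1},  # below threshold
    {"start": 12, "end": 15, "confidence": 0.4, "threshold": 0.1},
]
print(set_offsets(anns))  # → (5, 15)
```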
Label¶
- class konfuzio_sdk.data.Label(project: konfuzio_sdk.data.Project, id_: Optional[int] = None, text: Optional[str] = None, get_data_type_display: str = 'Text', text_clean: Optional[str] = None, description: Optional[str] = None, label_sets=None, has_multiple_top_candidates: bool = False, threshold: float = 0.1, *initial_data, **kwargs)
Group Annotations across Label Sets.
For more details see https://dev.konfuzio.com/sdk/explanations.html#label-concept
- add_label_set(label_set: konfuzio_sdk.data.LabelSet)
Add Label Set to label, if it does not exist.
- Parameters
label_set – Label Set to add
- annotations(categories: List[konfuzio_sdk.data.Category], use_correct=True, ignore_below_threshold=False) List[konfuzio_sdk.data.Annotation]
Return related Annotations. Consider that one Label can be used across Label Sets in multiple Categories.
- base_regex(category: konfuzio_sdk.data.Category, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None) str
Find the best combination of regex in the list of all regex proposed by Annotations.
- evaluate_regex(regex, category: konfuzio_sdk.data.Category, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None, regex_quality=0)
Evaluate a regex on Categories.
Type of regex allows you to group regex by generality
- Example:
Three Annotations about the birthdate in two Documents, and one regex to be evaluated:
- 1.doc: “My was born on the 12th of December 1980, you could also say 12.12.1980.” (2 Annotations)
- 2.doc: “My was born on 12.06.1997.” (1 Annotation)
- regex: dd.dd.dddd (without escaped characters, for easier reading)
Resulting stats:
- total_correct_findings: 2
- correct_label_annotations: 3
- total_findings: 2 → precision 100%
- num_docs_matched: 2
- Project.documents: 2 → Document recall 100%
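The arithmetic of this example can be reproduced in a few lines. This is an illustrative reconstruction of the stats, not the SDK's evaluate_regex implementation:

```python
import re

# Reproduce the worked example above: count regex findings across the two
# example Documents and derive precision and Document recall.
docs = [
    "My was born on the 12th of December 1980, you could also say 12.12.1980.",
    "My was born on 12.06.1997.",
]
correct_label_annotations = 3  # total birthdate Annotations across both Documents
pattern = re.compile(r"\d\d\.\d\d\.\d\d\d\d")

findings = [pattern.findall(doc) for doc in docs]
total_findings = sum(len(f) for f in findings)
total_correct_findings = total_findings  # here, every match is a true Annotation
num_docs_matched = sum(1 for f in findings if f)

precision = total_correct_findings / total_findings       # → 1.0 (100%)
document_recall = num_docs_matched / len(docs)            # → 1.0 (100%)
annotation_recall = total_correct_findings / correct_label_annotations  # → 2/3
print(precision, document_recall, annotation_recall)
```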
- find_regex(category: konfuzio_sdk.data.Category, max_findings_per_page=100) List[str]
Find the best combination of regex for Label with before and after context.
- get_probable_outliers(categories: List[konfuzio_sdk.data.Category], regex_search: bool = True, regex_worst_percentage: float = 0.1, confidence_search: bool = True, evaluation_data=None, normalization_search: bool = True) List[konfuzio_sdk.data.Annotation]
Get a list of Annotations that are outliers.
Outliers are determined by any of three checks, or a combination of them: they are found by the worst regexes, have the lowest confidence, and/or cannot be normalized by the data type of the given Label.
- Parameters
categories (List[Category]) – Categories under which the search is done.
regex_search (bool) – Enable search by top worst regexes.
regex_worst_percentage (float) – A % of Annotations returned by the regexes.
confidence_search (bool) – Enable search by the lowest-confidence Annotations.
normalization_search (bool) – Enable search by normalizing Annotations by the Label’s data type.
- Raises
ValueError – When all search options are disabled.
- get_probable_outliers_by_confidence(evaluation_data, confidence: float = 0.5) List[konfuzio_sdk.data.Annotation]
Get a list of Annotations with the lowest confidence.
This method iterates over the list of Categories, returning the top N Annotations with the lowest confidence score.
- Parameters
evaluation_data (ExtractionEvaluation instance) – An instance of the ExtractionEvaluation class that contains predicted confidence scores.
confidence (float) – A level of confidence below which the Annotations are returned.
- get_probable_outliers_by_normalization(categories: List[konfuzio_sdk.data.Category]) List[konfuzio_sdk.data.Annotation]
Get a list of Annotations that do not pass normalization by the data type.
This method iterates over the list of Categories, returning the Annotations that do not fit the data type of the Label (i.e. an attempt to normalize them by the Label’s data type returns None).
- Parameters
categories (List[Category]) – Categories under which the search is done.
- get_probable_outliers_by_regex(categories: List[konfuzio_sdk.data.Category], use_test_docs: bool = False, top_worst_percentage: float = 0.1) List[konfuzio_sdk.data.Annotation]
Get a list of Annotations that are identified by the least precise regular expressions.
This method iterates over the list of Categories and Annotations within each Category, collecting all the regexes associated with them. It then evaluates these regexes and collects the top worst ones (i.e., those with the least True Positives). For each of these top worst regexes, it returns the Annotations found by them but not by the best regex for that label, potentially identifying them as outliers.
To detect outlier Annotations with multi-Spans, the method iterates over all the multi-Span Annotations under the Label and checks each Span that was not detected by the aforementioned worst regexes. If it is not found by any other regex in the Project, the entire Annotation is considered a potential outlier.
- Parameters
categories (List[Category]) – A list of Category objects under which the search is conducted.
use_test_docs (bool) – Indicates whether the evaluation of the regular expressions occurs on test Documents or training Documents.
top_worst_percentage (float) – A threshold for determining what percentage of the worst regexes’ output to return.
- Returns
A list of Annotation objects identified by the least precise regular expressions.
- Return type
List[Annotation]
- has_multiline_annotations(categories: Optional[List[konfuzio_sdk.data.Category]] = None) bool
Return if any Label annotations are multi-line.
- lose_weight()
Delete data of the instance.
- regex(categories: List[konfuzio_sdk.data.Category], update=False) Dict
Calculate regex to be used in the Extraction AI.
- spans(categories: List[konfuzio_sdk.data.Category], use_correct=True, ignore_below_threshold=False) List[konfuzio_sdk.data.Span]
Return all Spans belonging to an Annotation of this Label.
- spans_not_found_by_tokenizer(tokenizer, categories: List[konfuzio_sdk.data.Category], use_correct=False) List[konfuzio_sdk.data.Span]
Find Label Spans that are not found by a tokenizer.
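The idea behind spans_not_found_by_tokenizer can be sketched without the SDK: run a tokenizer over the text and report the labeled offset ranges it does not reproduce exactly. The whitespace tokenizer and tuple representation below are illustrative assumptions, not the SDK's tokenizers.

```python
import re

# Illustrative analogue (hypothetical whitespace tokenizer, not the SDK's):
# report labeled (start, end) Spans the tokenizer does not produce exactly.
def whitespace_token_offsets(text):
    return {(m.start(), m.end()) for m in re.finditer(r"\S+", text)}

def spans_not_found(text, labeled_spans):
    found = whitespace_token_offsets(text)
    return [span for span in labeled_spans if span not in found]

text = "Invoice No. 2023-001 due 2023-12-24"
labeled = [(12, 20), (25, 35), (0, 11)]  # "2023-001", "2023-12-24", "Invoice No."
print(spans_not_found(text, labeled))  # → [(0, 11)]  ("Invoice No." spans two tokens)
```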
Label Set¶
- class konfuzio_sdk.data.LabelSet(project, labels=None, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, is_default=False, categories=None, has_multiple_annotation_sets=False, **kwargs)
A Label Set is a group of Labels.
For more details see https://dev.konfuzio.com/sdk/explanations.html#label-set-concept
- add_category(category: konfuzio_sdk.data.Category)
Add Category to the Label Set, if it does not exist.
- Parameters
category – Category to add to the Label Set
- add_label(label)
Add Label to Label Set, if it does not exist.
- Parameters
label – Label ID to be added
- get_target_names(use_separate_labels: bool)
Get target string name for Annotation Label classification.
Category¶
- class konfuzio_sdk.data.Category(project, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, *args, **kwargs)
Group Documents in a Project.
For more details see https://dev.konfuzio.com/sdk/explanations.html#category-concept
- add_label_set(label_set)
Add Label Set to Category.
- property default_label_set
Get the default Label Set of the Category.
- documents()
Filter for Documents of this Category.
- exclusive_first_page_strings(tokenizer) set
Return a set of strings exclusive for first Pages of Documents within the Category.
- Parameters
tokenizer – A tokenizer to process Documents before gathering strings.
- property fallback_name: str
Turn the Category name to lowercase, remove parentheses along with their contents, and trim spaces.
- property labels
Return the Labels that belong to the Category and its Label Sets.
- test_documents()
Filter for test Documents of this Category.
Category Annotation¶
- class konfuzio_sdk.data.CategoryAnnotation(category: konfuzio_sdk.data.Category, confidence: Optional[float] = None, page: Optional[konfuzio_sdk.data.Page] = None, document: Optional[konfuzio_sdk.data.Document] = None, id_: Optional[int] = None)
Annotate the Category of a Page.
For more details see https://dev.konfuzio.com/sdk/explanations.html#category-annotation-concept
- property confidence: float
Get the confidence of this Category Annotation.
If the confidence was not set, it means it was never predicted by an AI. Thus, the returned value will be 0, unless it was set by a human, in which case it defaults to 1.
- Returns
Confidence between 0.0 and 1.0 included.
- set_revised() None
Set this Category Annotation as revised by human, and thus the correct one for the linked Page.
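The confidence fallback documented above (0 when never predicted, 1 when set by a human) can be sketched as follows. The class and attribute names are hypothetical; only the documented fallback behaviour is taken from the reference.

```python
# Sketch of the confidence fallback described above (hypothetical class
# mirroring the documented behaviour, not the SDK source).
class CategoryAnnotationSketch:
    def __init__(self, confidence=None, set_by_human=False):
        self._confidence = confidence
        self.set_by_human = set_by_human

    @property
    def confidence(self):
        if self._confidence is not None:
            return self._confidence
        # Never predicted by an AI: 0.0, unless a human set it, then 1.0.
        return 1.0 if self.set_by_human else 0.0

print(CategoryAnnotationSketch().confidence)                   # → 0.0
print(CategoryAnnotationSketch(set_by_human=True).confidence)  # → 1.0
print(CategoryAnnotationSketch(confidence=0.73).confidence)    # → 0.73
```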
Document¶
- class konfuzio_sdk.data.Document(project: konfuzio_sdk.data.Project, id_: Optional[int] = None, file_url: Optional[str] = None, status: Optional[List[Union[int, str]]] = None, data_file_name: Optional[str] = None, is_dataset: Optional[bool] = None, dataset_status: Optional[int] = None, updated_at: Optional[str] = None, assignee: Optional[int] = None, category_template: Optional[int] = None, category: Optional[konfuzio_sdk.data.Category] = None, category_confidence: Optional[float] = None, category_is_revised: bool = False, text: Optional[str] = None, bbox: Optional[dict] = None, bbox_validation_type=None, pages: Optional[list] = None, update: bool = False, copy_of_id: Optional[int] = None, *args, **kwargs)
Access the information about one Document, which is available online.
For more details see https://dev.konfuzio.com/sdk/explanations.html#document-concept
- add_annotation(annotation: konfuzio_sdk.data.Annotation)
Add an Annotation to a Document.
The Annotation is only added to the Document if the data validation tests are passing for this Annotation. See https://dev.konfuzio.com/sdk/tutorials/data_validation/index.html
- Parameters
annotation – Annotation to add in the Document
- Returns
Input Annotation.
- add_annotation_set(annotation_set: konfuzio_sdk.data.AnnotationSet)
Add the Annotation Sets to the Document.
- add_page(page: konfuzio_sdk.data.Page)
Add a Page to a Document.
- annotation_sets(label_set: Optional[konfuzio_sdk.data.LabelSet] = None) List[konfuzio_sdk.data.AnnotationSet]
Return Annotation Sets of the Document.
- Parameters
label_set – Label Set for which to filter the Annotation Sets.
- Returns
Annotation Sets of the Document.
- annotations(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = True, ignore_below_threshold: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Annotation]
Filter available annotations.
- Parameters
label – Label for which to filter the Annotations.
use_correct – Whether to filter for correct Annotations.
ignore_below_threshold – Whether to filter out Annotations with confidence below the Label prediction threshold.
- Returns
Annotations in the document.
- property bbox_dict: Dict
Get a dictionary of Document’s character-level Bboxes.
- property bboxes: Dict[int, konfuzio_sdk.data.Bbox]
Use the cached bbox version.
- property category: konfuzio_sdk.data.Category
Return the Category of the Document.
The Category of a Document is only defined as long as all Pages have the same Category. Otherwise, the Document should probably be split into multiple Documents with a consistent Category assignment within their Pages, or the Category for each Page should be manually revised.
- property category_annotations: List[konfuzio_sdk.data.CategoryAnnotation]
Collect Category Annotations and average confidence across all Pages.
- Returns
List of Category Annotations, one for each Category.
- check_annotations(update_document: bool = False) bool
Check if Annotations are valid - no duplicates and correct Category.
- check_bbox() None
Run validation checks on the Document text and bboxes.
This is run when the Document is initialized, and usually it’s not needed to be run again because a Document’s text and bboxes are not expected to change within the Konfuzio Server.
You can run this manually instead if your pipeline allows changing the text or the bbox during the lifetime of a document. Will raise ValueError if the bboxes don’t match with the text of the document, or if bboxes have invalid coordinates (outside page borders) or invalid size (negative width or height).
This check is usually slow, and it can be made faster by calling Document.set_text_bbox_hashes() right after initializing the Document, which will enable running a hash comparison during this check.
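The hash-based shortcut mentioned above can be sketched in plain Python: store digests of the text and bboxes once, then skip the expensive validation whenever the digests are unchanged. The digest helper is an illustrative assumption, not the SDK's set_text_bbox_hashes implementation.

```python
import hashlib

# Sketch of the hash-comparison speedup described above (assumed mechanics).
def digest(*parts):
    h = hashlib.sha256()
    for part in parts:
        h.update(repr(part).encode())
    return h.hexdigest()

text = "Hello"
bboxes = {0: (1.0, 2.0, 3.0, 4.0)}
stored = digest(text, bboxes)  # analogous to Document.set_text_bbox_hashes()

# Later, before re-running the slow checks:
needs_full_check = digest(text, bboxes) != stored
print(needs_full_check)  # → False (nothing changed, full validation can be skipped)
```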
- create_subdocument_from_page_range(start_page: konfuzio_sdk.data.Page, end_page: konfuzio_sdk.data.Page, include=False)
Create a shorter Document from a Page range of an initial Document.
- Parameters
start_page (Page) – A Page that the new sub-Document starts with.
end_page (Page) – A Page that the new sub-Document ends with, if include is True.
include (bool) – Whether end_page is included into the new sub-Document.
- Returns
A new sub-Document.
- property default_annotation_set: konfuzio_sdk.data.AnnotationSet
Return the default Annotation Set of the Document.
- delete(delete_online: bool = False)
Delete Document.
- delete_document_details()
Delete all local content information for the Document.
- property document_folder
Get the path to the folder where all the Document information is cached locally.
- download_document_details()
Retrieve data from a Document online in case Document has finished processing.
Data includes the Document’s status, the URL of its file, its file name, the date of last update, its text and pagination, Annotations and Annotation Sets and, optionally, Category information.
- eval_dict(use_view_annotations=False, use_correct=False, ignore_below_threshold=False) List[dict]
Use this dict to evaluate Documents. The speciality: For every Span of an Annotation create one entry.
- evaluate_regex(regex, label: konfuzio_sdk.data.Label, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None)
Evaluate a regex based on the Document.
- property file_path
Return path to file.
- classmethod from_file(path: str, project: konfuzio_sdk.data.Project, dataset_status: int = 0, category_id: Optional[int] = None, callback_url: str = '', timeout: Optional[int] = None, sync: bool = True) konfuzio_sdk.data.Document
Initialize Document from file with synchronous API call.
This class method will wait for the Document to be processed by the server and then return the new Document, which may take some time. When uploading many Documents, it is advised to set the sync option to False.
- Parameters
path – Path to file to be uploaded
project – Project the Document is uploaded to
dataset_status – Dataset status of the Document (None: 0 Preparation: 1 Training: 2 Test: 3 Excluded: 4)
category_id – Category the Document belongs to (if unset, it will be assigned one by the server)
callback_url – Callback URL receiving POST call once extraction is done
timeout – Number of seconds to wait for response from the server
sync – Whether to wait for the file to be processed by the server
- Returns
New Document
- get_annotation_by_id(annotation_id: int) konfuzio_sdk.data.Annotation
Return an Annotation by ID, searching within the Document.
- Parameters
annotation_id – ID of the Annotation to get.
- get_annotation_set_by_id(id_: int) konfuzio_sdk.data.AnnotationSet
Return an Annotation Set by ID.
- Parameters
id_ – ID of the Annotation Set to get.
- get_annotations() List[konfuzio_sdk.data.Annotation]
Get Annotations of the Document.
- get_bbox() Dict
Get bbox information per character of file. We don’t store bbox as an attribute to save memory.
- Returns
Bounding box information per character in the Document.
- get_bbox_by_page(page_index: int) Dict[str, Dict]
Return a dictionary of all character bboxes on a Page.
- get_file(ocr_version: bool = True, update: bool = False)
Get OCR version of the original file.
- Parameters
ocr_version – Whether to get the OCR version of the original file
update – Update the downloaded file even if it is already available
- Returns
Path to the selected file.
- get_images(update: bool = False)
Get Document Pages as PNG images.
- Parameters
update – Update the downloaded images even if they are already available
- Returns
Path to PNG files.
- get_page_by_id(page_id: int, original: bool = False) konfuzio_sdk.data.Page
Get a Page by its ID.
- Parameters
page_id (int) – An ID of the Page to fetch.
- get_page_by_index(page_index: int)
Return the Page by index.
- get_segmentation(timeout: Optional[int] = None, num_retries: Optional[int] = None) List
Retrieve the segmentation results for the Document.
- Parameters
timeout – Number of seconds to wait for response from the server.
num_retries – Number of retries if the request fails.
- Returns
A list of segmentation results for each Page in the Document.
- get_text_in_bio_scheme(update=False) List[Tuple[str, str]]
Get the text of the Document in the BIO scheme.
- Parameters
update – Update the BIO annotations even if they are already available
- Returns
list of tuples with each word in the text and its respective label
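A BIO ("begin, inside, outside") tagging like the one returned above can be sketched in plain Python. The helper below is hypothetical, not the SDK source: words inside a labeled span get B-<label> for the first word and I-<label> afterwards, all other words get O.

```python
# Illustrative BIO tagging sketch (not the SDK implementation).
def to_bio(words, labeled_spans):
    """words: list of (word, start, end); labeled_spans: list of (start, end, label)."""
    tags = []
    for word, w_start, w_end in words:
        tag = "O"
        for s, e, label in labeled_spans:
            if w_start >= s and w_end <= e:
                tag = ("B-" if w_start == s else "I-") + label
                break
        tags.append((word, tag))
    return tags

words = [("John", 0, 4), ("Doe", 5, 8), ("pays", 9, 13), ("100", 14, 17)]
spans = [(0, 8, "Name"), (14, 17, "Amount")]
print(to_bio(words, spans))
# → [('John', 'B-Name'), ('Doe', 'I-Name'), ('pays', 'O'), ('100', 'B-Amount')]
```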
- lose_weight()
Remove NO_LABEL, wrong and below threshold Annotations.
- property maximum_confidence_category: Optional[konfuzio_sdk.data.Category]
Get the human revised Category of this Document, or the highest confidence one if not revised.
- Returns
The found Category, or None if not present.
- property maximum_confidence_category_annotation: Optional[konfuzio_sdk.data.CategoryAnnotation]
Get the human revised Category Annotation of this Document, or the highest confidence one if not revised.
- Returns
The found Category Annotation, or None if not present.
- property no_label_annotation_set: konfuzio_sdk.data.AnnotationSet
Return the Annotation Set for project.no_label Annotations.
We need to load the Annotation Sets from the Server first (call self.annotation_sets()). If we create the no_label_annotation_set first, the data from the Server will not be loaded anymore, because _annotation_sets will no longer be None.
- property number_of_lines: int
Calculate the number of lines.
- property number_of_pages: int
Calculate the number of Pages.
- property ocr_file_path
Return path to OCR PDF file.
- property ocr_ready
Check if Document OCR is ready.
- pages() List[konfuzio_sdk.data.Page]
Get Pages of Document.
- propose_splitting(splitting_ai) List
Propose splitting for a multi-file Document.
- Parameters
splitting_ai – An initialized SplittingAI class
- save()
Save all local changes to Document to server.
- save_meta_data()
Save local changes to Document metadata to server.
- set_bboxes(characters: Dict[int, konfuzio_sdk.data.Bbox])
Set character Bbox dictionary.
- set_category(category: konfuzio_sdk.data.Category) None
Set the Category of the Document and the Category of all of its Pages as revised.
- set_text_bbox_hashes() None
Update hashes of Document text and bboxes. Can be used for checking later on if any changes happened.
- spans(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Span]
Return all Spans of the Document.
- property text
Get Document text. Once loaded stored in memory.
- update()
Update Document information.
- update_meta_data(assignee: Optional[int] = None, category_template: Optional[int] = None, category: Optional[konfuzio_sdk.data.Category] = None, data_file_name: Optional[str] = None, dataset_status: Optional[int] = None, status: Optional[List[Union[int, str]]] = None, **kwargs)
Update document metadata information.
- view_annotations(start_offset: int = 0, end_offset: Optional[int] = None) List[konfuzio_sdk.data.Annotation]
Get the best Annotations, where the Spans are not overlapping.
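One way to select the "best" non-overlapping Annotations, as view_annotations describes, is a greedy pass in descending confidence. The greedy strategy and the tuple representation below are assumptions for illustration; the SDK may resolve overlaps differently.

```python
# Sketch of picking non-overlapping Annotations greedily by confidence
# (an assumed strategy, not necessarily the SDK's algorithm).
def best_non_overlapping(annotations):
    """annotations: list of (start, end, confidence); highest confidence wins."""
    chosen = []
    for start, end, conf in sorted(annotations, key=lambda a: -a[2]):
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, conf))
    return sorted(chosen)

anns = [(0, 5, 0.9), (3, 8, 0.95), (10, 12, 0.5)]
print(best_non_overlapping(anns))  # → [(3, 8, 0.95), (10, 12, 0.5)]
```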
Page¶
- class konfuzio_sdk.data.Page(id_: Optional[int], document: konfuzio_sdk.data.Document, number: int, original_size: Tuple[float, float], image_size: Tuple[int, int] = (None, None), start_offset: Optional[int] = None, end_offset: Optional[int] = None, copy_of_id: Optional[int] = None)
Access the information about one Page of a Document.
For more details see https://dev.konfuzio.com/sdk/explanations.html#page-concept
- add_category_annotation(category_annotation: konfuzio_sdk.data.CategoryAnnotation)
Annotate a Page with a Category and confidence information.
- annotation_sets() List[konfuzio_sdk.data.AnnotationSet]
Show all Annotation Sets related to Annotations of the Page.
- annotations(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = True, ignore_below_threshold: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Annotation]
Get Page Annotations.
- property category: Optional[konfuzio_sdk.data.Category]
Get the Category of the Page, based on human revised Category Annotation, or on highest confidence.
- get_annotations_image(display_all: bool = False) PIL.Image.Image
Get Document Page as PNG with Annotations shown.
- get_bbox()
Get bbox information per character of Page.
- get_category_annotation(category, add_if_not_present: bool = False) konfuzio_sdk.data.CategoryAnnotation
Retrieve the Category Annotation associated with a specific Category within this Page.
If no Category Annotation is found for the provided Category, one can be created based on the add_if_not_present argument.
- Parameters
category (Category) – The Category for which to retrieve the Category Annotation.
add_if_not_present (bool) – If True, a Category Annotation will be added to the current Page if none is found. If False, a dummy Category Annotation will be created, not linked to any Document or Page.
- Returns
The located or newly created Category Annotation.
- Return type
CategoryAnnotation
- get_image(update: bool = False) PIL.Image.Image
Get Page as a Pillow Image object.
The Page image is loaded from a PNG file at Page.image_path. If the file is not present, or if update is True, it will be downloaded from the Konfuzio Host. Alternatively, if you don’t want to use a file, you can provide the image as bytes to Page.image_bytes. Then call this method to convert the bytes into a Pillow Image. In every case, the return value of this method and the attribute Page.image will be a Pillow Image.
- Parameters
update – Whether to force download the Page PNG file.
- Returns
A Pillow Image object for this Page’s image.
- get_original_page() konfuzio_sdk.data.Page
Return an “original” Page in case the current Page is a copy without an ID.
An “original” Page is a Page from the Document that is not a copy and not a Virtual Document. This Page has an ID.
The method is used in the File Splitting pipeline to retain the original Document’s information in the Sub-Documents that were created from its splitting. The original Document is a Document that has an ID and is not a deepcopy.
- lines() List[konfuzio_sdk.data.Span]
Return sorted list of Spans for each line in the Page.
- property maximum_confidence_category_annotation: Optional[konfuzio_sdk.data.CategoryAnnotation]
Get the human revised Category Annotation of this Page, or the highest confidence one if not revised.
- Returns
The found Category Annotation, or None if not present.
- property number_of_lines: int
Calculate the number of lines in Page.
- set_category(category: konfuzio_sdk.data.Category) None
Set the Category of the Page.
- Parameters
category – The Category to set for the Page.
- spans(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Span]
Return all Spans of the Page.
- property text
Get Document text corresponding to the Page.
- view_annotations() List[konfuzio_sdk.data.Annotation]
Get the best Annotations, where the Spans are not overlapping in Page.
Project¶
- class konfuzio_sdk.data.Project(id_: Optional[int], project_folder=None, update=False, max_ram=None, strict_data_validation: bool = True, credentials: dict = {}, **kwargs)
Access the information of a Project.
For more details see https://dev.konfuzio.com/sdk/explanations.html#project-concept
- add_category(category: konfuzio_sdk.data.Category)
Add Category to Project if it does not exist.
- Parameters
category – Category to add in the Project
- add_document(document: konfuzio_sdk.data.Document)
Add Document to Project, if it does not exist.
- add_label(label: konfuzio_sdk.data.Label)
Add Label to Project, if it does not exist.
- Parameters
label – Label to add in the Project
- add_label_set(label_set: konfuzio_sdk.data.LabelSet)
Add Label Set to Project, if it does not exist.
- Parameters
label_set – Label Set to add in the Project
- property ai_models
Return all AIs.
- create_project_metadata_dict() Dict
Create a dictionary that mimics the file categories_label_sets.json5 saved in the Project folder for restoring Projects within Bento containers.
- del_document_by_id(document_id: int, delete_online: bool = False) konfuzio_sdk.data.Document
Delete Document by its ID.
- delete()
Delete the Project folder.
- property documents
Return Documents with status training.
- property documents_folder: str
Calculate the documents folder of the Project.
- download_project_data(training_test_only=True, category_id=None) None
Migrate your Project to another HOST.
See https://dev.konfuzio.com/web/migration-between-konfuzio-server-instances/index.html
- property excluded_documents
Return Documents which have been excluded.
- export_project_data(include_ais=False, training_and_test_documents=True, documents_with_status=False, *args, **kwargs) None
Export the Project data including Training, Test Documents and AI models.
- Parameters
include_ais – Whether to include AI models in the export.
training_and_test_documents – Whether to include Training and Test Documents in the export.
- get(update=False, **kwargs)
Access meta information of the Project.
- Parameters
update – Update the downloaded information even if it is already available
- get_categories(reload: bool = True)
Load Categories for all Label Sets in the Project.
- get_category_by_id(id_: int) konfuzio_sdk.data.Category
Return a Category by ID.
- Parameters
id – ID of the Category to get.
- get_category_by_name(category_name: Optional[str] = None) List[konfuzio_sdk.data.Category]
Get Category by the match to its name or clean_name.
- Parameters
category_name – A string to search the Category.
- Returns
A Category that has a matching name or a clean name.
- get_credentials(key)
Return the value of the key in the credentials dict or in the config file.
Returns None and emits a warning if the key is not found.
- Parameters
key – Key of the credential to get.
- get_document_by_id(document_id: int) konfuzio_sdk.data.Document
Return Document by its ID.
- get_label_by_id(id_: int) konfuzio_sdk.data.Label
Return a Label by ID.
- Parameters
id – ID of the Label to get.
- get_label_by_name(name: str) konfuzio_sdk.data.Label
Return Label by its name.
- get_label_set_by_id(id_: int) konfuzio_sdk.data.LabelSet
Return a Label Set by ID.
- Parameters
id – ID of the Label Set to get.
- get_label_set_by_name(name: str) konfuzio_sdk.data.LabelSet
Return a Label Set by name.
- get_label_sets(reload=False)
Get LabelSets in the Project.
- get_labels(reload=False) List[konfuzio_sdk.data.Label]
Get ID and name of any Label in the Project.
- get_meta(reload=False)
Get the list of all Documents in the Project and their information.
- Returns
Information of the Documents in the Project.
- init_or_update_document(from_online=False, category_id=None)
Initialize or update Documents from local files to then decide about full, incremental or no update.
- Parameters
from_online – If True, all Document metadata info is first reloaded with latest changes in the server
- property label_sets
Return Project LabelSets.
- property labels
Return Project Labels.
- lose_weight()
Delete data of the instance.
- property max_ram
Return maximum memory used by AI models.
- property meta_data
Return Project meta data.
- property model_folder: str
Calculate the model folder of the Project.
- property no_status_documents
Return Documents with no status.
- property online_documents_dict: Dict
Return a dictionary of online documents using their id as key.
- property preparation_documents
Return Documents with status preparation.
- property project_folder: str
Calculate the data folder of the Project.
- property regex_folder: str
Calculate the regex folder of the Project.
- property test_documents
Return Documents with status test.
- property virtual_documents
Return Documents created virtually.
- write_meta_of_files(*args, **kwargs)
Overwrite meta-data of Documents in Project.
- write_project_files(*args, **kwargs)
Overwrite files with Project, Label, Label Set information.
API call wrappers¶
Connect to the Konfuzio Server to receive or send data.
TimeoutHTTPAdapter¶
- class konfuzio_sdk.api.TimeoutHTTPAdapter(timeout, *args, **kwargs)
Combine a retry strategy with a timeout strategy.
- build_response(req, resp)
Throw error for any HTTPError that is not part of the retry strategy.
- send(request, *args, **kwargs)
Use timeout policy if not otherwise declared.
- konfuzio_sdk.api.init_env(user: str, password: str, host: str = 'https://app.konfuzio.com', working_directory='/builds/konfuzio/dev', file_ending: str = '.env')¶
Add the .env file to the working directory.
- Parameters
user – Username to log in to the host
password – Password to log in to the host
host – URL of host.
working_directory – Directory where file should be added
file_ending – Ending of file.
- konfuzio_sdk.api.konfuzio_session(token: Optional[str] = None, timeout: Optional[int] = None, num_retries: Optional[int] = None, host: Optional[str] = None)¶
Create a session incl. Token to the KONFUZIO_HOST.
- Parameters
token – Konfuzio Token to connect to the host.
timeout – Timeout in seconds.
num_retries – Number of retries if the request fails.
host – Host to connect to.
- Returns
Request session.
- konfuzio_sdk.api.get_project_list(session=None)¶
Get the list of all Projects for the user.
- Parameters
session – Konfuzio session with Retry and Timeout policy
- Returns
Response object
- konfuzio_sdk.api.get_project_details(project_id: int, session=None) dict ¶
Get Project’s metadata.
- Parameters
project_id – ID of the Project
session – Konfuzio session with Retry and Timeout policy
- Returns
Project metadata
- konfuzio_sdk.api.get_project_labels(project_id: int, session=None) dict ¶
Get Project’s Labels.
- Parameters
project_id – An ID of a Project to get Labels from.
session – Konfuzio session with Retry and Timeout policy
- konfuzio_sdk.api.get_project_label_sets(project_id: int, session=None) dict ¶
Get Project’s Label Sets.
- Parameters
project_id – An ID of a Project to get Label Sets from.
session – Konfuzio session with Retry and Timeout policy
- konfuzio_sdk.api.create_new_project(project_name, session=None)¶
Create a new Project for the user.
- Parameters
project_name – name of the project you want to create
session – Konfuzio session with Retry and Timeout policy
- Returns
Response object
- konfuzio_sdk.api.get_document_details(document_id: int, session=None)¶
Use the text-extraction server to retrieve the data from a document.
- Parameters
document_id – ID of the document
session – Konfuzio session with Retry and Timeout policy
- Returns
Data of the document.
- konfuzio_sdk.api.get_document_annotations(document_id: int, session=None)¶
Get Annotations of a Document.
- Parameters
document_id – ID of the Document.
session – Konfuzio session with Retry and Timeout policy
- Returns
List of the Annotations of the Document.
- konfuzio_sdk.api.get_document_bbox(document_id: int, session=None)¶
Get Bboxes for a Document.
- Parameters
document_id – ID of the Document.
session – Konfuzio session with Retry and Timeout policy
- Returns
List of Bboxes of characters in the Document
- konfuzio_sdk.api.get_page_image(document_id: int, page_number: int, session=None, thumbnail: bool = False)¶
Load image of a Page as Bytes.
- Parameters
document_id – ID of the Document
page_number – Number of the Page
thumbnail – Download Page image as thumbnail
session – Konfuzio session with Retry and Timeout policy
- Returns
Bytes of the Image.
- konfuzio_sdk.api.post_document_annotation(document_id: int, spans: List, label_id: int, confidence: Optional[float] = None, revised: bool = False, is_correct: bool = False, session=None, **kwargs)¶
Add an Annotation to an existing document.
You must specify either annotation_set_id or label_set_id.
Use annotation_set_id if an Annotation Set already exists. You can find the list of existing Annotation Sets by using the GET endpoint of the Document.
Using label_set_id will create a new Annotation Set associated with that Label Set. You can only do this if the Label Set has has_multiple_sections set to True.
- Parameters
document_id – ID of the file
spans – Spans that constitute the Annotation
label_id – ID of the Label
confidence – Confidence of the Annotation (still called Accuracy by the text-annotation server)
revised – If the Annotation is revised or not (bool)
is_correct – If the Annotation is corrected or not (bool)
session – Konfuzio session with Retry and Timeout policy
- Returns
Response status.
- konfuzio_sdk.api.change_document_annotation(annotation_id: int, session=None, **kwargs)¶
Change something about an Annotation.
- Parameters
annotation_id – ID of an Annotation to be changed
session – Konfuzio session with Retry and Timeout policy
- Returns
Response status.
- konfuzio_sdk.api.delete_document_annotation(annotation_id: int, session=None, delete_from_database: bool = False, **kwargs)¶
Delete a given Annotation of the given document.
For AI training purposes, we recommend setting delete_from_database to False if you don’t want to remove the Annotation permanently. This creates a negative feedback Annotation instead of removing it from the database.
- Parameters
annotation_id – ID of the annotation
session – Konfuzio session with Retry and Timeout policy
- Returns
Response status.
- konfuzio_sdk.api.update_document_konfuzio_api(document_id: int, session=None, **kwargs)¶
Update an existing Document via Konfuzio API.
- Parameters
document_id – ID of the document
session – Konfuzio session with Retry and Timeout policy
- Returns
Response status.
- konfuzio_sdk.api.download_file_konfuzio_api(document_id: int, ocr: bool = True, session=None)¶
Download file from the Konfuzio server using the Document id_.
Django authentication is form-based, whereas DRF uses BasicAuth.
- Parameters
document_id – ID of the document
ocr – Bool to get the ocr version of the document
session – Konfuzio session with Retry and Timeout policy
- Returns
The downloaded file.
- konfuzio_sdk.api.get_results_from_segmentation(doc_id: int, project_id: int, session=None) List[List[dict]] ¶
Get bbox results from segmentation endpoint.
- Parameters
doc_id – ID of the document
project_id – ID of the Project.
session – Konfuzio session with Retry and Timeout policy
- konfuzio_sdk.api.get_project_categories(project_id: Optional[int] = None, session=None) List[Dict] ¶
Get a list of Categories of a Project.
- Parameters
project_id – ID of the Project.
session – Konfuzio session with Retry and Timeout policy
- konfuzio_sdk.api.upload_ai_model(ai_model_path: str, project_id: Optional[int] = None, category_id: Optional[int] = None, session=None)¶
Upload an ai_model to the text-annotation server.
- Parameters
ai_model_path – Path to the ai_model
project_id – An ID of a Project to which the AI is uploaded. Needed for the File Splitting and Categorization AIs because they function on a Project level.
category_id – An ID of a Category on which the AI is trained. Needed for the Extraction AI because it functions on a Category level and requires a single Category.
session – session to connect to the server
- Raises
ValueError when neither project_id nor category_id is specified.
- Raises
HTTPError when a request is unsuccessful.
- konfuzio_sdk.api.delete_ai_model(ai_model_id: int, ai_type: str, session=None)¶
Delete an AI model from the server.
- Parameters
ai_model_id – an ID of the model to be deleted.
ai_type – Should be one of the following: ‘filesplitting’, ‘extraction’, ‘categorization’.
session – session to connect to the server.
- Raises
ValueError if ai_type is not correctly specified.
- Raises
ConnectionError when a request is unsuccessful.
- konfuzio_sdk.api.update_ai_model(ai_model_id: int, ai_type: str, patch: bool = True, session=None, **kwargs)¶
Update an AI model from the server.
- Parameters
ai_model_id – an ID of the model to be updated.
ai_type – Should be one of the following: ‘filesplitting’, ‘extraction’, ‘categorization’.
patch – If true, adds info instead of replacing it.
session – session to connect to the server.
- Raises
ValueError if ai_type is not correctly specified.
- Raises
HTTPError when a request is unsuccessful.
- konfuzio_sdk.api.get_all_project_ais(project_id: int, session=None) dict ¶
Fetch all types of AIs for a specific project.
- Parameters
project_id – ID of the Project
session – Konfuzio session with Retry and Timeout policy
- Returns
Dictionary with lists of all AIs for a specific project
- konfuzio_sdk.api.export_ai_models(project, session=None, category_id=None) int ¶
Export all AI Model files for a specific Project.
- Parameters
project – Konfuzio Project
category_id – Only select AIs for a specific Category of a Project
- Returns
Number of exported AIs
CLI tools¶
Command Line interface to the konfuzio_sdk package.
- konfuzio_sdk.cli.parse_args(parser)¶
Parse command line arguments using sub-parsers for each command.
- konfuzio_sdk.cli.credentials(args)¶
Retrieve user input or use CLI arguments.
Extras¶
Initialize AI-related dependencies safely.
PackageWrapper¶
- class konfuzio_sdk.extras.PackageWrapper(package_name: str, required_for_modules: Optional[List[str]] = None)
Heavy dependencies are encapsulated and handled if they are not part of the lightweight SDK installation.
ModuleWrapper¶
- class konfuzio_sdk.extras.ModuleWrapper(module: str)
Handle missing dependencies’ classes to avoid metaclass conflict.
Normalization¶
Convert the Span according to the data_type of the Annotation.
- konfuzio_sdk.normalize.normalize_to_float(offset_string: str) Optional[float] ¶
Given an offset_string: str this function tries to translate the offset-string to a number.
- konfuzio_sdk.normalize.normalize_to_positive_float(offset_string: str) Optional[float] ¶
Given an offset_string this function tries to translate the offset-string to an absolute number (ignores +/-).
- konfuzio_sdk.normalize.normalize_to_percentage(offset_string: str) Optional[float] ¶
Given an offset_string, this function tries to translate the offset-string to a percentage: a float between 0 and 1.
- konfuzio_sdk.normalize.normalize_to_date(offset_string: str) Optional[str] ¶
Given an offset_string, this function tries to translate the offset-string to a date in the format ‘DD.MM.YYYY’.
- konfuzio_sdk.normalize.normalize_to_bool(offset_string: str)¶
Given an offset_string this function tries to translate the offset-string to a bool.
- konfuzio_sdk.normalize.roman_to_float(offset_string: str) Optional[float] ¶
Convert a Roman numeral to a number.
- konfuzio_sdk.normalize.normalize(offset_string, data_type)¶
Wrap all normalize functionality.
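As an illustration of what these normalizers do, the date case can be approximated with stdlib parsing. This is a minimal sketch, not the real implementation: the set of input formats konfuzio_sdk.normalize.normalize_to_date actually supports is an assumption here.

```python
from datetime import datetime
from typing import Optional


def normalize_to_date_sketch(offset_string: str) -> Optional[str]:
    """Translate a raw offset string into 'DD.MM.YYYY', or None if unparseable.

    Simplified stand-in for konfuzio_sdk.normalize.normalize_to_date; the
    real normalizer handles many more formats and locales.
    """
    candidate_formats = ['%d.%m.%Y', '%d.%m.%y', '%Y-%m-%d', '%d/%m/%Y']
    for fmt in candidate_formats:
        try:
            # Re-emit any successfully parsed date in the documented target format.
            return datetime.strptime(offset_string.strip(), fmt).strftime('%d.%m.%Y')
        except ValueError:
            continue
    return None
```

A string that matches none of the candidate formats yields None, mirroring the Optional[str] return type documented above.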
Utils¶
Utils for the konfuzio sdk package.
- konfuzio_sdk.utils.sdk_isinstance(instance, klass)¶
Implement a custom isinstance which is compatible with cloudpickle saving by value.
When using cloudpickle with “register_pickle_by_value” the classes of “konfuzio.data” will be loaded in the “types” module. For this case the builtin method “isinstance” will return False because it tries to compare “types.Document” with “konfuzio_sdk.data.Document”.
- konfuzio_sdk.utils.exception_or_log_error(msg: str, handler: str = 'sdk', fail_loudly: typing.Optional[bool] = True, exception_type: typing.Optional[typing.Type[Exception]] = <class 'ValueError'>) None ¶
Log error or raise an exception.
This function is needed to control error handling in production. If fail_loudly is set to True, the function raises an exception of type exception_type with a message in the format `{"message": msg, "handler": handler}`. If fail_loudly is set to False, the function logs an error with msg using the logger.
- Parameters
msg – (str): The error message to be logged or raised.
handler – (str): The handler associated with the error. Defaults to “sdk”
fail_loudly – A flag indicating whether to raise an exception or log the error. Defaults to True.
exception_type – The type of exception to be raised. Defaults to ValueError.
- Returns
None
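The documented contract of exception_or_log_error can be re-implemented in a few lines; this sketch follows the message format described above, while the real function lives in konfuzio_sdk.utils.

```python
import logging
from typing import Optional, Type

logger = logging.getLogger(__name__)


def exception_or_log_error(
    msg: str,
    handler: str = 'sdk',
    fail_loudly: Optional[bool] = True,
    exception_type: Optional[Type[Exception]] = ValueError,
) -> None:
    """Raise exception_type with a structured message, or log the error."""
    if fail_loudly:
        # The payload format matches the one documented above.
        raise exception_type({'message': msg, 'handler': handler})
    logger.error(msg)
```

In production code a caller would set fail_loudly=False to degrade gracefully while keeping the error visible in the logs.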
- konfuzio_sdk.utils.get_id(include_time: bool = False) str ¶
Generate a unique ID.
- Parameters
include_time – Bool to include the time in the unique ID
- Returns
Unique ID
- konfuzio_sdk.utils.is_file(file_path, raise_exception=True, maximum_size=100000000, allow_empty=False) bool ¶
Check if file is available or raise error if it does not exist.
- Parameters
file_path – Path to the file to be checked
raise_exception – Will raise an exception if file is not available
maximum_size – Maximum size of the expected file, default < 100 mb
allow_empty – Bool to allow empty files
- Returns
True or false depending on the existence of the file
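The checks is_file performs can be sketched with os.path; exact edge-case behavior (e.g. which exception type is raised) is an assumption in this stand-in.

```python
import os


def is_file_sketch(file_path: str, raise_exception: bool = True,
                   maximum_size: int = 100_000_000, allow_empty: bool = False) -> bool:
    """Check that a file exists, is non-empty (unless allowed) and below maximum_size."""
    if os.path.isfile(file_path):
        size = os.path.getsize(file_path)
        if (size > 0 or allow_empty) and size < maximum_size:
            return True
    if raise_exception:
        raise FileNotFoundError(f'File invalid or not available: {file_path}')
    return False
```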
- konfuzio_sdk.utils.memory_size_of(obj) int ¶
Return memory size of object in bytes.
- konfuzio_sdk.utils.normalize_memory(memory: Union[None, str]) Optional[int] ¶
Return memory size in human-readable form to int of number of bytes.
- Parameters
memory – Memory size in human readable form (e.g. “50MB”).
- Returns
int of bytes if valid, else None
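A minimal sketch of the conversion described above, assuming binary (1024-based) units and the KB/MB/GB suffixes; the real function's accepted units and base are not specified here, so treat both as assumptions.

```python
import re
from typing import Optional, Union

UNIT_MULTIPLIERS = {'KB': 1024, 'MB': 1024 ** 2, 'GB': 1024 ** 3}


def normalize_memory_sketch(memory: Union[None, str, int]) -> Optional[int]:
    """Convert a human-readable memory size such as '50MB' to bytes, else None."""
    if isinstance(memory, int):
        return memory
    if isinstance(memory, str):
        match = re.fullmatch(r'\s*(\d+)\s*([KMG]B)\s*', memory, flags=re.IGNORECASE)
        if match:
            return int(match.group(1)) * UNIT_MULTIPLIERS[match.group(2).upper()]
    return None
```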
- konfuzio_sdk.utils.get_timestamp(konfuzio_format='%Y-%m-%d-%H-%M-%S') str ¶
Return formatted timestamp.
- Parameters
konfuzio_format – Format of the timestamp (e.g. year-month-day-hour-min-sec)
- Returns
Timestamp
- konfuzio_sdk.utils.load_image(input_file: Union[str, _io.BytesIO])¶
Load an image by path or via io.Bytes, e.g. via download by URL.
- Parameters
input_file – Path to image or image in bytes format
- Returns
Loaded image
- konfuzio_sdk.utils.get_file_type(input_file: Optional[Union[str, _io.BytesIO, bytes]] = None) str ¶
Get the type of a file.
- Parameters
input_file – Path to the file or file in bytes format
- Returns
Name of file type
- konfuzio_sdk.utils.get_file_type_and_extension(input_file: Optional[Union[str, _io.BytesIO, bytes]] = None) Tuple[str, str] ¶
Get the type of a file via the filetype library, which checks the magic bytes to see the internal format.
- Parameters
input_file – Path to the file or file in bytes format
- Returns
Name of file type
- konfuzio_sdk.utils.does_not_raise()¶
Serve as a complement to raise: a no-op context manager, does_not_raise.
See https://docs.pytest.org/en/latest/example/parametrize.html#parametrizing-conditional-raising
- konfuzio_sdk.utils.convert_to_bio_scheme(document) List[Tuple[str, str]] ¶
Mark all the entities in the text as per the BIO scheme.
The splitting uses the sequence of words, treating characters like “.” as separate tokens.
“Hello” -> O, “,” -> O, “it” -> O, “‘s” -> O, “Helm” -> B-ORG, “und” -> I-ORG, “Nagel” -> I-ORG, “.” -> O
- Parameters
document – Document to be converted into the bio scheme
- Returns
list of tuples with each word in the text and its respective Label
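The BIO tagging shown in the example above can be sketched over plain token lists. The span representation (start index, exclusive end index, label) is a simplification: the real convert_to_bio_scheme derives tokens and entity spans from a Document.

```python
from typing import List, Tuple


def to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Tag each token with O, or B-<label>/I-<label> inside an annotated span."""
    tags = ['O'] * len(tokens)
    for start, end, label in spans:
        tags[start] = f'B-{label}'           # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f'I-{label}'           # continuation tokens
    return list(zip(tokens, tags))


# Reproducing the documented example, with "Helm und Nagel" annotated as ORG:
tokens = ['Hello', ',', 'it', "'s", 'Helm', 'und', 'Nagel', '.']
tagged = to_bio(tokens, [(4, 7, 'ORG')])
```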
- konfuzio_sdk.utils.slugify(value)¶
Taken from https://github.com/django/django/blob/master/django/utils/text.py.
Convert to ASCII if ‘allow_unicode’ is False. Convert spaces or repeated dashes to single dashes. Remove characters that aren’t alphanumerics, underscores, or hyphens. Convert to lowercase. Also strip leading and trailing whitespace, dashes, and underscores.
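Following the cited Django implementation, the behavior with allow_unicode=False fixed can be sketched as:

```python
import re
import unicodedata


def slugify(value: str) -> str:
    """ASCII-fold, lowercase, drop non-alphanumerics, collapse whitespace/dashes to '-'."""
    # Decompose accented characters and drop anything outside ASCII.
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    # Remove characters that aren't alphanumerics, underscores, hyphens or whitespace.
    value = re.sub(r'[^\w\s-]', '', value.lower())
    # Collapse runs of whitespace/dashes, then strip leading/trailing '-' and '_'.
    return re.sub(r'[-\s]+', '-', value).strip('-_')
```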
- konfuzio_sdk.utils.amend_file_name(file_name: str, append_text: str = '', append_separator: str = '_', new_extension: Optional[str] = None) str ¶
Append text to a filename in front of extension.
example found here: https://stackoverflow.com/a/37487898
- Parameters
file_name – Name of a file, e.g. file.pdf
append_text – Text to append between the file name and the extension
new_extension – Change the file extension
- Returns
extended path to file
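The behavior described can be sketched with os.path.splitext; edge cases such as files without an extension are handled here by assumption, not per the real helper.

```python
import os
from typing import Optional


def amend_file_name(file_name: str, append_text: str = '',
                    append_separator: str = '_',
                    new_extension: Optional[str] = None) -> str:
    """Insert append_text between the file name and its extension."""
    stem, extension = os.path.splitext(file_name)
    if new_extension is not None:
        extension = new_extension
    if append_text:
        stem = f'{stem}{append_separator}{append_text}'
    return f'{stem}{extension}'
```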
- konfuzio_sdk.utils.amend_file_path(file_path: str, append_text: str = '', append_separator: str = '_', new_extension: Optional[str] = None)¶
Similar to amend_file_name however the file_name is interpreted as a full path.
- Parameters
file_path – Path to a file, e.g. file.pdf
append_text – Text to append between the file name and the extension
new_extension – Change the file extension
- Returns
extended path to file
- konfuzio_sdk.utils.get_sentences(text: str, offsets_map: Optional[dict] = None, language: str = 'german') List[dict] ¶
Split a text into sentences using the sentence tokenizer from the package nltk.
- Parameters
text – Text to split into sentences
offsets_map – mapping between the position of the character in the input text and the offset in the text of the document
language – language of the text
- Returns
List with a dict per sentence, containing its text and its start and end offsets in the text of the document.
- konfuzio_sdk.utils.map_offsets(characters_bboxes: list) dict ¶
Map the position of the character to its offset.
E.g. characters: x, y, z, w; character offsets: 2, 3, 20, 22.
The first character (x) has the offset 2, the fourth character (w) has the offset 22, and so on.
offsets_map: {0: 2, 1: 3, 2: 20, 3: 22}
- Parameters
characters_bboxes – Bounding boxes information of the characters.
- Returns
Mapping of the position of the characters and its offsets.
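The mapping in the example above can be sketched as follows; the key name 'char_index' for a character Bbox's text offset is an assumption about the Bbox dict format.

```python
def map_offsets(characters_bboxes: list) -> dict:
    """Map each character's position (0, 1, ...) to its offset in the Document text.

    Assumes each character Bbox dict carries its text offset under 'char_index'
    (a hypothetical key name for this sketch).
    """
    offsets = sorted(int(bbox['char_index']) for bbox in characters_bboxes)
    return dict(enumerate(offsets))
```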
- konfuzio_sdk.utils.detectron_get_paragraph_bboxes(detectron_document_results: List[List[Dict]], document) List[List[Bbox]] ¶
Get the detectron Bboxes corresponding to each paragraph.
- konfuzio_sdk.utils.iter_before_and_after(iterable, before=1, after=None, fill=None)¶
Iterate and provide before and after element. Generalized from http://stackoverflow.com/a/1012089.
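One plausible reading of this helper, generalized from the cited Stack Overflow recipe, yields (previous, item, next) triples with configurable look-back and look-ahead distances; how the real function shapes its output beyond the default case is an assumption.

```python
from itertools import chain, islice, tee


def iter_before_and_after(iterable, before=1, after=None, fill=None):
    """Yield (previous, item, next) triples, padding the edges with fill."""
    if after is None:
        after = before
    prev_it, cur_it, next_it = tee(iterable, 3)
    # Shift one copy back by `before` and one forward by `after`.
    prev_it = chain([fill] * before, prev_it)
    next_it = chain(islice(next_it, after, None), [fill] * after)
    return zip(prev_it, cur_it, next_it)
```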
- konfuzio_sdk.utils.get_sdk_version()¶
Get the version of the Konfuzio SDK currently in use.
- konfuzio_sdk.utils.get_spans_from_bbox(selection_bbox: Bbox) List[Span] ¶
Get a list of Spans for all the text contained within a Bbox.
- konfuzio_sdk.utils.normalize_name(value: str) str ¶
Normalize names for different Konfuzio concepts by removing slashes and checking for non-ascii symbols.
- Parameters
value – A name to be normalized.
Tokenizers¶
Generic tokenizer.
Abstract Tokenizer¶
- class konfuzio_sdk.tokenizer.base.AbstractTokenizer
Abstract definition of a Tokenizer.
- evaluate(document: konfuzio_sdk.data.Document) pandas.core.frame.DataFrame
Compare a Document with its tokenized version.
- Parameters
document – Document to evaluate
- Returns
Evaluation DataFrame
- evaluate_dataset(dataset_documents: List[konfuzio_sdk.data.Document]) konfuzio_sdk.evaluate.ExtractionEvaluation
Evaluate the tokenizer on a dataset of documents.
- Parameters
dataset_documents – Documents to evaluate
- Returns
ExtractionEvaluation instance
- abstract found_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]
Find all Spans in a Document that can be found by a Tokenizer.
- get_runtime_info() pandas.core.frame.DataFrame
Get the processing runtime information as DataFrame.
- Returns
processing time Dataframe containing the processing duration of all steps of the tokenization.
- lose_weight()
Delete processing steps.
- missing_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]
Apply a Tokenizer on a Document and find all Spans that cannot be found.
Use this approach to sequentially work on remaining Spans after a Tokenizer ran on a List of Documents.
- Parameters
document – A Document
- Returns
A list containing all missing Spans.
- abstract tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Create Annotations with 1 Span based on the result of the Tokenizer.
- Parameters
document – Document to tokenize, can have been tokenized before
- Returns
Document with Spans created by the Tokenizer.
List Tokenizer¶
- class konfuzio_sdk.tokenizer.base.ListTokenizer(tokenizers: List[konfuzio_sdk.tokenizer.base.AbstractTokenizer])
Use multiple tokenizers.
- found_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]
Run found_spans in the given order on a Document.
- lose_weight()
Delete processing steps.
- span_match(span: konfuzio_sdk.data.Span) bool
Run span_match in the given order.
- tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Run tokenize in the given order on a Document.
Rule Based Tokenizer¶
Regex tokenizers.
Regex Tokenizer¶
- class konfuzio_sdk.tokenizer.regex.RegexTokenizer(regex: str)
Tokenizer based on a single regex.
- found_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]
Find Spans found by the Tokenizer and add Tokenizer info to Span.
- Parameters
document – Document with Annotation to find.
- Returns
List of Spans found by the Tokenizer.
- span_match(span: konfuzio_sdk.data.Span) bool
Check if Span is detected by Tokenizer.
- tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Create Annotations with 1 Span based on the result of the Tokenizer.
- Parameters
document – Document to tokenize, can have been tokenized before
- Returns
Document with Spans created by the Tokenizer.
- class konfuzio_sdk.tokenizer.regex.CapitalizedTextTokenizer¶
Tokenizer based on capitalized text.
- Example:
“Company is Company A&B GmbH now” -> “Company A&B GmbH”
- class konfuzio_sdk.tokenizer.regex.ColonOrWhitespacePrecededTokenizer¶
Tokenizer based on text preceded by a colon or a whitespace.
- Example:
“write to: name” -> “name”
- class konfuzio_sdk.tokenizer.regex.ColonPrecededTokenizer¶
Tokenizer based on text preceded by colon.
- Example:
“write to: name” -> “name”
- class konfuzio_sdk.tokenizer.regex.ConnectedTextTokenizer¶
Tokenizer based on text connected by 1 whitespace.
- Example:
r”This is \na description. Occupies a paragraph.” -> “This is”, “a description. Occupies a paragraph.”
- class konfuzio_sdk.tokenizer.regex.LineUntilCommaTokenizer¶
Tokenizer matching each line of text up to the first comma.
- Example:
“\n Company und A&B GmbH,\n” -> “Company und A&B GmbH”
- class konfuzio_sdk.tokenizer.regex.NonTextTokenizer¶
Tokenizer based on non-text elements: numbers and separators.
- Example:
“date 01. 01. 2022” -> “01. 01. 2022”
- class konfuzio_sdk.tokenizer.regex.NumbersTokenizer¶
Tokenizer based on numbers.
- Example:
“N. 1242022 123 ” -> “1242022 123”
- class konfuzio_sdk.tokenizer.regex.WhitespaceNoPunctuationTokenizer¶
Tokenizer based on whitespaces without punctuation.
- Example:
“street Name 1-2b,” -> “street”, “Name”, “1-2b”
- class konfuzio_sdk.tokenizer.regex.WhitespaceTokenizer¶
Tokenizer based on whitespaces.
- Example:
“street Name 1-2b,” -> “street”, “Name”, “1-2b,”
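Conceptually, WhitespaceTokenizer is a RegexTokenizer built around a whitespace-splitting pattern; the exact regex below is an assumption used to reproduce the documented example, not the SDK's actual pattern.

```python
import re

# Hypothetical pattern: one or more characters that are not whitespace separators.
WHITESPACE_REGEX = r'[^ \n\t\f]+'


def whitespace_token_spans(text: str):
    """Return (start_offset, end_offset, matched_text) per whitespace-separated token."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(WHITESPACE_REGEX, text)]
```

The offsets correspond to Span start and end offsets in the Document text.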
Sentence and Paragraph tokenizers.
Paragraph Tokenizer¶
- class konfuzio_sdk.tokenizer.paragraph_and_sentence.ParagraphTokenizer(mode: str = 'detectron', line_height_ratio: float = 0.8, height: Optional[Union[int, float]] = None, create_detectron_labels: bool = False)
Tokenizer splitting Document into paragraphs.
- found_spans(document: konfuzio_sdk.data.Document)
Find the Spans of the detected paragraphs in a Document.
- tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Create one multiline Annotation per paragraph detected.
Sentence Tokenizer¶
- class konfuzio_sdk.tokenizer.paragraph_and_sentence.SentenceTokenizer(mode: str = 'detectron', line_height_ratio: float = 0.8, height: Optional[Union[int, float]] = None, create_detectron_labels: bool = False)
Tokenizer splitting Document into Sentences.
- found_spans(document: konfuzio_sdk.data.Document)
Find the Spans of the detected sentences in a Document.
- tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Create one multiline Annotation per sentence detected.
Extraction AI¶
Extract information from Documents.
Conventional template matching based approaches fail to generalize well to document images of unseen templates, and are not robust against text recognition errors.
We follow the approach proposed by Sun et al. (2021), which encodes both the visual and textual features of detected text regions as graph nodes, with edges representing the spatial relations between neighboring text regions. Their experiments validate that all information, including visual features, textual features and spatial relations, can benefit key information extraction.
We reduce the hardware requirements from one NVIDIA Titan X GPU with 12 GB memory to one CPU and 16 GB memory by splitting the end-to-end pipeline into two parts.
Sun, H., Kuang, Z., Yue, X., Lin, C., & Zhang, W. (2021). Spatial Dual-Modality Graph Reasoning for Key Information Extraction. arXiv. https://doi.org/10.48550/ARXIV.2103.14470
Base Model¶
- class konfuzio_sdk.trainer.information_extraction.BaseModel
Base model to define common methods for all AIs.
- abstract check_is_ready()
Check if the Model is ready for inference.
- ensure_model_memory_usage_within_limit(max_ram: Optional[str] = None)
Ensure that a model is not exceeding allowed max_ram.
- Parameters
max_ram (str) – Specify maximum memory usage condition to save model.
- abstract property entrypoint_methods: dict
Create a dict of methods for this class that are exposed via API.
- abstract static has_compatible_interface(other)
Validate that an instance of an AI implements the same interface defined by this AI class.
- Parameters
other – An instance of an AI to compare with.
- static load_model(pickle_path: str, max_ram: Optional[str] = None)
Load a previously saved instance of the model.
- Parameters
pickle_path (str) – Path to the pickled model.
- Raises
FileNotFoundError – If the path is invalid.
OSError – When the data is corrupted or invalid and cannot be loaded.
TypeError – When the loaded pickle isn’t recognized as a Konfuzio AI model.
- Returns
Extraction AI model.
- property name
Model class name.
- name_lower()
Convert class name to machine-readable name.
- abstract property pkl_file_path
Generate a path for a resulting pickle file.
- abstract property pkl_name
Generate a unique extension-less name for a resulting pickle file.
- reduce_model_weight()
Remove all non-strictly necessary parameters before saving.
- save(output_dir: Optional[str] = None, include_konfuzio=True, reduce_weight=True, compression: str = 'lz4', keep_documents=False, max_ram=None)
Save the label model as a compressed pickle object to the release directory.
Saving is done by: getting the serialized pickle object (via cloudpickle), “optimizing” the serialized object with the built-in pickletools.optimize function (see: https://docs.python.org/3/library/pickletools.html), saving the optimized serialized object.
We then compress the pickle file using shutil.copyfileobject which writes in chunks to avoid loading the entire pickle file in memory.
Finally, we delete the cloudpickle file and are left with the compressed pickle file which has a .pkl.lz4 or .pkl.bz2 extension.
For more info on pickle serialization and including dependencies read https://github.com/cloudpipe/cloudpickle#overriding-pickles-serialization-mechanism-for-importable-constructs
- Parameters
output_dir – Folder to save AI model in. If None, the default Project folder is used.
include_konfuzio – Enables pickle serialization as a value, not as a reference.
reduce_weight – Remove all non-strictly necessary parameters before saving.
compression – Compression algorithm to use. Default is lz4, bz2 is also supported.
max_ram – Specify maximum memory usage condition to save model.
- Raises
MemoryError – When the size of the model in memory is greater than the maximum value.
- Returns
Path of the saved model file.
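The save flow described above (serialize, optimize, then compress in chunks) can be sketched with stdlib stand-ins; plain pickle replaces cloudpickle and bz2 replaces lz4, and the function name is illustrative:

```python
import bz2
import os
import pickle
import pickletools
import shutil


def save_compressed_sketch(model, output_path: str) -> str:
    """Serialize, optimize, then compress in chunks, as described above."""
    # 1. Get the serialized pickle object (the SDK uses cloudpickle).
    raw = pickle.dumps(model)
    # 2. "Optimize" the serialized object with pickletools.optimize.
    optimized = pickletools.optimize(raw)
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(optimized)
    # 3. Compress with shutil.copyfileobj, which streams in chunks to avoid
    #    loading the entire pickle file into memory (bz2 stands in for lz4).
    with open(tmp_path, "rb") as src, bz2.open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # 4. Delete the intermediate pickle; only the compressed file remains.
    os.remove(tmp_path)
    return output_path
```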
- save_bento(build=True, output_dir=None) Union[None, tuple]
Save AI as a BentoML model in the local store.
- Parameters
build – Bundle the model into a BentoML service and store it in the local store.
output_dir – If present, a .bento archive will also be saved to this directory.
- Returns
None if build=False, otherwise a tuple of (saved_bento, archive_path).
- abstract property temp_pkl_file_path
Generate a path for temporary pickle file.
AbstractExtractionAI¶
- class konfuzio_sdk.trainer.information_extraction.AbstractExtractionAI(category: konfuzio_sdk.data.Category, *args, **kwargs)
Parent class for all Extraction AIs, to extract information from unstructured human-readable text.
- static add_extractions_as_annotations(extractions: pandas.core.frame.DataFrame, document: konfuzio_sdk.data.Document, label: konfuzio_sdk.data.Label, label_set: konfuzio_sdk.data.LabelSet, annotation_set: konfuzio_sdk.data.AnnotationSet) None
Add the extraction of a model to the document.
- property bento_metadata: dict
Metadata to include into the bento-saved instance of a model.
- build_bento(bento_model)
Build BentoML service for the model.
- check_is_ready()
Check if the ExtractionAI is ready for inference.
The model is assumed to be ready for extraction if a Category is set.
- Raises
AttributeError – When no Category is specified.
- property entrypoint_methods: dict
Methods that will be exposed in a bento-saved instance of a model.
- evaluate()
Use as a placeholder function.
- extract(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Perform preliminary extraction steps.
- extraction_result_to_document(document: konfuzio_sdk.data.Document, extraction_result: dict) konfuzio_sdk.data.Document
Return a virtual Document annotated with AI Model output.
- fit()
Use as a placeholder function because the Abstract AI does not train a classifier.
- static flush_buffer(buffer: List[pandas.core.series.Series], doc_text: str) Dict
Merge a buffer of entities into a dictionary (which will eventually be turned into a DataFrame).
A buffer is a list of pandas.Series objects.
- static has_compatible_interface(other) bool
Validate that an instance of an Extraction AI implements the same interface as AbstractExtractionAI.
An Extraction AI should implement methods with the same signature as:
- AbstractExtractionAI.__init__
- AbstractExtractionAI.fit
- AbstractExtractionAI.extract
- AbstractExtractionAI.check_is_ready
- Parameters
other – An instance of an Extraction AI to compare with.
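Such an interface check can be sketched with the stdlib inspect module (the function name, the compared method names, and the comparison logic are illustrative, not the SDK's actual implementation):

```python
import inspect


def has_compatible_interface_sketch(other, reference_cls, method_names=("fit", "extract")) -> bool:
    """Check that other's class defines the reference methods with matching signatures."""
    other_cls = type(other)
    for name in method_names:
        if not hasattr(other_cls, name):
            return False
        reference = inspect.signature(getattr(reference_cls, name))
        candidate = inspect.signature(getattr(other_cls, name))
        # compare parameter names in order, e.g. ('self', 'document')
        if list(reference.parameters) != list(candidate.parameters):
            return False
    return True
```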
- static is_valid_horizontal_merge(row: pandas.core.series.Series, buffer: List[pandas.core.series.Series], doc_text: str, max_offset_distance: int = 5) bool
Verify if the merging that we are trying to do is valid.
- A merging is valid only if:
All spans have the same predicted Label
Confidence of predicted Label is above the Label threshold
All spans are on the same line
No extraneous characters in between spans
A maximum of 5 spaces in between spans
The Label type is not one of the following: ‘Number’, ‘Positive Number’, ‘Percentage’, ‘Date’ OR the resulting merge creates a span normalizable to the same type
- Parameters
row – Row candidate to be merged to what is already in the buffer.
buffer – Previous information.
doc_text – Text of the document.
max_offset_distance – Maximum distance between two entities that can be merged.
- Returns
Whether the merge is valid.
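The validity rules can be sketched over plain dict-based spans (the dict keys are illustrative; the SDK operates on pandas rows, and the normalizability rule for typed Labels is omitted here):

```python
def is_valid_merge_sketch(row: dict, buffer: list, doc_text: str, max_offset_distance: int = 5) -> bool:
    """Apply the horizontal-merge rules to dict-based spans (keys are illustrative)."""
    if not buffer:
        return True
    previous = buffer[-1]
    if row["label"] != previous["label"]:
        return False  # all spans must share the same predicted Label
    if row["confidence"] < row["threshold"]:
        return False  # confidence must be above the Label threshold
    gap = doc_text[previous["end_offset"]:row["start_offset"]]
    if "\n" in gap:
        return False  # all spans must be on the same line
    if gap.strip():
        return False  # no extraneous characters in between spans
    if len(gap) > max_offset_distance:
        return False  # a maximum of 5 spaces in between spans
    return True
```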
- static load_model(pickle_path: str, max_ram: Optional[str] = None)
Load the model and check if it has the interface compatible with the class.
- Parameters
pickle_path (str) – Path to the pickled model.
- Raises
FileNotFoundError – If the path is invalid.
OSError – When the data is corrupted or invalid and cannot be loaded.
TypeError – When the loaded pickle isn’t recognized as a Konfuzio AI model.
- Returns
Extraction AI model.
- classmethod merge_horizontal(res_dict: Dict, doc_text: str) Dict
Merge contiguous spans with same predicted label.
See more details at https://dev.konfuzio.com/sdk/explanations.html#horizontal-merge
- property pkl_file_path: str
Generate a path for a resulting pickle file.
- property pkl_name: str
Generate a name for the pickle file.
- property project
Get RFExtractionAI Project.
- property temp_pkl_file_path: str
Generate a path for temporary pickle file.
Random Forest Extraction AI¶
- class konfuzio_sdk.trainer.information_extraction.RFExtractionAI(n_nearest: int = 2, first_word: bool = True, n_estimators: int = 100, max_depth: int = 100, no_label_limit: Optional[Union[int, float]] = None, n_nearest_across_lines: bool = False, use_separate_labels: bool = True, category: Optional[konfuzio_sdk.data.Category] = None, tokenizer=None, *args, **kwargs)
Encode visual and textual features to extract text regions.
Fit an extraction pipeline to extract linked Annotations.
Both the Label and Label Set classifiers use a RandomForestClassifier from scikit-learn to run in a low-memory and single-CPU environment. A random forest classifier is a group of decision tree classifiers, see: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
The parameters of this class allow you to select the Tokenizer, configure the Label and Label Set classifiers, and select the type of features used by the Label and Label Set classifiers.
They are divided into:
- tokenizer selection
- parametrization of the Label classifier
- parametrization of the Label Set classifier
- features for the Label classifier
- features for the Label Set classifier
By default, the text of the Documents is split into smaller chunks based on whitespaces (‘WhitespaceTokenizer’), which means that every word present in the text is shown to the AI. Alternatively, the splitting of the text into smaller chunks can be done based on regexes learned from the Spans of the Annotations of the Category (‘tokenizer_regex’), or with a model for the German language from the spaCy library (‘tokenizer_spacy’). Another option is to use a pre-defined list of regex-based tokenizers (‘tokenizer_regex_list’) and, on top of that list, to create tokenizers that match what those miss (‘tokenizer_regex_combination’).
Some parameters of the scikit-learn RandomForestClassifier used for the Label and/or Label Set classifier can be set directly in Konfuzio Server (‘label_n_estimators’, ‘label_max_depth’, ‘label_class_weight’, ‘label_random_state’, ‘label_set_n_estimators’, ‘label_set_max_depth’).
Features are measurable pieces of data of the Annotation. By default, a combination of features is used that includes features built from the text of the Annotation (‘string_features’), features built from the position of the Annotation in the Document (‘spatial_features’), and features from the Spans created by a WhitespaceTokenizer to the left or right of the Annotation (‘n_nearest_left’, ‘n_nearest_right’, ‘n_nearest_across_lines’). It is possible to exclude any of them (‘spatial_features’, ‘string_features’, ‘n_nearest_left’, ‘n_nearest_right’) or to specify the number of Spans created by a WhitespaceTokenizer to consider (‘n_nearest_left’, ‘n_nearest_right’).
While extracting, the Label Set classifier takes the predictions from the Label classifier as input. The Label Set classifier groups them into Annotation sets.
- check_is_ready()
Check if the ExtractionAI is ready for inference.
The model is assumed to be ready if a Tokenizer and a Category are set and the classifiers have been set and trained.
- Raises
AttributeError – When no Tokenizer is specified.
AttributeError – When no Category is specified.
AttributeError – When no Label Classifier has been provided.
- evaluate_clf(use_training_docs: bool = False) konfuzio_sdk.evaluate.ExtractionEvaluation
Evaluate the Label classifier.
- evaluate_full(strict: bool = True, use_training_docs: bool = False, use_view_annotations: bool = True) konfuzio_sdk.evaluate.ExtractionEvaluation
Evaluate the full pipeline on the pipeline’s Test Documents.
- Parameters
strict – Evaluate on a Character exact level without any postprocessing.
use_training_docs – Bool for whether to evaluate on the training documents instead of testing documents.
- Returns
Evaluation object.
- evaluate_label_set_clf(use_training_docs: bool = False) konfuzio_sdk.evaluate.ExtractionEvaluation
Evaluate the LabelSet classifier.
- evaluate_tokenizer(use_training_docs: bool = False) konfuzio_sdk.evaluate.ExtractionEvaluation
Evaluate the tokenizer.
- extract(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Infer information from a given Document.
- Parameters
document – Document object
- Returns
Document with predicted labels
- Raises
AttributeError – When no Tokenizer is specified.
NotFittedError – When the classifier is not fitted.
- extract_from_df(df: pandas.core.frame.DataFrame, inference_document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Predict Labels from features.
- feature_function(documents: List[konfuzio_sdk.data.Document], no_label_limit: Optional[Union[int, float]] = None, retokenize: Optional[bool] = None, require_revised_annotations: bool = False) Tuple[List[pandas.core.frame.DataFrame], list]
Calculate features per Span of Annotations.
- Parameters
documents – List of Documents to extract features from.
no_label_limit – Int or Float to limit number of new Annotations to create during tokenization.
retokenize – Bool for whether to recreate Annotations from scratch or use already existing Annotations.
require_revised_annotations – Only allow calculation of features if no unrevised Annotation present.
- Returns
Dataframe of features and list of feature names.
- features(document: konfuzio_sdk.data.Document)
Calculate features using the best working default values that can be overwritten with self values.
- filter_dataframe(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame
Filter dataframe rows according to their confidence value.
Rows (extractions) whose confidence value is below the threshold defined for the Label are removed.
- Parameters
df – Dataframe with extraction results
- Returns
Filtered dataframe
- filter_low_confidence_extractions(result: Dict) Dict
Remove extractions with confidence below the threshold defined for the respective label.
The input is a dictionary where the values can be:
- a dataframe
- a dictionary where the values are dataframes
- a list of dictionaries where the values are dataframes
- Parameters
result – Extraction results
- Returns
Filtered dictionary.
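The recursive filtering over the nested result structure can be sketched with plain lists of row dicts standing in for DataFrames (the function name and dict keys are illustrative):

```python
def filter_low_confidence_sketch(result, thresholds: dict):
    """Recursively drop extractions whose confidence is below the Label threshold."""
    if isinstance(result, list):
        if result and isinstance(result[0], dict) and "confidence" in result[0]:
            # a list of row dicts stands in for a DataFrame of extractions
            return [r for r in result if r["confidence"] >= thresholds.get(r["label"], 0.0)]
        return [filter_low_confidence_sketch(item, thresholds) for item in result]
    if isinstance(result, dict):
        return {k: filter_low_confidence_sketch(v, thresholds) for k, v in result.items()}
    return result
```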
- fit() sklearn.ensemble._forest.RandomForestClassifier
Given the training data and the feature list, return the trained classification model.
- label_train_document(virtual_document: konfuzio_sdk.data.Document, original_document: konfuzio_sdk.data.Document)
Assign Labels to Annotations in newly tokenized virtual training Document.
- merge_vertical(document: konfuzio_sdk.data.Document, only_multiline_labels=True)
Merge Annotations with the same Label.
See more details at https://dev.konfuzio.com/sdk/explanations.html#vertical-merge
- Parameters
document – Document whose Annotations should be merged vertically
only_multiline_labels – Only merge if a multiline Label Annotation is in the Category Training set
- merge_vertical_like(document: konfuzio_sdk.data.Document, template_document: konfuzio_sdk.data.Document)
Merge Annotations the same way as in another copy of the same Document.
All single-Span Annotations in the current Document (self) are matched with corresponding multi-line Spans in the given Document and are merged in the same way. The Label of the new multi-line Annotations is taken to be the most common Label among the original single-line Annotations that are being merged.
- Parameters
document – Document with multi-line Annotations
- reduce_model_weight()
Remove all non-strictly necessary parameters before saving.
- remove_empty_dataframes_from_extraction(result: Dict) Dict
Remove empty dataframes from the result of an Extraction AI.
The input is a dictionary where the values can be:
- a dataframe
- a dictionary where the values are dataframes
- a list of dictionaries where the values are dataframes
- property requires_segmentation: bool
Return True if the Extraction AI requires detectron segmentation results to process Documents.
- separate_labels(res_dict: Dict) Dict
Undo the renaming of the labels.
In this way we have the output of the extraction in the correct format.
Categorization AI¶
Implements a Categorization Model.
Abstract Categorization AI¶
- class konfuzio_sdk.trainer.document_categorization.AbstractCategorizationAI(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)
Abstract definition of a CategorizationAI.
- categorize(document: konfuzio_sdk.data.Document, recategorize: bool = False, inplace: bool = False) konfuzio_sdk.data.Document
Run categorization on a Document.
- Parameters
document – Input Document
recategorize – If the input Document is already categorized, the already present Category is used unless this flag is True
inplace – Option to categorize the provided Document in place, which would assign the Category attribute
- Returns
Copy of the input Document with added CategoryAnnotation information
- check_is_ready()
Check if Categorization AI instance is ready for inference.
It is assumed that the model is ready when there is at least one Category passed as the input.
- Raises
AttributeError – When no Categories are passed into the model.
- property entrypoint_methods: dict
Methods that will be exposed in a bento-saved instance of a model.
- evaluate(use_training_docs: bool = False) konfuzio_sdk.evaluate.CategorizationEvaluation
Evaluate the full Categorization pipeline on the pipeline’s Test Documents.
- Parameters
use_training_docs – Bool for whether to evaluate on the Training Documents instead of Test Documents.
- Returns
Evaluation object.
- abstract fit() None
Train the Categorization AI.
- static has_compatible_interface(other)
Validate that an instance of a Categorization AI implements the same interface as AbstractCategorizationAI.
A Categorization AI should implement methods with the same signature as:
- AbstractCategorizationAI.__init__
- AbstractCategorizationAI.fit
- AbstractCategorizationAI._categorize_page
- AbstractCategorizationAI.check_is_ready
- Parameters
other – An instance of a Categorization AI to compare with.
- static load_model(pickle_path: str, device='cpu')
Load the model and check if it has the interface compatible with the class.
- Parameters
pickle_path (str) – Path to the pickled model.
- Raises
FileNotFoundError – If the path is invalid.
OSError – When the data is corrupted or invalid and cannot be loaded.
TypeError – When the loaded pickle isn’t recognized as a Konfuzio AI model.
- Returns
Categorization AI model.
- name_lower()
Convert class name to machine-readable name.
- property pkl_file_path: str
Generate a path for a resulting pickle file.
- Returns
A string with the path.
- property pkl_name
Generate a unique extension-less name for a resulting pickle file.
- abstract save(output_dir: str, include_konfuzio=True)
Save the model to disk.
- property temp_pkl_file_path: str
Generate a path for temporary pickle file.
- Returns
A string with the path.
Name-based Categorization AI¶
- class konfuzio_sdk.trainer.document_categorization.NameBasedCategorizationAI(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)
A simple, non-trainable model that predicts a Category for a given Document based on a predefined rule.
It checks whether the name of the Category is present in the input Document (case insensitive; also see Category.fallback_name). This can be an effective fallback logic to categorize Documents when no Categorization AI is available.
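The rule can be sketched as a case-insensitive substring check (the function name, the text, and the category names are illustrative, and fallback names are omitted):

```python
from typing import List, Optional


def categorize_by_name_sketch(document_text: str, category_names: List[str]) -> Optional[str]:
    """Return the first Category whose name appears in the Document text, case-insensitively."""
    text = document_text.lower()
    for name in category_names:
        if name.lower() in text:
            return name
    return None  # no Category name found in the text
```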
- fit() None
Use as a placeholder function because there’s no classifier to be trained.
- save(output_dir: str, include_konfuzio=True)
Use as a placeholder function.
Model-based Categorization AI¶
- class konfuzio_sdk.trainer.document_categorization.CategorizationAI(categories: List[konfuzio_sdk.data.Category], use_cuda: bool = False, *args, **kwargs)
A trainable AI that predicts a Category for each Page of a given Document.
- build_document_classifier_iterator(documents, transforms, use_image: bool, use_text: bool, shuffle: bool, batch_size: int, max_len: int, device='cpu') torch.utils.data.dataloader.DataLoader
Prepare the data necessary for the document classifier, and build the iterators for the data list.
- For each document, we split it into pages, and from each page we take:
the path to an image of the page
the tokenized and numericalized text on the page
the label (category) of the page
the id of the document
the page number
- build_preprocessing_pipeline(use_image: bool, image_augmentation=None, image_preprocessing=None) None
Set up the pre-processing and data augmentation when necessary.
- build_template_category_vocab() konfuzio_sdk.tokenizer.base.Vocab
Build a vocabulary over the Categories.
- build_text_vocab(min_freq: int = 1, max_size: Optional[int] = None) konfuzio_sdk.tokenizer.base.Vocab
Build a vocabulary over the document text.
- property compressed_file_path: str
Generate a path for a resulting compressed file in .lz4 format.
- Returns
A string with the path.
- fit(max_len: Optional[bool] = None, batch_size: int = 1, **kwargs) Dict[str, List[float]]
Fit the CategorizationAI classifier.
- reduce_model_weight()
Reduce the size of the model by running lose_weight on the tokenizer.
- save(output_dir: Optional[str] = None, reduce_weight: bool = True, **kwargs) str
Save only the necessary parts of the model for extraction/inference.
Saves:
- tokenizer (needed to ensure we tokenize inference examples in the same way that they are trained)
- transforms (to ensure we transform/pre-process images in the same way as in training)
- vocabs (to ensure the tokens/labels are mapped to the same integers as in training)
- configs (to ensure we load the same models used in training)
- state_dicts (the classifier parameters achieved through training)
Note: “path” is a deprecated parameter, “output_dir” is used for the sake of uniformity across all AIs.
- Parameters
output_dir (str) – A path to save the model to.
reduce_weight (bool) – Reduces the weight of a model by removing Documents and reducing weight of a Tokenizer.
- property temp_pt_file_path: str
Generate a path for a temporary model file in .pt format.
- Returns
A string with the path.
Build a Model-based Categorization AI¶
- konfuzio_sdk.trainer.document_categorization.build_categorization_ai_pipeline(categories: List[konfuzio_sdk.data.Category], documents: List[konfuzio_sdk.data.Document], test_documents: List[konfuzio_sdk.data.Document], tokenizer: Optional[konfuzio_sdk.tokenizer.base.AbstractTokenizer] = None, image_model_name: Optional[konfuzio_sdk.trainer.document_categorization.ImageModel] = None, text_model_name: Optional[konfuzio_sdk.trainer.document_categorization.TextModel] = TextModel.NBOW, **kwargs) konfuzio_sdk.trainer.document_categorization.CategorizationAI
Build a Categorization AI neural network by choosing an ImageModel and a TextModel.
See an in-depth tutorial at https://dev.konfuzio.com/sdk/tutorials/data_validation/index.html
NBOW Model¶
- class konfuzio_sdk.trainer.document_categorization.NBOW(input_dim: int, emb_dim: int = 64, dropout_rate: float = 0.0, **kwargs)
The neural bag-of-words (NBOW) model is the simplest of models: it passes each token through an embedding layer.
As shown in the fastText paper (https://arxiv.org/abs/1607.01759) this model is still able to achieve comparable performance to some deep learning models whilst being considerably faster.
One downside of this model is that tokens are embedded without regard to the surrounding context in which they appear; e.g. the embeddings for “May” in the two sentences “May I speak to you?” and “I am leaving on the 1st of May” are identical, even though the word has different semantics in each.
- Parameters
emb_dim – The dimensions of the embedding vector.
dropout_rate – The amount of dropout applied to the embedding vectors.
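A minimal pure-Python sketch of the NBOW idea: look up one vector per token and average them into a single document representation (real models use torch with learned weights; the fixed random vectors here are a stand-in for a trained embedding layer):

```python
import random
from typing import List


def nbow_sketch(tokens: List[str], emb_dim: int = 4, seed: int = 0) -> List[float]:
    """Embed every token independently, then average into one document vector."""
    rng = random.Random(seed)
    table = {}
    for token in tokens:
        # each distinct token gets one fixed vector, regardless of context
        if token not in table:
            table[token] = [rng.uniform(-1.0, 1.0) for _ in range(emb_dim)]
    vectors = [table[token] for token in tokens]
    # element-wise mean over all token vectors
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(emb_dim)]
```

Because each token's vector ignores context, repeating a token never changes the averaged representation, which illustrates the context-insensitivity noted above.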
NBOW Self Attention Model¶
- class konfuzio_sdk.trainer.document_categorization.NBOWSelfAttention(input_dim: int, emb_dim: int = 64, n_heads: int = 8, dropout_rate: float = 0.0, **kwargs)
This is an NBOW model with a multi-headed self-attention layer, which is added after the embedding layer.
See details at https://arxiv.org/abs/1706.03762. The self-attention layer effectively contextualizes the output as now each hidden state is calculated from the embedding vector of a token and the embedding vector of all other tokens within the sequence.
- Parameters
emb_dim – The dimensions of the embedding vector.
dropout_rate – The amount of dropout applied to the embedding vectors.
n_heads – The number of attention heads to use in the multi-headed self-attention layer. Note that n_heads must be a factor of emb_dim, i.e. emb_dim % n_heads == 0.
LSTM Model¶
- class konfuzio_sdk.trainer.document_categorization.LSTM(input_dim: int, emb_dim: int = 64, hid_dim: int = 256, n_layers: int = 2, bidirectional: bool = True, dropout_rate: float = 0.0, **kwargs)
The LSTM (long short-term memory) is a variant of an RNN (recurrent neural network).
It feeds the input tokens through an embedding layer and then processes them sequentially with the LSTM, outputting a hidden state for each token. If the LSTM is bidirectional then it trains a forward and backward LSTM per layer and concatenates the forward and backward hidden states for each token.
- Parameters
emb_dim – The dimensions of the embedding vector.
hid_dim – The dimensions of the hidden states.
n_layers – How many LSTM layers to use.
bidirectional – If the LSTM should be bidirectional.
dropout_rate – The amount of dropout applied to the embedding vectors and between LSTM layers if n_layers > 1.
BERT Model¶
- class konfuzio_sdk.trainer.document_categorization.BERT(name: str = 'bert-base-german-cased', freeze: bool = False, **kwargs)
Wraps around pre-trained BERT-type models from the HuggingFace library.
BERT (bidirectional encoder representations from Transformers) is a family of large Transformer models. The available BERT variants are all pre-trained models provided by the transformers library. It is usually infeasible to train a BERT model from scratch due to the significant amount of computation required. However, the pre-trained models can be easily fine-tuned on desired data.
- The BERT variants, i.e. name arguments, that are covered by internal tests are:
bert-base-german-cased
bert-base-german-dbmdz-cased
bert-base-german-dbmdz-uncased
distilbert-base-german-cased
In theory, all variants beginning with bert-base-* and distilbert-* should work out of the box. Other BERT variants come with no guarantees.
- Parameters
name – The name of the pre-trained BERT variant to use.
freeze – Should the BERT model be frozen, i.e. the pre-trained parameters are not updated.
- get_max_length()
Get the maximum length of a sequence that can be passed to the BERT module.
VGG Model¶
- class konfuzio_sdk.trainer.document_categorization.VGG(name: str = 'vgg11', pretrained: bool = True, freeze: bool = True, **kwargs)
The VGG family of models are image classification models designed for the ImageNet dataset.
They are usually used as a baseline in image classification tasks; however, they are considerably larger, in terms of the number of parameters, than modern architectures.
Available variants are: vgg11, vgg13, vgg16, vgg19, vgg11_bn, vgg13_bn, vgg16_bn, vgg19_bn. The number generally indicates the number of layers in the model; higher does not always mean better. The _bn suffix means that the VGG model uses Batch Normalization layers, which generally leads to better results.
The pre-trained weights are taken from the torchvision library (https://github.com/pytorch/vision) and come from a model trained as an image classifier on ImageNet. Ideally, this means the images should be 3-channel color images that are at least 224x224 pixels and should be normalized.
- Parameters
name – The name of the VGG variant to use
pretrained – If pre-trained weights for the VGG variant should be used
freeze – If the parameters of the VGG variant should be frozen
EfficientNet Model¶
- class konfuzio_sdk.trainer.document_categorization.EfficientNet(name: str = 'efficientnet_b0', pretrained: bool = True, freeze: bool = True, **kwargs)
EfficientNet is a family of convolutional neural network based models that are designed to be more efficient.
The efficiency comes in terms of the number of parameters and FLOPS compared to previous computer vision models, whilst maintaining equivalent image classification performance.
Available variants are: efficientnet_b0, efficientnet_b1, …, efficientnet_b7, with b0 having the fewest parameters and b7 having the most.
The pre-trained weights are taken from the timm library and have been trained on ImageNet, thus the same tips, i.e. normalization, that apply to the VGG models also apply here.
- Parameters
name – The name of the EfficientNet variant to use
pretrained – If pre-trained weights for the EfficientNet variant should be used
freeze – If the parameters of the EfficientNet variant should be frozen
- get_n_features() int
Calculate number of output features based on given model.
Multimodal Concatenation¶
- class konfuzio_sdk.trainer.document_categorization.MultimodalConcatenate(n_image_features: int, n_text_features: int, hid_dim: int = 256, output_dim: Optional[int] = None, **kwargs)
Defines how the image and text features are combined in order to yield a categorization prediction.
File Splitting AI¶
Process Documents that consist of several files and propose splitting them into the Sub-Documents accordingly.
Abstract File Splitting Model¶
- class konfuzio_sdk.trainer.file_splitting.AbstractFileSplittingModel(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)
Abstract class for the File Splitting model.
- property entrypoint_methods: dict
Methods that will be exposed in a bento-saved instance of a model.
- abstract fit(*args, **kwargs)
Fit the custom model on the training Documents.
- static has_compatible_interface(other) bool
Validate that an instance of a File Splitting Model implements the same interface as AbstractFileSplittingModel.
A File Splitting Model should implement methods with the same signature as:
- AbstractFileSplittingModel.__init__
- AbstractFileSplittingModel.predict
- AbstractFileSplittingModel.fit
- AbstractFileSplittingModel.check_is_ready
- Parameters
other – An instance of a File Splitting Model to compare with.
- static load_model(pickle_path: str, max_ram: Optional[str] = None)
Load the model and check if it has the interface compatible with the class.
- Parameters
pickle_path (str) – Path to the pickled model.
- Raises
FileNotFoundError – If the path is invalid.
OSError – When the data is corrupted or invalid and cannot be loaded.
TypeError – When the loaded pickle isn’t recognized as a Konfuzio AI model.
- Returns
File Splitting AI model.
- property pkl_file_path: str
Generate a path for a resulting pickle file.
- Returns
A string with the path.
- property pkl_name
Generate a unique extension-less name for a resulting pickle file.
- abstract predict(page: konfuzio_sdk.data.Page) konfuzio_sdk.data.Page
Take a Page as an input and reassign is_first_page attribute’s value if necessary.
- Parameters
page (Page) – A Page to label first or non-first.
- Returns
Page.
- property temp_pkl_file_path: str
Generate a path for temporary pickle file.
- Returns
A string with the path.
Context Aware File Splitting Model¶
- class konfuzio_sdk.trainer.file_splitting.ContextAwareFileSplittingModel(categories: List[konfuzio_sdk.data.Category], tokenizer, *args, **kwargs)
A File Splitting Model that uses a context-aware logic.
Context-aware logic implies a rule-based approach that looks for strings common to the first Pages of all of a Category’s Documents.
- check_is_ready()
Check if the File Splitting Model is ready for inference.
- Raises
AttributeError – When no Tokenizer or no Categories were passed.
ValueError – When no Categories have _exclusive_first_page_strings.
- fit(allow_empty_categories: bool = False, *args, **kwargs)
Gather the strings exclusive for first Pages in a given stream of Documents.
Exclusive means that each of these strings appear only on first Pages of Documents within a Category.
- Parameters
allow_empty_categories (bool) – To allow returning an empty list for a Category if no exclusive first-page strings were found during fitting (which means prediction would be impossible for that Category).
- Raises
ValueError – When allow_empty_categories is False and no exclusive first-page strings were found for at least one Category.
- predict(page: konfuzio_sdk.data.Page) konfuzio_sdk.data.Page
Predict a Page as first or non-first.
- Parameters
page (Page) – A Page to receive first or non-first label.
- Returns
A Page with a newly predicted is_first_page attribute.
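The fit/predict logic can be sketched with set operations, representing each Document as a list of page strings and reducing tokenization to whitespace splitting (function names and data shapes are illustrative, not the SDK's):

```python
from typing import List, Set


def exclusive_first_page_strings_sketch(documents: List[List[str]]) -> Set[str]:
    """Strings shared by all first pages of a Category's Documents but absent from later pages."""
    on_all_first_pages = set.intersection(*(set(doc[0].split()) for doc in documents))
    on_later_pages = set().union(*(set(page.split()) for doc in documents for page in doc[1:]))
    return on_all_first_pages - on_later_pages


def predict_first_page_sketch(page_text: str, exclusive_strings: Set[str]) -> bool:
    """A Page is predicted as first if it contains any exclusive first-page string."""
    return bool(set(page_text.split()) & exclusive_strings)
```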
Multimodal File Splitting Model¶
- class konfuzio_sdk.trainer.file_splitting.MultimodalFileSplittingModel(categories: List[konfuzio_sdk.data.Category], text_processing_model: str = 'nlpaueb/legal-bert-small-uncased', scale: int = 2, *args, **kwargs)
Split a multi-Document file into a list of shorter Documents based on model’s prediction.
We use an approach suggested by Guha et al. (2022) that accepts separate visual and textual inputs, processes them independently via the VGG19 architecture and the LegalBERT model (essentially a BERT-type architecture trained on domain-specific data), and passes the resulting outputs together to a Multi-Layered Perceptron.
Guha, A., Alahmadi, A., Samanta, D., Khan, M. Z., & Alahmadi, A. H. (2022). A Multi-Modal Approach to Digital Document Stream Segmentation for Title Insurance Domain. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9684474
- check_is_ready()
Check if Multimodal File Splitting Model instance is ready for inference.
Checks that the instance of the Model has at least one Category passed as input and that it is fitted to run prediction.
- Raises
AttributeError – When no Categories are passed to the model.
AttributeError – When a model is not fitted to run a prediction.
- fit(epochs: int = 10, use_gpu: bool = False, train_batch_size=8, *args, **kwargs)
Process the train and test data, initialize and fit the model.
- Parameters
epochs (int) – A number of epochs to train a model on.
use_gpu (bool) – Run training on GPU if available.
- predict(page: konfuzio_sdk.data.Page, use_gpu: bool = False) konfuzio_sdk.data.Page
Run prediction with the trained model.
- Parameters
page (Page) – A Page to be predicted as first or non-first.
use_gpu (bool) – Run prediction on GPU if available.
- Returns
A Page with possible changes in is_first_page attribute value.
- reduce_model_weight()
Remove all non-strictly necessary parameters before saving.
- remove_dependencies()
Remove dependencies before saving.
This is needed for proper saving of the model in lz4 compressed format – if the dependencies are not removed, the resulting pickle will be impossible to load.
- restore_dependencies()
Restore removed dependencies after loading.
This is needed for proper functioning of a loaded model because we have previously removed these dependencies upon saving the model.
Textual File Splitting Model¶
- class konfuzio_sdk.trainer.file_splitting.TextualFileSplittingModel(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)
This model operates by taking as input a multi-Document file and utilizing the DistilBERT model to make predictions regarding the segmentation of this document. Specifically, it aims to identify boundaries within the text where one document ends and another begins, effectively splitting the input into a list of shorter documents.
DistilBERT serves as the backbone of this model. DistilBERT offers a computationally efficient alternative to BERT, achieved through knowledge distillation while preserving much of BERT’s language understanding capabilities.
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter https://arxiv.org/abs/1910.01108.
- check_is_ready()
Check if Textual File Splitting Model instance is ready for inference.
This method checks that the model instance has at least one Category passed as input and that it has been fitted, so that it can run predictions.
- Raises
AttributeError – When no Categories are passed to the model.
AttributeError – When a model is not fitted to run a prediction.
- fit(epochs: int = 5, eval_batch_size: int = 8, train_batch_size: int = 8, device: str = 'cpu', *args, **kwargs)
Process the train and test data, initialize and fit the model.
- Parameters
epochs (int) – The number of epochs to train the model for.
eval_batch_size (int) – A batch size for evaluation.
train_batch_size (int) – A batch size for training.
device (str) – A device to run the prediction on. Possible values are ‘mps’, ‘cuda’, ‘cpu’.
- Returns
A dictionary with evaluation results.
- predict(page: konfuzio_sdk.data.Page, previous_page: Optional[konfuzio_sdk.data.Page] = None, device: str = 'cpu', *args, **kwargs) konfuzio_sdk.data.Page
Run prediction with the trained model.
- Parameters
page (Page) – A Page to be predicted as first or non-first.
previous_page (Page) – The previous Page, which gives the model more context.
device (str) – A device to run the prediction on. Possible values are ‘mps’, ‘cuda’, ‘cpu’.
- Returns
A Page with possible changes in is_first_page attribute value.
- reduce_model_weight()
Remove all non-strictly necessary parameters before saving.
- static remove_dependencies()
Remove dependencies before saving.
This is needed for proper saving of the model in lz4 compressed format – if the dependencies are not removed, the resulting pickle will be impossible to load.
- static restore_dependencies()
Restore removed dependencies after loading.
This is needed for proper functioning of a loaded model because we have previously removed these dependencies upon saving the model.
Splitting AI¶
- class konfuzio_sdk.trainer.file_splitting.SplittingAI(model)
Split a given Document and return a list of resulting shorter Documents.
- evaluate_full(use_training_docs: bool = False, zero_division='warn') konfuzio_sdk.evaluate.FileSplittingEvaluation
Evaluate the Splitting AI’s performance.
- Parameters
use_training_docs (bool) – If enabled, runs evaluation on the training data to define its quality; if disabled, runs evaluation on the test data.
zero_division – Defines how to handle situations when precision, recall or F1 measure calculations result in zero division. Possible values: ‘warn’ – log a warning and assign the metric a value of 0; 0 – assign the metric a value of 0; ‘error’ – raise a ZeroDivisionError; None – assign None to the metric.
- Returns
Evaluation information for the model.
- propose_split_documents(document: konfuzio_sdk.data.Document, return_pages: bool = False, inplace: bool = False, split_on_blank_pages: bool = False, device: str = 'cpu') List[konfuzio_sdk.data.Document]
Propose a set of resulting Documents from a single Document.
- Parameters
document (Document) – An input Document to be split.
inplace (bool) – Whether changes are applied to the input Document, changing it, or to a deepcopy of it.
return_pages (bool) – A flag to enable returning a copy of an old Document with Pages marked .is_first_page on splitting points instead of a set of Sub-Documents.
split_on_blank_pages (bool) – A flag to enable splitting on blank Pages.
- Returns
A list of suggested new Sub-Documents built from the original Document, or a list with a Document whose Pages are marked .is_first_page on splitting points.
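The splitting step of propose_split_documents can be pictured as grouping consecutive Pages, opening a new Sub-Document at every Page flagged .is_first_page. A standalone sketch, with plain lists standing in for Document and Page objects:

```python
def split_by_first_page(pages, is_first_page):
    """Group consecutive pages into sub-documents, starting a new
    sub-document at every page flagged as a first page."""
    sub_documents = []
    for page, first in zip(pages, is_first_page):
        if first or not sub_documents:
            sub_documents.append([])  # open a new sub-document
        sub_documents[-1].append(page)
    return sub_documents
```

For example, four pages with flags `[True, False, True, False]` yield two two-page sub-documents.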
AI Evaluation¶
Extraction AI Evaluation¶
- class konfuzio_sdk.evaluate.ExtractionEvaluation(documents: List[Tuple[konfuzio_sdk.data.Document, konfuzio_sdk.data.Document]], strict: bool = True, use_view_annotations: bool = True, ignore_below_threshold: bool = True, zero_division='warn')
Calculate accuracy measures by using a detailed comparison on Span level.
- calculate()
Calculate and update the data stored within this Evaluation.
- calculate_thresholds()
Calculate optimal thresholds for each Label in the Document set that achieve the highest F1 score, precision and recall.
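The threshold optimization can be illustrated as a brute-force search over candidate confidence thresholds, keeping the one with the best F1. This is a hypothetical standalone sketch with made-up data, not the SDK's implementation:

```python
def best_threshold(confidences, is_correct, candidates):
    """Pick the confidence threshold that maximizes F1.
    `confidences` are predicted scores, `is_correct` marks true matches."""
    best = (None, -1.0)
    for t in candidates:
        tp = sum(1 for c, ok in zip(confidences, is_correct) if c >= t and ok)
        fp = sum(1 for c, ok in zip(confidences, is_correct) if c >= t and not ok)
        fn = sum(1 for c, ok in zip(confidences, is_correct) if c < t and ok)
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        if f1 > best[1]:
            best = (t, f1)
    return best
```

Lowering the threshold trades false negatives for false positives; the search simply keeps the candidate with the highest resulting F1.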
- clf_f1(search=None) Optional[float]
Calculate the F1 Score of the Label classifier.
- Parameters
search – Parameter used to calculate the value for one Data object.
- clf_fn(search=None) int
Return the Label classifier False Negatives of all Spans.
- clf_fp(search=None) int
Return the Label classifier False Positives of all Spans.
- clf_tp(search=None) int
Return the Label classifier True Positives of all Spans.
- f1(search=None) Optional[float]
Calculate the F1 Score of one class.
Please note: as suggested by Opitz et al. (2021), we use the arithmetic mean over individual F1 scores.
“F1 is often used with the intention to assign equal weight to frequent and infrequent classes, we recommend evaluating classifiers with F1 (the arithmetic mean over individual F1 scores), which is significantly more robust towards the error type distribution.”
Opitz, Juri, and Sebastian Burst. “Macro F1 and Macro F1.” arXiv preprint arXiv:1911.03347 (2021). https://arxiv.org/pdf/1911.03347.pdf
- Parameters
search – Parameter used to calculate the value for one class.
- Example:
If you have three Documents, calculate the F-1 Score per Document and use the arithmetic mean.
If you have three Labels, calculate the F-1 Score per Label and use the arithmetic mean.
If you have three Labels and three documents, calculate six F-1 Scores and use the arithmetic mean.
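The recommendation above (macro F1 as the arithmetic mean over per-class F1 scores) can be sketched as:

```python
def f1_score(tp, fp, fn):
    """F1 from TP/FP/FN counts of a single class; 0.0 when undefined."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def macro_f1(per_class_counts):
    """Arithmetic mean over individual per-class F1 scores,
    as recommended by Opitz and Burst (2021)."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)
```

Because each class contributes equally regardless of its frequency, a rare class with poor F1 pulls the macro average down as much as a frequent one would.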
- fn(search=None) int
Return the False Negatives of all Spans.
- fp(search=None) int
Return the False Positives of all Spans.
- get_evaluation_data(search, allow_zero: bool = True) konfuzio_sdk.evaluate.EvaluationCalculator
Get precision, recall, f1, based on TP, FP, FN.
- get_missing_vertical_merge()
Return Spans that should have been merged.
- get_wrong_vertical_merge()
Return Spans that were wrongly merged vertically.
- gt(search=None) int
Return the number of ground-truth Annotations for a given Label.
- precision(search=None) Optional[float]
Calculate the Precision; see f1 for handling imbalanced classes.
- recall(search=None) Optional[float]
Calculate the Recall; see f1 for handling imbalanced classes.
- tn(search=None) int
Return the True Negatives of all Spans.
- tokenizer_f1(search=None) Optional[float]
Calculate the F1 Score of the tokenizer.
- Parameters
search – Parameter used to calculate the value for one Data object.
- tokenizer_fn(search=None) int
Return the tokenizer False Negatives of all Spans.
- tokenizer_fp(search=None) int
Return the tokenizer False Positives of all Spans.
- tokenizer_tp(search=None) int
Return the tokenizer True Positives of all Spans.
- tp(search=None) int
Return the True Positives of all Spans.
Categorization AI Evaluation¶
- class konfuzio_sdk.evaluate.CategorizationEvaluation(categories: List[konfuzio_sdk.data.Category], documents: List[Tuple[konfuzio_sdk.data.Document, konfuzio_sdk.data.Document]], zero_division='warn')
Calculate evaluation measures for the classification task of Document categorization.
- property actual_classes: List[int]
List of ground truth Category IDs.
- calculate()
Calculate and update the data stored within this Evaluation.
- property category_ids: List[int]
List of Category IDs as class labels.
- property category_names: List[str]
List of Category names as class names.
- confusion_matrix() pandas.core.frame.DataFrame
Confusion matrix.
- f1(category: Optional[konfuzio_sdk.data.Category]) Optional[float]
Calculate the global F1 Score or filter it by one Category.
- fn(category: Optional[konfuzio_sdk.data.Category] = None) int
Return the False Negatives of all Documents.
- fp(category: Optional[konfuzio_sdk.data.Category] = None) int
Return the False Positives of all Documents.
- get_evaluation_data(search: Optional[konfuzio_sdk.data.Category] = None, allow_zero: bool = True) konfuzio_sdk.evaluate.EvaluationCalculator
Get precision, recall, f1, based on TP, TN, FP, FN.
- Parameters
search (Category) – A Category to filter for, or None for getting global evaluation results.
allow_zero (bool) – If true, will calculate None for precision and recall when the straightforward application of the formula would otherwise result in 0/0; raises ZeroDivisionError otherwise.
- gt(category: Optional[konfuzio_sdk.data.Category] = None) int
Placeholder for compatibility with Server.
- precision(category: Optional[konfuzio_sdk.data.Category]) Optional[float]
Calculate the global Precision or filter it by one Category.
- property predicted_classes: List[int]
List of predicted Category IDs.
- recall(category: Optional[konfuzio_sdk.data.Category]) Optional[float]
Calculate the global Recall or filter it by one Category.
- tn(category: Optional[konfuzio_sdk.data.Category] = None) int
Return the True Negatives of all Documents.
- tp(category: Optional[konfuzio_sdk.data.Category] = None) int
Return the True Positives of all Documents.
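The confusion_matrix() cells can be illustrated by counting (actual, predicted) Category ID pairs. A standalone sketch using plain lists instead of the pandas DataFrame the SDK returns:

```python
from collections import Counter

def confusion_counts(actual, predicted, category_ids):
    """Build confusion-matrix rows (actual) x columns (predicted)
    by counting (actual, predicted) Category ID pairs."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in category_ids] for a in category_ids]
```

Diagonal cells are correctly categorized Documents; off-diagonal cells show which Category each misclassified Document was assigned instead.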
File Splitting AI Evaluation¶
- class konfuzio_sdk.evaluate.FileSplittingEvaluation(ground_truth_documents: List[konfuzio_sdk.data.Document], prediction_documents: List[konfuzio_sdk.data.Document], zero_division='warn')
Evaluate the quality of the file splitting logic.
- calculate()
Calculate metrics for the File Splitting logic.
- calculate_metrics_by_category()
Calculate metrics by Category independently.
- f1(search: Optional[konfuzio_sdk.data.Category] = None) float
Return F1-measure.
- Parameters
search (Category) – display F1 measure within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
- fn(search: Optional[konfuzio_sdk.data.Category] = None) int
Return first Pages incorrectly predicted as non-first.
- Parameters
search (Category) – display false negatives within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
- fp(search: Optional[konfuzio_sdk.data.Category] = None) int
Return non-first Pages incorrectly predicted as first.
- Parameters
search (Category) – display false positives within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
- get_evaluation_data(search: Optional[konfuzio_sdk.data.Category] = None, allow_zero: bool = True) konfuzio_sdk.evaluate.EvaluationCalculator
Get precision, recall, f1, based on TP, TN, FP, FN.
- Parameters
search (Category) – A Category to filter for, or None for getting global evaluation results.
allow_zero (bool) – If true, will calculate None for precision and recall when the straightforward application of the formula would otherwise result in 0/0; raises ZeroDivisionError otherwise.
- gt(search: Optional[konfuzio_sdk.data.Category] = None) int
Placeholder for compatibility with Server.
- precision(search: Optional[konfuzio_sdk.data.Category] = None) float
Return precision.
- Parameters
search (Category) – display precision within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
- recall(search: Optional[konfuzio_sdk.data.Category] = None) float
Return recall.
- Parameters
search (Category) – display recall within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
- tn(search: Optional[konfuzio_sdk.data.Category] = None) int
Return non-first Pages predicted as non-first.
- Parameters
search (Category) – display true negatives within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
- tp(search: Optional[konfuzio_sdk.data.Category] = None) int
Return correctly predicted first Pages.
- Parameters
search (Category) – display true positives within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
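The TP/FP/FN/TN definitions used by FileSplittingEvaluation (counted over per-Page first-page flags) can be sketched as follows; boolean lists stand in for the ground-truth and predicted Documents:

```python
def splitting_counts(ground_truth, predicted):
    """Count TP/FP/FN/TN over per-Page is_first_page flags:
    TP = first Pages predicted as first, FN = first Pages missed,
    FP = non-first Pages predicted as first, TN = the rest."""
    tp = sum(1 for g, p in zip(ground_truth, predicted) if g and p)
    fp = sum(1 for g, p in zip(ground_truth, predicted) if not g and p)
    fn = sum(1 for g, p in zip(ground_truth, predicted) if g and not p)
    tn = sum(1 for g, p in zip(ground_truth, predicted) if not g and not p)
    return tp, fp, fn, tn
```

Precision, recall and F1 then follow from these counts as in any binary classification setting.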
Evaluation Calculator¶
- class konfuzio_sdk.evaluate.EvaluationCalculator(tp: int = 0, fp: int = 0, fn: int = 0, tn: int = 0, zero_division='warn')
Calculate precision, recall, f1, based on TP, FP, FN.
- property f1: Optional[float]
Apply F1-score formula.
- Raises
ZeroDivisionError – When precision and recall are 0 and zero_division is set to ‘error’
- metrics_logging()
Log metrics.
- property precision: Optional[float]
Apply precision formula.
- Raises
ZeroDivisionError – When TP and FP are 0 and zero_division is set to ‘error’
- property recall: Optional[float]
Apply recall formula.
- Raises
ZeroDivisionError – When TP and FN are 0 and zero_division is set to ‘error’
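The zero_division handling described above can be sketched for precision; recall and F1 behave analogously. A standalone illustration, not the SDK implementation:

```python
def precision(tp, fp, zero_division='warn'):
    """Precision = TP / (TP + FP), with configurable zero-division handling:
    'warn' or 0 -> 0.0, None -> None, 'error' -> raise ZeroDivisionError."""
    if tp + fp == 0:
        if zero_division == 'error':
            raise ZeroDivisionError('TP and FP are 0')
        return None if zero_division is None else 0.0
    return tp / (tp + fp)
```

With the default 'warn' setting the SDK additionally logs a warning before returning 0, which this sketch omits.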
- konfuzio_sdk.evaluate.grouped(group, target: str)¶
Define which correct element in the predicted group defines the “correct” group id_.
- konfuzio_sdk.evaluate.compare(doc_a, doc_b, only_use_correct=False, use_view_annotations=False, ignore_below_threshold=False, strict=True, id_counter: int = 1, custom_threshold=None) pandas.core.frame.DataFrame ¶
Compare the Annotations of two potentially empty Documents with respect to all Annotations.
- Parameters
doc_a – Document which is assumed to be correct
doc_b – Document which needs to be evaluated
only_use_correct – Unrevised feedback in doc_a is assumed to be correct.
use_view_annotations – Will filter for top confidence annotations. Only available when strict=True. When use_view_annotations=True, it will compare only the highest confidence extractions to the ground truth Annotations. When False (default), it compares all extractions to the ground truth Annotations. This setting is ignored when strict=False, as the Non-Strict Evaluation needs to compare all extractions. For more details see https://help.konfuzio.com/modules/extractions/index.html#evaluation
ignore_below_threshold – Ignore Annotations below detection threshold of the Label (only affects TNs)
strict – Evaluate on a character-exact level without any postprocessing; e.g. an amount Span "5,55 " will not match "5,55" exactly.
- Raises
ValueError – When the Category differs.
- Returns
Evaluation DataFrame
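The strict parameter's effect on the "5,55 " example above can be pictured with a minimal sketch. The SDK's non-strict evaluation applies richer normalization than the whitespace stripping shown here, which is only an illustration:

```python
def strict_match(span_a: str, span_b: str) -> bool:
    """Strict evaluation: character-exact comparison, no postprocessing."""
    return span_a == span_b

def non_strict_match(span_a: str, span_b: str) -> bool:
    """Non-strict sketch: compare after trimming surrounding whitespace
    (the SDK's actual normalization is more elaborate)."""
    return span_a.strip() == span_b.strip()
```

Under strict evaluation a trailing space makes the extraction count as wrong, while a non-strict comparison accepts it.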
Trainer utils¶
Add utility common functions and classes to be used for AI Training.
LoggerCallback¶
- class konfuzio_sdk.trainer.utils.LoggerCallback
Custom callback for logger.info to be used in Trainer.
This callback is called by Trainer at the end of every epoch to log metrics. It replaces calling print and tqdm and calls logger.info instead.
- on_log(args, state, control, logs=None, **kwargs)
Log losses and metrics when training or evaluating using Trainer.
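The callback's behavior (routing metric dicts to logger.info instead of print/tqdm) can be sketched as follows; the `format_metrics` helper is a hypothetical name used for illustration:

```python
import logging

logger = logging.getLogger(__name__)

def format_metrics(logs):
    """Render a metrics dict as a single log line."""
    return ', '.join(f'{key}: {value}' for key, value in sorted(logs.items()))

class LoggerCallbackSketch:
    """Sketch of a Trainer callback that logs epoch metrics via logger.info."""

    def on_log(self, args=None, state=None, control=None, logs=None, **kwargs):
        # Called by Trainer at the end of every epoch with a metrics dict.
        if logs:
            logger.info(format_metrics(logs))
```

Routing through the logging module lets the host application control verbosity and destinations, which plain print/tqdm output does not.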
BalancedLossTrainer¶
- class konfuzio_sdk.trainer.utils.BalancedLossTrainer(model: Optional[Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module]] = None, args: Optional[transformers.training_args.TrainingArguments] = None, data_collator: Optional[DataCollator] = None, train_dataset: Optional[torch.utils.data.dataset.Dataset] = None, eval_dataset: Optional[Union[torch.utils.data.dataset.Dataset, Dict[str, torch.utils.data.dataset.Dataset]]] = None, tokenizer: Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None, model_init: Optional[Callable[[], transformers.modeling_utils.PreTrainedModel]] = None, compute_metrics: Optional[Callable[[transformers.trainer_utils.EvalPrediction], Dict]] = None, callbacks: Optional[List[transformers.trainer_callback.TrainerCallback]] = None, optimizers: Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None), preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None)
Custom trainer with custom loss to leverage class weights.
- compute_loss(model, inputs, return_outputs=False)
Compute weighted cross-entropy loss to compensate for unbalanced datasets.
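The idea behind the weighted loss can be sketched without torch: each example's negative log-likelihood is scaled by its class weight, so rare classes contribute more to the average. A simplified standalone illustration, not the Trainer's implementation:

```python
import math

def weighted_cross_entropy(probabilities, targets, class_weights):
    """Mean of class-weighted negative log-likelihoods.
    `probabilities` holds per-example class probability vectors,
    `targets` the true class indices, `class_weights` one weight per class."""
    losses = [
        class_weights[target] * -math.log(probs[target])
        for probs, target in zip(probabilities, targets)
    ]
    return sum(losses) / len(losses)
```

Giving the minority class a weight above 1 penalizes its misclassifications more, counteracting the majority class's dominance of the gradient.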
- log(logs: Dict[str, float]) None
Log logs on the various objects watching training.
Subclass and override this method to inject custom behavior.
- Parameters
logs (Dict[str, float]) – The values to log.
AI containerization¶
Pydantic schemas¶
Containerization utils¶
- konfuzio_sdk.bento.extraction.utils.prepare_request(request: pydantic.main.BaseModel, project: konfuzio_sdk.data.Project, konfuzio_sdk_version: Optional[str] = None) konfuzio_sdk.data.Document ¶
Receive a request and prepare it for the extraction runner.
- Parameters
request – Unprocessed request.
project – A Project instance.
konfuzio_sdk_version – The version of the Konfuzio SDK used by the embedded AI model. Used to apply backwards compatibility changes for older SDK versions.
- Returns
An instance of a Document class.
- konfuzio_sdk.bento.extraction.utils.process_response(result, schema: pydantic.main.BaseModel = <class 'konfuzio_sdk.bento.extraction.schemas.ExtractResponse20240117'>) pydantic.main.BaseModel ¶
Process a raw response from the runner to contain only selected fields.
- Parameters
result – A raw response to be processed.
schema – A schema of the response.
- Returns
A list of dictionaries with Label Set IDs and Annotation data.
- konfuzio_sdk.bento.extraction.utils.convert_document_to_request(document: konfuzio_sdk.data.Document, schema: pydantic.main.BaseModel = <class 'konfuzio_sdk.bento.extraction.schemas.ExtractRequest20240117'>) pydantic.main.BaseModel ¶
Receive a Document and convert it into a request in accordance with the passed schema.
- Parameters
document – A Document to be converted.
schema – A schema to which the request should adhere.
- Returns
A Document converted in accordance with the schema.
- konfuzio_sdk.bento.extraction.utils.convert_response_to_annotations(response: pydantic.main.BaseModel, document: konfuzio_sdk.data.Document, mappings: Optional[dict] = None) konfuzio_sdk.data.Document ¶
Receive an ExtractResponse and convert it into a list of Annotations to be added to the Document.
- Parameters
response – An ExtractResponse to be converted.
document – A Document to which the annotations should be added.
mappings – A dict with “label_sets” and “labels” keys, both containing mappings from old to new IDs. Original IDs are used if no mapping is provided or if the mapping is not found.
- Returns
The original Document with added Annotations.
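The fallback behavior of the mappings parameter (use the original ID when no mapping is provided or the ID is absent from it) can be sketched as:

```python
def map_id(old_id, mapping=None):
    """Map an old Label or Label Set ID to a new one; fall back to the
    original ID when no mapping is provided or the ID is not in it."""
    if not mapping:
        return old_id
    return mapping.get(old_id, old_id)
```

This keeps conversion robust when only part of a Project's Labels and Label Sets were remapped between environments.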