API Reference

Reference guides are technical descriptions of the machinery and how to operate it. Reference material is information-oriented.

Data


Handle data from the API.

Span

class konfuzio_sdk.data.Span(start_offset: int, end_offset: int, annotation: Optional[konfuzio_sdk.data.Annotation] = None, document: Optional[konfuzio_sdk.data.Document] = None, strict_validation: bool = True)

A Span is a sequence of characters or whitespaces without a line break.

For more details see https://dev.konfuzio.com/sdk/explanations.html#span-concept

bbox() konfuzio_sdk.data.Bbox

Calculate the bounding box of a text sequence.

bbox_dict() Dict

Return Span Bbox info as a serializable Dict format for external integration with the Konfuzio Server.

eval_dict()

Return any information needed to evaluate the Span.

static get_sentence_from_spans(spans: Iterable[konfuzio_sdk.data.Span], punctuation=None) List[List[konfuzio_sdk.data.Span]]

Return a list of Spans corresponding to Sentences separated by Punctuation.

property line_index: int

Return index of the line of the Span.

property normalized

Normalize the offset string.

property offset_string: Optional[str]

Calculate the offset string of a Span.

property page: konfuzio_sdk.data.Page

Return Page of Span.

regex()

Suggest a Regex for the offset string.
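
Example (a minimal sketch; the Project and Document IDs are hypothetical and assume data that exists on your Konfuzio instance):

from konfuzio_sdk.data import Project, Span

project = Project(id_=46)                     # hypothetical Project ID
document = project.get_document_by_id(44823)  # hypothetical Document ID

# A Span is defined purely by character offsets into the Document text.
span = Span(start_offset=10, end_offset=20, document=document)
print(span.offset_string)  # the characters between offsets 10 and 20
print(span.page)           # the Page on which the Span is located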

Bbox

class konfuzio_sdk.data.Bbox(x0: int, x1: int, y0: int, y1: int, page: konfuzio_sdk.data.Page, validation=BboxValidationTypes.ALLOW_ZERO_SIZE)

A bounding box relates to an area of a Document Page.

For more details see https://dev.konfuzio.com/sdk/explanations.html#bbox-concept

What constitutes a valid Bbox changes depending on the value of the validation param. If ALLOW_ZERO_SIZE (default), it allows bounding boxes to have zero width or height. This option is available for compatibility reasons, since some OCR engines can sometimes return character-level bboxes with zero width or height. If STRICT, it doesn’t allow zero-size bboxes. If DISABLED, it allows bboxes that have negative size, or coordinates beyond the Page bounds. For the default behaviour see https://dev.konfuzio.com/sdk/tutorials/data_validation/index.html

Parameters

validation – One of ALLOW_ZERO_SIZE (default), STRICT, or DISABLED.
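
Example (a hedged sketch; IDs are hypothetical, and BboxValidationTypes is assumed to be importable from konfuzio_sdk.data, as in the signature above):

from konfuzio_sdk.data import Bbox, BboxValidationTypes, Project

project = Project(id_=46)                     # hypothetical Project ID
document = project.get_document_by_id(44823)  # hypothetical Document ID
page = document.get_page_by_index(0)

# STRICT rejects zero-size Bboxes; ALLOW_ZERO_SIZE (default) accepts them.
bbox = Bbox(x0=10, x1=60, y0=100, y1=120, page=page, validation=BboxValidationTypes.STRICT)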

property area

Return area covered by the Bbox.

check_overlap(bbox: Union[konfuzio_sdk.data.Bbox, Dict]) bool

Verify if there’s overlap between two Bboxes.

property document: konfuzio_sdk.data.Document

Get the Document the Bbox belongs to.

classmethod from_image_size(x0, x1, y0, y1, page: konfuzio_sdk.data.Page) konfuzio_sdk.data.Bbox

Create a Bbox from image dimensions, based on the scaling of character Bboxes within the Document.

This method computes the coordinates of the bottom-left and top-right corners in a coordinate system where the y-axis is oriented from bottom to top, the x-axis is from left to right, and the scale is based on the page.

Parameters
  • x0 – The x-coordinate of the top-left corner in an image-scaled system.

  • x1 – The x-coordinate of the bottom-right corner in an image-scaled system.

  • y0 – The y-coordinate of the top-left corner in an image-scaled system.

  • y1 – The y-coordinate of the bottom-right corner in an image-scaled system.

  • page – The Page object for reference in scaling.

Returns

A Bbox object with rescaled dimensions.

property top

Calculate the distance to the top of the Page.

property x0_image

Get the x0 coordinate in the context of the Page image.

property x1_image

Get the x1 coordinate in the context of the Page image.

property y0_image

Get the y0 coordinate in the context of the Page image, in a top-down coordinate system.

property y1_image

Get the y1 coordinate in the context of the Page image, in a top-down coordinate system.

Annotation

class konfuzio_sdk.data.Annotation(document: konfuzio_sdk.data.Document, annotation_set_id: Optional[int] = None, annotation_set: Optional[konfuzio_sdk.data.AnnotationSet] = None, label: Optional[Union[int, konfuzio_sdk.data.Label]] = None, label_set_id: Optional[int] = None, label_set: Union[None, konfuzio_sdk.data.LabelSet] = None, is_correct: bool = False, revised: bool = False, normalized=None, id_: Optional[int] = None, spans=None, accuracy: Optional[float] = None, confidence: Optional[float] = None, created_by: Optional[int] = None, revised_by: Optional[int] = None, translated_string: Optional[str] = None, custom_offset_string: bool = False, offset_string: Optional[str] = None, *args, **kwargs)

Combine one or more Spans and hold the information about which Label, Label Set and Annotation Set have been assigned to them.

For more details see https://dev.konfuzio.com/sdk/explanations.html#annotation-concept

add_span(span: konfuzio_sdk.data.Span)

Add a Span to an Annotation incl. a duplicate check per Annotation.

bbox() konfuzio_sdk.data.Bbox

Get Bbox encompassing all Annotation Spans.

property bboxes: List[Dict]

Return the Bbox information for all Spans in serialized format.

This is useful for external integration (e.g. Konfuzio Server).

delete(delete_online: bool = True) None

Delete Annotation.

Parameters

delete_online – Whether the Annotation is deleted online or only locally.

property end_offset: int

Legacy: One Annotation can have multiple end offsets.

property eval_dict: List[dict]

Calculate the Span information to evaluate the Annotation.

get_link()

Get link to the Annotation in the SmartView.

property is_multiline: int

Calculate if Annotation spans multiple lines of text.

property label_set: konfuzio_sdk.data.LabelSet

Return Label Set of Annotation.

lose_weight()

Delete data of the instance.

property normalize: str

Provide one normalized offset string (legacy behavior).

property offset_string: List[str]

View the string representation of the Annotation.

property page: konfuzio_sdk.data.Page

Return Page of Annotation.

regex()

Return regex of this Annotation.

regex_annotation_generator(regex_list) List[konfuzio_sdk.data.Span]

Build Spans without Labels by regexes.

Returns

Return sorted list of Spans by start_offset

save(label_set_id=None, annotation_set_id=None, document_annotations: Optional[list] = None) bool

Save Annotation online.

If there is already an Annotation in the same place as the current one, we will not be able to save the current annotation.

In that case, we get the id_ of the original one to be able to track it. The verification of the duplicates is done by checking if the offsets and Label match with any Annotations online. To be sure that we are comparing with the information online, we need to have the Document updated. The update can be done after the request (per annotation) or the updated Annotations can be passed as input of the function (advisable when dealing with big Documents or Documents with many Annotations).

Specify label_set_id if you want to create an Annotation belonging to a new Annotation Set. Specify annotation_set_id if you want to add an Annotation to an existing Annotation Set. Do not specify both of them.

Parameters

document_annotations – Annotations in the Document (list)

Returns

True if new Annotation was created
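
Example (a minimal sketch of creating and saving an Annotation; all IDs and the Label name are hypothetical):

from konfuzio_sdk.data import Annotation, Project, Span

project = Project(id_=46)                     # hypothetical Project ID
document = project.get_document_by_id(44823)  # hypothetical Document ID
label = project.get_label_by_name("Name")     # hypothetical Label name
label_set = project.get_label_set_by_id(63)   # hypothetical Label Set ID

span = Span(start_offset=66, end_offset=78)
annotation = Annotation(
    document=document,
    label=label,
    label_set=label_set,
    is_correct=True,
    spans=[span],
)
created = annotation.save()  # True if a new Annotation was created online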

property spans: List[konfuzio_sdk.data.Span]

Return default entry to get all Spans of the Annotation.

property start_offset: int

Legacy: One Annotation can have multiple start offsets.

token_append(new_regex, regex_quality: int)

Append token if it is not a duplicate.

tokens() List[str]

Create a list of potential tokens based on Spans of this Annotation.

Annotation Set

class konfuzio_sdk.data.AnnotationSet(document, label_set: konfuzio_sdk.data.LabelSet, id_: Optional[int] = None, **kwargs)

An Annotation Set is a group of Annotations. The Labels of those Annotations refer to the same Label Set.

For more details see https://dev.konfuzio.com/sdk/explanations.html#annotation-set-concept

annotations(use_correct: bool = True, ignore_below_threshold: bool = False)

All Annotations currently in this Annotation Set.

property end_line_index: Optional[int]

Calculate ending line of this Annotation Set.

property end_offset: Optional[int]

Calculate the end based on all Annotations above detection threshold currently in this AnnotationSet.

property is_default: bool

Check if AnnotationSet is the default AnnotationSet of the Document.

property start_line_index: Optional[int]

Calculate starting line of this Annotation Set.

property start_offset: Optional[int]

Calculate the earliest start based on all Annotations above detection threshold in this AnnotationSet.

Label

class konfuzio_sdk.data.Label(project: konfuzio_sdk.data.Project, id_: Optional[int] = None, text: Optional[str] = None, get_data_type_display: str = 'Text', text_clean: Optional[str] = None, description: Optional[str] = None, label_sets=None, has_multiple_top_candidates: bool = False, threshold: float = 0.1, *initial_data, **kwargs)

Group Annotations across Label Sets.

For more details see https://dev.konfuzio.com/sdk/explanations.html#label-concept

add_label_set(label_set: konfuzio_sdk.data.LabelSet)

Add Label Set to label, if it does not exist.

Parameters

label_set – Label Set to add

annotations(categories: List[konfuzio_sdk.data.Category], use_correct=True, ignore_below_threshold=False) List[konfuzio_sdk.data.Annotation]

Return related Annotations. Consider that one Label can be used across Label Sets in multiple Categories.

base_regex(category: konfuzio_sdk.data.Category, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None) str

Find the best combination of regex in the list of all regex proposed by Annotations.

evaluate_regex(regex, category: konfuzio_sdk.data.Category, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None, regex_quality=0)

Evaluate a regex on Categories.

Type of regex allows you to group regex by generality

Example:

Three Annotations about the birthdate in two Documents and one regex to be evaluated.

1.doc: “I was born on the 12th of December 1980, you could also say 12.12.1980.” (2 Annotations)

2.doc: “I was born on 12.06.1997.” (1 Annotation)

regex: dd.dd.dddd (without escaped characters for easier reading)

stats:
  • total_correct_findings: 2

  • correct_label_annotations: 3

  • total_findings: 2 -> precision 100%

  • num_docs_matched: 2

  • Project.documents: 2 -> Document recall 100%

find_regex(category: konfuzio_sdk.data.Category, max_findings_per_page=100) List[str]

Find the best combination of regex for Label with before and after context.

get_probable_outliers(categories: List[konfuzio_sdk.data.Category], regex_search: bool = True, regex_worst_percentage: float = 0.1, confidence_search: bool = True, evaluation_data=None, normalization_search: bool = True) List[konfuzio_sdk.data.Annotation]

Get a list of Annotations that are outliers.

Outliers are determined by one of three checks, or a combination of them: they are found by the worst regexes, have the lowest confidence, and/or are not normalizable by the data type of the given Label.

Parameters
  • categories (List[Category]) – Categories under which the search is done.

  • regex_search (bool) – Enable search by top worst regexes.

  • regex_worst_percentage (float) – A % of Annotations returned by the regexes.

  • confidence_search (bool) – Enable search by the lowest-confidence Annotations.

  • normalization_search (bool) – Enable search by normalizing Annotations by the Label’s data type.

Raises

ValueError – When all search options are disabled.
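
Example (a sketch with hypothetical IDs; the confidence search is disabled here because it requires evaluation_data):

from konfuzio_sdk.data import Project

project = Project(id_=46)                   # hypothetical Project ID
label = project.get_label_by_name("Name")   # hypothetical Label name
category = project.get_category_by_id(63)   # hypothetical Category ID

outliers = label.get_probable_outliers(
    categories=[category],
    regex_search=True,
    confidence_search=False,
    normalization_search=True,
)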

get_probable_outliers_by_confidence(evaluation_data, confidence: float = 0.5) List[konfuzio_sdk.data.Annotation]

Get a list of Annotations with the lowest confidence.

This method iterates over the list of Categories, returning the top N Annotations with the lowest confidence score.

Parameters
  • evaluation_data (ExtractionEvaluation instance) – An instance of the ExtractionEvaluation class that contains predicted confidence scores.

  • confidence (float) – A level of confidence below which the Annotations are returned.

get_probable_outliers_by_normalization(categories: List[konfuzio_sdk.data.Category]) List[konfuzio_sdk.data.Annotation]

Get a list of Annotations that do not pass normalization by the data type.

This method iterates over the list of Categories, returning the Annotations that do not fit into the data type of a Label (= have None returned in an attempt of the normalization by the Label’s data type).

Parameters

categories (List[Category]) – Categories under which the search is done.

get_probable_outliers_by_regex(categories: List[konfuzio_sdk.data.Category], use_test_docs: bool = False, top_worst_percentage: float = 0.1) List[konfuzio_sdk.data.Annotation]

Get a list of Annotations that are identified by the least precise regular expressions.

This method iterates over the list of Categories and Annotations within each Category, collecting all the regexes associated with them. It then evaluates these regexes and collects the top worst ones (i.e., those with the least True Positives). For each of these top worst regexes, it returns the Annotations found by them but not by the best regex for that label, potentially identifying them as outliers.

To detect outlier Annotations with multi-Spans, the method iterates over all the multi-Span Annotations under the Label and checks each Span that was not detected by the aforementioned worst regexes. If it is not found by any other regex in the Project, the entire Annotation is considered a potential outlier.

Parameters
  • categories (List[Category]) – A list of Category objects under which the search is conducted.

  • use_test_docs (bool) – Indicates whether the evaluation of the regular expressions occurs on test Documents or training Documents.

  • top_worst_percentage (float) – A threshold for determining what percentage of the worst regexes’ output to return.

Returns

A list of Annotation objects identified by the least precise regular expressions.

Return type

List[Annotation]

has_multiline_annotations(categories: Optional[List[konfuzio_sdk.data.Category]] = None) bool

Return if any Label annotations are multi-line.

lose_weight()

Delete data of the instance.

regex(categories: List[konfuzio_sdk.data.Category], update=False) Dict

Calculate regex to be used in the Extraction AI.

spans(categories: List[konfuzio_sdk.data.Category], use_correct=True, ignore_below_threshold=False) List[konfuzio_sdk.data.Span]

Return all Spans belonging to an Annotation of this Label.

spans_not_found_by_tokenizer(tokenizer, categories: List[konfuzio_sdk.data.Category], use_correct=False) List[konfuzio_sdk.data.Span]

Find Label Spans that are not found by a tokenizer.

Label Set

class konfuzio_sdk.data.LabelSet(project, labels=None, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, is_default=False, categories=None, has_multiple_annotation_sets=False, **kwargs)

A Label Set is a group of Labels.

For more details see https://dev.konfuzio.com/sdk/explanations.html#label-set-concept

add_category(category: konfuzio_sdk.data.Category)

Add Category to the Label Set, if it does not exist.

Parameters

category – Category to add to the Label Set

add_label(label)

Add Label to Label Set, if it does not exist.

Parameters

label – Label ID to be added

get_target_names(use_separate_labels: bool)

Get target string name for Annotation Label classification.

Category

class konfuzio_sdk.data.Category(project, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, *args, **kwargs)

Group Documents in a Project.

For more details see https://dev.konfuzio.com/sdk/explanations.html#category-concept

add_label_set(label_set)

Add Label Set to Category.

property default_label_set

Get the default Label Set of the Category.

documents()

Filter for Documents of this Category.

exclusive_first_page_strings(tokenizer) set

Return a set of strings exclusive for first Pages of Documents within the Category.

Parameters

tokenizer – A tokenizer to process Documents before gathering strings.

property fallback_name: str

Turn the Category name to lowercase, remove parentheses along with their contents, and trim spaces.

property labels

Return the Labels that belong to the Category and its Label Sets.

test_documents()

Filter for test Documents of this Category.

Category Annotation

class konfuzio_sdk.data.CategoryAnnotation(category: konfuzio_sdk.data.Category, confidence: Optional[float] = None, page: Optional[konfuzio_sdk.data.Page] = None, document: Optional[konfuzio_sdk.data.Document] = None, id_: Optional[int] = None)

Annotate the Category of a Page.

For more details see https://dev.konfuzio.com/sdk/explanations.html#category-annotation-concept

property confidence: float

Get the confidence of this Category Annotation.

If the confidence was not set, it means it was never predicted by an AI. Thus, the returned value will be 0, unless it was set by a human, in which case it defaults to 1.

Returns

Confidence between 0.0 and 1.0 included.

set_revised() None

Set this Category Annotation as revised by human, and thus the correct one for the linked Page.

Document

class konfuzio_sdk.data.Document(project: konfuzio_sdk.data.Project, id_: Optional[int] = None, file_url: Optional[str] = None, status: Optional[List[Union[int, str]]] = None, data_file_name: Optional[str] = None, is_dataset: Optional[bool] = None, dataset_status: Optional[int] = None, updated_at: Optional[str] = None, assignee: Optional[int] = None, category_template: Optional[int] = None, category: Optional[konfuzio_sdk.data.Category] = None, category_confidence: Optional[float] = None, category_is_revised: bool = False, text: Optional[str] = None, bbox: Optional[dict] = None, bbox_validation_type=None, pages: Optional[list] = None, update: bool = False, copy_of_id: Optional[int] = None, *args, **kwargs)

Access the information about one Document, which is available online.

For more details see https://dev.konfuzio.com/sdk/explanations.html#document-concept

add_annotation(annotation: konfuzio_sdk.data.Annotation)

Add an Annotation to a Document.

The Annotation is only added to the Document if the data validation tests are passing for this Annotation. See https://dev.konfuzio.com/sdk/tutorials/data_validation/index.html

Parameters

annotation – Annotation to add in the Document

Returns

Input Annotation.

add_annotation_set(annotation_set: konfuzio_sdk.data.AnnotationSet)

Add an Annotation Set to the Document.

add_page(page: konfuzio_sdk.data.Page)

Add a Page to a Document.

annotation_sets(label_set: Optional[konfuzio_sdk.data.LabelSet] = None) List[konfuzio_sdk.data.AnnotationSet]

Return Annotation Sets of the Document.

Parameters

label_set – Label Set for which to filter the Annotation Sets.

Returns

Annotation Sets of the Document.

annotations(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = True, ignore_below_threshold: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Annotation]

Filter available annotations.

Parameters
  • label – Label for which to filter the Annotations.

  • use_correct – If to filter by correct Annotations.

  • ignore_below_threshold – To filter out Annotations with confidence below Label prediction threshold.

Returns

Annotations in the document.
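
Example (a sketch with hypothetical IDs, filtering the correct Annotations of one Label):

from konfuzio_sdk.data import Project

project = Project(id_=46)                     # hypothetical Project ID
document = project.get_document_by_id(44823)  # hypothetical Document ID
label = project.get_label_by_name("Name")     # hypothetical Label name

# Only human-confirmed Annotations of the given Label.
correct_annotations = document.annotations(label=label, use_correct=True)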

property bboxes: Dict[int, konfuzio_sdk.data.Bbox]

Use the cached bbox version.

property category: konfuzio_sdk.data.Category

Return the Category of the Document.

The Category of a Document is only defined as long as all Pages have the same Category. Otherwise, the Document should probably be split into multiple Documents with a consistent Category assignment within their Pages, or the Category for each Page should be manually revised.

property category_annotations: List[konfuzio_sdk.data.CategoryAnnotation]

Collect Category Annotations and average confidence across all Pages.

Returns

List of Category Annotations, one for each Category.

check_annotations(update_document: bool = False) bool

Check if Annotations are valid - no duplicates and correct Category.

check_bbox() None

Run validation checks on the Document text and bboxes.

This is run when the Document is initialized, and usually it’s not needed to be run again because a Document’s text and bboxes are not expected to change within the Konfuzio Server.

You can run this manually instead if your pipeline allows changing the text or the bbox during the lifetime of a document. Will raise ValueError if the bboxes don’t match with the text of the document, or if bboxes have invalid coordinates (outside page borders) or invalid size (negative width or height).

This check is usually slow, and it can be made faster by calling Document.set_text_bbox_hashes() right after initializing the Document, which will enable running a hash comparison during this check.

create_subdocument_from_page_range(start_page: konfuzio_sdk.data.Page, end_page: konfuzio_sdk.data.Page, include=False)

Create a shorter Document from a Page range of an initial Document.

Parameters
  • start_page (Page) – A Page that the new sub-Document starts with.

  • end_page (Page) – A Page that the new sub-Document ends with, if include is True.

  • include (bool) – Whether end_page is included into the new sub-Document.

Returns

A new sub-Document.

property default_annotation_set: konfuzio_sdk.data.AnnotationSet

Return the default Annotation Set of the Document.

delete(delete_online: bool = False)

Delete Document.

delete_document_details()

Delete all local content information for the Document.

property document_folder

Get the path to the folder where all the Document information is cached locally.

download_document_details()

Retrieve data from a Document online in case Document has finished processing.

Data includes Document’s status, URL of its file, name of its file, date of last update, its text and pagination, Annotations and Annotation Sets; optionally, Category information.

eval_dict(use_view_annotations=False, use_correct=False, ignore_below_threshold=False) List[dict]

Use this dict to evaluate Documents. Note that one entry is created for every Span of an Annotation.

evaluate_regex(regex, label: konfuzio_sdk.data.Label, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None)

Evaluate a regex based on the Document.

property file_path

Return path to file.

classmethod from_file(path: str, project: konfuzio_sdk.data.Project, dataset_status: int = 0, category_id: Optional[int] = None, callback_url: str = '', timeout: Optional[int] = None, sync: bool = True) konfuzio_sdk.data.Document

Initialize Document from file with synchronous API call.

This class method will wait for the document to be processed by the server and then return the new Document. This may take a bit of time. When uploading many Documents, it is advised to set the sync option to False.

Parameters
  • path – Path to file to be uploaded

  • project – Project the Document is uploaded to

  • dataset_status – Dataset status of the Document (None: 0, Preparation: 1, Training: 2, Test: 3, Excluded: 4)

  • category_id – Category the Document belongs to (if unset, it will be assigned one by the server)

  • callback_url – Callback URL receiving POST call once extraction is done

  • timeout – Number of seconds to wait for response from the server

  • sync – Whether to wait for the file to be processed by the server

Returns

New Document
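
Example (a sketch of a synchronous upload; the file path and Project ID are hypothetical):

from konfuzio_sdk.data import Document, Project

project = Project(id_=46)  # hypothetical Project ID

# Blocks until the server has processed the file, then returns the new Document.
document = Document.from_file(path="invoice.pdf", project=project, sync=True)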

get_annotation_by_id(annotation_id: int) konfuzio_sdk.data.Annotation

Return an Annotation by ID, searching within the Document.

Parameters

annotation_id – ID of the Annotation to get.

get_annotation_set_by_id(id_: int) konfuzio_sdk.data.AnnotationSet

Return an Annotation Set by ID.

Parameters

id – ID of the Annotation Set to get.

get_annotations() List[konfuzio_sdk.data.Annotation]

Get Annotations of the Document.

get_bbox() Dict

Get bbox information per character of file. We don’t store bbox as an attribute to save memory.

Returns

Bounding box information per character in the Document.

get_bbox_by_page(page_index: int) Dict[str, Dict]

Return all bboxes in a Page.

get_file(ocr_version: bool = True, update: bool = False)

Get OCR version of the original file.

Parameters
  • ocr_version – Bool to get the ocr version of the original file

  • update – Update the downloaded file even if it is already available

Returns

Path to the selected file.

get_images(update: bool = False)

Get Document Pages as PNG images.

Parameters

update – Update the downloaded images even if they are already available

Returns

Path to PNG files.

get_page_by_id(page_id: int, original: bool = False) konfuzio_sdk.data.Page

Get a Page by its ID.

Parameters

page_id (int) – An ID of the Page to fetch.

get_page_by_index(page_index: int)

Return the Page by index.

get_segmentation(timeout: Optional[int] = None, num_retries: Optional[int] = None) List

Retrieve the segmentation results for the Document.

Parameters
  • timeout – Number of seconds to wait for response from the server.

  • num_retries – Number of retries if the request fails.

Returns

A list of segmentation results for each Page in the Document.

get_text_in_bio_scheme(update=False) List[Tuple[str, str]]

Get the text of the Document in the BIO scheme.

Parameters

update – Update the bio annotations even if they are already available

Returns

list of tuples with each word in the text and the respective label

lose_weight()

Remove NO_LABEL, wrong and below threshold Annotations.

property maximum_confidence_category: Optional[konfuzio_sdk.data.Category]

Get the human revised Category of this Document, or the highest confidence one if not revised.

Returns

The found Category, or None if not present.

property maximum_confidence_category_annotation: Optional[konfuzio_sdk.data.CategoryAnnotation]

Get the human revised Category Annotation of this Document, or the highest confidence one if not revised.

Returns

The found Category Annotation, or None if not present.

property no_label_annotation_set: konfuzio_sdk.data.AnnotationSet

Return the Annotation Set for project.no_label Annotations.

We need to load the Annotation Sets from Server first (call self.annotation_sets()). If we create the no_label_annotation_set in the first place, the data from the Server will not be loaded anymore because _annotation_sets will no longer be None.

property number_of_lines: int

Calculate the number of lines.

property number_of_pages: int

Calculate the number of Pages.

property ocr_file_path

Return path to OCR PDF file.

property ocr_ready

Check if Document OCR is ready.

pages() List[konfuzio_sdk.data.Page]

Get Pages of Document.

propose_splitting(splitting_ai) List

Propose splitting for a multi-file Document.

Parameters

splitting_ai – An initialized SplittingAI class

save()

Save all local changes to Document to server.

save_meta_data()

Save local changes to Document metadata to server.

set_bboxes(characters: Dict[int, konfuzio_sdk.data.Bbox])

Set character Bbox dictionary.

set_category(category: konfuzio_sdk.data.Category) None

Set the Category of the Document and the Category of all of its Pages as revised.

set_text_bbox_hashes() None

Update hashes of Document text and bboxes. Can be used for checking later on if any changes happened.

spans(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Span]

Return all Spans of the Document.

property text

Get Document text. Once loaded, it is kept in memory.

update()

Update Document information.

update_meta_data(assignee: Optional[int] = None, category_template: Optional[int] = None, category: Optional[konfuzio_sdk.data.Category] = None, data_file_name: Optional[str] = None, dataset_status: Optional[int] = None, status: Optional[List[Union[int, str]]] = None, **kwargs)

Update document metadata information.

view_annotations(start_offset: int = 0, end_offset: Optional[int] = None) List[konfuzio_sdk.data.Annotation]

Get the best Annotations, where the Spans are not overlapping.

Page

class konfuzio_sdk.data.Page(id_: Optional[int], document: konfuzio_sdk.data.Document, number: int, original_size: Tuple[float, float], image_size: Tuple[int, int] = (None, None), start_offset: Optional[int] = None, end_offset: Optional[int] = None, copy_of_id: Optional[int] = None)

Access the information about one Page of a Document.

For more details see https://dev.konfuzio.com/sdk/explanations.html#page-concept

add_category_annotation(category_annotation: konfuzio_sdk.data.CategoryAnnotation)

Annotate a Page with a Category and confidence information.

annotation_sets() List[konfuzio_sdk.data.AnnotationSet]

Show all Annotation Sets related to Annotations of the Page.

annotations(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = True, ignore_below_threshold: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Annotation]

Get Page Annotations.

property category: Optional[konfuzio_sdk.data.Category]

Get the Category of the Page, based on human revised Category Annotation, or on highest confidence.

get_annotations_image(display_all: bool = False) PIL.Image.Image

Get Document Page as PNG with Annotations shown.

get_bbox()

Get bbox information per character of Page.

get_category_annotation(category, add_if_not_present: bool = False) konfuzio_sdk.data.CategoryAnnotation

Retrieve the Category Annotation associated with a specific Category within this Page.

If no Category Annotation is found for the provided Category, one can be created based on the add_if_not_present argument.

Parameters
  • category (Category) – The Category for which to retrieve the Category Annotation.

  • add_if_not_present (bool) – If True, a Category Annotation will be added to the current Page if none is found. If False, a dummy Category Annotation will be created, not linked to any Document or Page.

Returns

The located or newly created Category Annotation.

Return type

CategoryAnnotation

get_image(update: bool = False) PIL.Image.Image

Get Page as a Pillow Image object.

The Page image is loaded from a PNG file at Page.image_path. If the file is not present, or if update is True, it will be downloaded from the Konfuzio Host. Alternatively, if you don’t want to use a file, you can provide the image as bytes to Page.image_bytes. Then call this method to convert the bytes into a Pillow Image. In every case, the return value of this method and the attribute Page.image will be a Pillow Image.

Parameters

update – Whether to force download the Page PNG file.

Returns

A Pillow Image object for this Page’s image.
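
Example (a sketch with hypothetical IDs that renders the first Page and saves it locally):

from konfuzio_sdk.data import Project

project = Project(id_=46)                     # hypothetical Project ID
document = project.get_document_by_id(44823)  # hypothetical Document ID

page = document.pages()[0]
image = page.get_image(update=True)  # a Pillow Image, downloaded if needed
image.save("page_1.png")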

get_original_page() konfuzio_sdk.data.Page

Return an “original” Page in case the current Page is a copy without an ID.

An “original” Page is a Page from the Document that is not a copy and not a Virtual Document. This Page has an ID.

The method is used in the File Splitting pipeline to retain the original Document’s information in the Sub-Documents that were created from its splitting. The original Document is a Document that has an ID and is not a deepcopy.

lines() List[konfuzio_sdk.data.Span]

Return sorted list of Spans for each line in the Page.

property maximum_confidence_category_annotation: Optional[konfuzio_sdk.data.CategoryAnnotation]

Get the human revised Category Annotation of this Page, or the highest confidence one if not revised.

Returns

The found Category Annotation, or None if not present.

property number_of_lines: int

Calculate the number of lines in Page.

set_category(category: konfuzio_sdk.data.Category) None

Set the Category of the Page.

Parameters

category – The Category to set for the Page.

spans(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Span]

Return all Spans of the Page.

property text

Get Document text corresponding to the Page.

view_annotations() List[konfuzio_sdk.data.Annotation]

Get the best Annotations, where the Spans are not overlapping in Page.

Project

class konfuzio_sdk.data.Project(id_: Optional[int], project_folder=None, update=False, max_ram=None, strict_data_validation: bool = True, credentials: dict = {}, **kwargs)

Access the information of a Project.

For more details see https://dev.konfuzio.com/sdk/explanations.html#project-concept
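
Example (a minimal sketch; the Project ID is hypothetical):

from konfuzio_sdk.data import Project

project = Project(id_=46)  # hypothetical Project ID

for document in project.documents:  # Documents with status training
    print(document.id_, document.category)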

add_category(category: konfuzio_sdk.data.Category)

Add Category to Project, if it does not exist.

Parameters

category – Category to add in the Project

add_document(document: konfuzio_sdk.data.Document)

Add Document to Project, if it does not exist.

add_label(label: konfuzio_sdk.data.Label)

Add Label to Project, if it does not exist.

Parameters

label – Label to add in the Project

add_label_set(label_set: konfuzio_sdk.data.LabelSet)

Add Label Set to Project, if it does not exist.

Parameters

label_set – Label Set to add in the Project

property ai_models

Return all AIs.

del_document_by_id(document_id: int, delete_online: bool = False) konfuzio_sdk.data.Document

Delete Document by its ID.

delete()

Delete the Project folder.

property documents

Return Documents with status training.

property documents_folder: str

Calculate the folder containing the Documents of the Project.

download_training_and_test_data() None

Download all training and test data, e.g. to migrate your Project to another host.

See https://dev.konfuzio.com/web/migration-between-konfuzio-server-instances/index.html

property excluded_documents

Return Documents which have been excluded.

export_project_data(include_ais=False, training_and_test_documents=True) None

Export the Project data, including training and test Documents and AI models.

Parameters
  • include_ais – Whether to include AI models in the export.

  • training_and_test_documents – Whether to include training and test Documents in the export.

get(update=False)

Access meta information of the Project.

Parameters

update – Update the downloaded information even if it is already available

get_categories(reload: bool = True)

Load Categories for all Label Sets in the Project.

get_category_by_id(id_: int) konfuzio_sdk.data.Category

Return a Category by ID.

Parameters

id – ID of the Category to get.

get_credentials(key)

Return the value of the key in the credentials dict or in the config file.

Returns None and emits a warning if the key is not found.

Parameters

key – Key of the credential to get.

get_document_by_id(document_id: int) konfuzio_sdk.data.Document

Return Document by its ID.

get_label_by_id(id_: int) konfuzio_sdk.data.Label

Return a Label by ID.

Parameters

id – ID of the Label to get.

get_label_by_name(name: str) konfuzio_sdk.data.Label

Return Label by its name.

get_label_set_by_id(id_: int) konfuzio_sdk.data.LabelSet

Return a Label Set by ID.

Parameters

id – ID of the Label Set to get.

get_label_set_by_name(name: str) konfuzio_sdk.data.LabelSet

Return a Label Set by its name.

get_label_sets(reload=False)

Get LabelSets in the Project.

get_labels(reload=False) List[konfuzio_sdk.data.Label]

Get ID and name of any Label in the Project.

get_meta(reload=False)

Get the list of all Documents in the Project and their information.

Returns

Information of the Documents in the Project.

init_or_update_document(from_online=False)

Initialize or update Documents from local files, then decide between a full, incremental, or no update.

Parameters

from_online – If True, all Document metadata info is first reloaded with latest changes in the server

property label_sets

Return Project LabelSets.

property labels

Return Project Labels.

lose_weight()

Delete data of the instance.

property max_ram

Return maximum memory used by AI models.

property meta_data

Return Project meta data.

property model_folder: str

Calculate the model folder of the Project.

property no_status_documents

Return Documents with no status.

property online_documents_dict: Dict

Return a dictionary of online documents using their id as key.

property preparation_documents

Return Documents with status preparation.

property project_folder: str

Calculate the data folder of the Project.

property regex_folder: str

Calculate the regex folder of the Project.

property test_documents

Return Documents with status test.

property virtual_documents

Return Documents created virtually.

write_meta_of_files()

Overwrite meta-data of Documents in Project.

write_project_files()

Overwrite files with Project, Label, Label Set information.

API call wrappers


Connect to the Konfuzio Server to receive or send data.

TimeoutHTTPAdapter

class konfuzio_sdk.api.TimeoutHTTPAdapter(timeout, *args, **kwargs)

Combine a retry strategy with a timeout strategy.

build_response(req, resp)

Throw error for any HTTPError that is not part of the retry strategy.

send(request, *args, **kwargs)

Use timeout policy if not otherwise declared.

konfuzio_sdk.api.init_env(user: str, password: str, host: str = 'https://app.konfuzio.com', working_directory=os.getcwd(), file_ending: str = '.env')

Add the .env file to the working directory.

Parameters
  • user – Username to log in to the host

  • password – Password to log in to the host

  • host – URL of host.

  • working_directory – Directory where file should be added

  • file_ending – Ending of file.

konfuzio_sdk.api.konfuzio_session(token: Optional[str] = None, timeout: Optional[int] = None, num_retries: Optional[int] = None, host: Optional[str] = None)

Create a session incl. Token to the KONFUZIO_HOST.

Parameters
  • token – Konfuzio Token to connect to the host.

  • timeout – Timeout in seconds.

  • num_retries – Number of retries if the request fails.

  • host – Host to connect to.

Returns

Request session.
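
Example (a sketch of creating a session once and reusing it across API calls; the Project ID is hypothetical):

from konfuzio_sdk.api import get_project_details, konfuzio_session

session = konfuzio_session(timeout=30, num_retries=3)
details = get_project_details(project_id=46, session=session)  # hypothetical ID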

konfuzio_sdk.api.get_project_list(session=None)

Get the list of all Projects for the user.

Parameters

session – Konfuzio session with Retry and Timeout policy

Returns

Response object

konfuzio_sdk.api.get_project_details(project_id: int, session=None) dict

Get Project’s metadata.

Parameters
  • project_id – ID of the Project

  • session – Konfuzio session with Retry and Timeout policy

Returns

Project metadata

konfuzio_sdk.api.get_project_labels(project_id: int, session=None) dict

Get Project’s Labels.

Parameters
  • project_id – An ID of a Project to get Labels from.

  • session – Konfuzio session with Retry and Timeout policy

konfuzio_sdk.api.get_project_label_sets(project_id: int, session=None) dict

Get Project’s Label Sets.

Parameters
  • project_id – An ID of a Project to get Label Sets from.

  • session – Konfuzio session with Retry and Timeout policy

konfuzio_sdk.api.create_new_project(project_name, session=None)

Create a new Project for the user.

Parameters
  • project_name – name of the project you want to create

  • session – Konfuzio session with Retry and Timeout policy

Returns

Response object

konfuzio_sdk.api.get_document_details(document_id: int, session=None)

Use the text-extraction server to retrieve the data from a document.

Parameters
  • document_id – ID of the document

  • session – Konfuzio session with Retry and Timeout policy

Returns

Data of the document.

konfuzio_sdk.api.get_document_annotations(document_id: int, session=None)

Get Annotations of a Document.

Parameters
  • document_id – ID of the Document.

  • session – Konfuzio session with Retry and Timeout policy

Returns

List of the Annotations of the Document.

konfuzio_sdk.api.get_document_bbox(document_id: int, session=None)

Get Bboxes for a Document.

Parameters
  • document_id – ID of the Document.

  • session – Konfuzio session with Retry and Timeout policy

Returns

List of Bboxes of characters in the Document

konfuzio_sdk.api.get_page_image(document_id: int, page_number: int, session=None, thumbnail: bool = False)

Load image of a Page as Bytes.

Parameters
  • document_id – ID of the Document

  • page_number – Number of the Page

  • thumbnail – Download Page image as thumbnail

  • session – Konfuzio session with Retry and Timeout policy

Returns

Bytes of the Image.

konfuzio_sdk.api.post_document_annotation(document_id: int, spans: List, label_id: int, confidence: Optional[float] = None, revised: bool = False, is_correct: bool = False, session=None, **kwargs)

Add an Annotation to an existing document.

You must specify either annotation_set_id or label_set_id.

Use annotation_set_id if an Annotation Set already exists. You can find the list of existing Annotation Sets by using the GET endpoint of the Document.

Using label_set_id will create a new Annotation Set associated with that Label Set. You can only do this if the Label Set has has_multiple_sections set to True.

Parameters
  • document_id – ID of the file

  • spans – Spans that constitute the Annotation

  • label_id – ID of the Label

  • confidence – Confidence of the Annotation (still called accuracy by the text-annotation server)

  • revised – If the Annotation is revised or not (bool)

  • is_correct – If the Annotation is correct or not (bool)

  • session – Konfuzio session with Retry and Timeout policy

Returns

Response status.
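
Example (a hedged sketch; all IDs are hypothetical, and the span payload format shown here is an assumption about what the server expects):

from konfuzio_sdk.api import post_document_annotation

response = post_document_annotation(
    document_id=44823,                               # hypothetical Document ID
    spans=[{"start_offset": 66, "end_offset": 78}],  # assumed payload format
    label_id=12,                                     # hypothetical Label ID
    annotation_set_id=78,                            # existing Annotation Set (hypothetical ID)
    is_correct=True,
)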

konfuzio_sdk.api.change_document_annotation(annotation_id: int, session=None, **kwargs)

Change something about an Annotation.

Parameters
  • annotation_id – ID of an Annotation to be changed

  • session – Konfuzio session with Retry and Timeout policy

Returns

Response status.

konfuzio_sdk.api.delete_document_annotation(annotation_id: int, session=None, delete_from_database: bool = False, **kwargs)

Delete a given Annotation of the given document.

For AI training purposes, we recommend setting delete_from_database to False if you don’t want to remove the Annotation permanently: this creates a negative-feedback Annotation and does not remove it from the database.

Parameters
  • annotation_id – ID of the annotation

  • session – Konfuzio session with Retry and Timeout policy

Returns

Response status.

konfuzio_sdk.api.update_document_konfuzio_api(document_id: int, session=None, **kwargs)

Update an existing Document via Konfuzio API.

Parameters
  • document_id – ID of the document

  • session – Konfuzio session with Retry and Timeout policy

Returns

Response status.

konfuzio_sdk.api.download_file_konfuzio_api(document_id: int, ocr: bool = True, session=None)

Download file from the Konfuzio server using the Document id_.

Django authentication is form-based, whereas DRF uses BasicAuth.

Parameters
  • document_id – ID of the document

  • ocr – Bool to get the ocr version of the document

  • session – Konfuzio session with Retry and Timeout policy

Returns

The downloaded file.

konfuzio_sdk.api.get_results_from_segmentation(doc_id: int, project_id: int, session=None) List[List[dict]]

Get bbox results from segmentation endpoint.

Parameters
  • doc_id – ID of the document

  • project_id – ID of the Project.

  • session – Konfuzio session with Retry and Timeout policy

konfuzio_sdk.api.get_project_categories(project_id: Optional[int] = None, session=None) List[Dict]

Get a list of Categories of a Project.

Parameters
  • project_id – ID of the Project.

  • session – Konfuzio session with Retry and Timeout policy

konfuzio_sdk.api.upload_ai_model(ai_model_path: str, project_id: Optional[int] = None, category_id: Optional[int] = None, session=None)

Upload an ai_model to the text-annotation server.

Parameters
  • ai_model_path – Path to the ai_model

  • project_id – An ID of a Project to which the AI is uploaded. Needed for the File Splitting and Categorization AIs because they function on a Project level.

  • category_id – An ID of a Category on which the AI is trained. Needed for the Extraction AI because it functions on a Category level and requires a single Category.

  • session – Session to connect to the server.

Raises
  • ValueError – When neither project_id nor category_id is specified.

  • HTTPError – When a request is unsuccessful.

konfuzio_sdk.api.delete_ai_model(ai_model_id: int, ai_type: str, session=None)

Delete an AI model from the server.

Parameters
  • ai_model_id – an ID of the model to be deleted.

  • ai_type – Should be one of the following: ‘filesplitting’, ‘extraction’, ‘categorization’.

  • session – session to connect to the server.

Raises
  • ValueError – If ai_type is not correctly specified.

  • ConnectionError – When a request is unsuccessful.

konfuzio_sdk.api.update_ai_model(ai_model_id: int, ai_type: str, patch: bool = True, session=None, **kwargs)

Update an AI model on the server.

Parameters
  • ai_model_id – an ID of the model to be updated.

  • ai_type – Should be one of the following: ‘filesplitting’, ‘extraction’, ‘categorization’.

  • patch – If true, adds info instead of replacing it.

  • session – session to connect to the server.

Raises
  • ValueError – If ai_type is not correctly specified.

  • HTTPError – When a request is unsuccessful.

konfuzio_sdk.api.get_all_project_ais(project_id: int, session=None) dict

Fetch all types of AIs for a specific project.

Parameters
  • project_id – ID of the Project

  • session – Konfuzio session with Retry and Timeout policy

Returns

Dictionary with lists of all AIs for a specific project

konfuzio_sdk.api.export_ai_models(project, session=None) int

Export all AI Model files for a specific Project.

Parameters

project – Konfuzio Project

Returns

Number of exported AIs

CLI tools


Command Line interface to the konfuzio_sdk package.

konfuzio_sdk.cli.parse_args(parser)

Parse command line arguments using sub-parsers for each command.

konfuzio_sdk.cli.credentials(args)

Retrieve user input or use CLI arguments.

Extras


Initialize AI-related dependencies safely.

PackageWrapper

class konfuzio_sdk.extras.PackageWrapper(package_name: str, required_for_modules: Optional[List[str]] = None)

Heavy dependencies are encapsulated and handled if they are not part of the lightweight SDK installation.

ModuleWrapper

class konfuzio_sdk.extras.ModuleWrapper(module: str)

Handle missing dependencies’ classes to avoid metaclass conflict.

Normalization


Convert the Span according to the data_type of the Annotation.

konfuzio_sdk.normalize.normalize_to_float(offset_string: str) Optional[float]

Given an offset_string, this function tries to translate the offset-string to a number.

konfuzio_sdk.normalize.normalize_to_positive_float(offset_string: str) Optional[float]

Given an offset_string this function tries to translate the offset-string to an absolute number (ignores +/-).

konfuzio_sdk.normalize.normalize_to_percentage(offset_string: str) Optional[float]

Given an offset_string, this function tries to translate the offset-string to a percentage, i.e. a float between 0 and 1.

konfuzio_sdk.normalize.normalize_to_date(offset_string: str) Optional[str]

Given an offset_string, this function tries to translate the offset-string to a date in the format ‘DD.MM.YYYY’.

konfuzio_sdk.normalize.normalize_to_bool(offset_string: str)

Given an offset_string this function tries to translate the offset-string to a bool.

konfuzio_sdk.normalize.roman_to_float(offset_string: str) Optional[float]

Convert a Roman numeral to a float.

konfuzio_sdk.normalize.normalize(offset_string, data_type)

Wrap all normalize functionality.
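
Example (a sketch of the individual normalizers; exact return values depend on the SDK’s parsing rules, so they are only hinted at in the comments):

from konfuzio_sdk.normalize import normalize_to_date, normalize_to_float

print(normalize_to_date("12.12.1980"))  # a 'DD.MM.YYYY' string, or None if not parseable
print(normalize_to_float("1,50"))       # a float, or None if not parseable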

Utils


Utils for the konfuzio sdk package.

konfuzio_sdk.utils.sdk_isinstance(instance, klass)

Implement a custom isinstance which is compatible with cloudpickle saving by value.

When using cloudpickle with “register_pickle_by_value” the classes of “konfuzio.data” will be loaded in the “types” module. For this case the builtin method “isinstance” will return False because it tries to compare “types.Document” with “konfuzio_sdk.data.Document”.
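
Example (a minimal sketch; the IDs are hypothetical):

from konfuzio_sdk.data import Document, Project
from konfuzio_sdk.utils import sdk_isinstance

project = Project(id_=46)                     # hypothetical Project ID
document = project.get_document_by_id(44823)  # hypothetical Document ID

# Works even when the class was re-registered by cloudpickle into the types module.
assert sdk_isinstance(document, Document)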

konfuzio_sdk.utils.exception_or_log_error(msg: str, handler: str = 'sdk', fail_loudly: typing.Optional[bool] = True, exception_type: typing.Optional[typing.Type[Exception]] = <class 'ValueError'>) None

Log error or raise an exception.

This function is needed to control error handling in production. If fail_loudly is set to True, the function raises an exception of type exception_type with a message and handler in the format {"message": msg, "handler": handler}.

If fail_loudly is set to False, the function logs an error with msg using the logger.

Parameters
  • msg – (str): The error message to be logged or raised.

  • handler – (str): The handler associated with the error. Defaults to “sdk”

  • fail_loudly – A flag indicating whether to raise an exception or log the error. Defaults to True.

  • exception_type – The type of exception to be raised. Defaults to ValueError.

Returns

None

konfuzio_sdk.utils.get_id(include_time: bool = False) str

Generate a unique ID.

Parameters

include_time – Bool to include the time in the unique ID

Returns

Unique ID

konfuzio_sdk.utils.is_file(file_path, raise_exception=True, maximum_size=100000000, allow_empty=False) bool

Check if file is available or raise error if it does not exist.

Parameters
  • file_path – Path to the file to be checked

  • raise_exception – Will raise an exception if file is not available

  • maximum_size – Maximum size of the expected file, default < 100 mb

  • allow_empty – Bool to allow empty files

Returns

True or false depending on the existence of the file

konfuzio_sdk.utils.memory_size_of(obj) int

Return memory size of object in bytes.

konfuzio_sdk.utils.normalize_memory(memory: Union[None, str]) Optional[int]

Convert a memory size given in human-readable form to an int number of bytes.

Parameters

memory – Memory size in human readable form (e.g. “50MB”).

Returns

int of bytes if valid, else None

konfuzio_sdk.utils.get_timestamp(konfuzio_format='%Y-%m-%d-%H-%M-%S') str

Return formatted timestamp.

Parameters

konfuzio_format – Format of the timestamp (e.g. year-month-day-hour-min-sec)

Returns

Timestamp

konfuzio_sdk.utils.load_image(input_file: Union[str, _io.BytesIO])

Load an image by path or via io.Bytes, e.g. via download by URL.

Parameters

input_file – Path to image or image in bytes format

Returns

Loaded image

konfuzio_sdk.utils.get_file_type(input_file: Optional[Union[str, _io.BytesIO, bytes]] = None) str

Get the type of a file.

Parameters

input_file – Path to the file or file in bytes format

Returns

Name of file type

konfuzio_sdk.utils.get_file_type_and_extension(input_file: Optional[Union[str, _io.BytesIO, bytes]] = None) Tuple[str, str]

Get the type of a file via the filetype library, which checks the magic bytes to see the internal format.

Parameters

input_file – Path to the file or file in bytes format

Returns

Name of file type

konfuzio_sdk.utils.does_not_raise()

Serve a complement to raise, no-op context manager does_not_raise.

docs.pytest.org/en/latest/example/parametrize.html#parametrizing-conditional-raising

konfuzio_sdk.utils.convert_to_bio_scheme(document) List[Tuple[str, str]]

Mark all the entities in the text as per the BIO scheme.

The splitting uses the sequence of words, treating some characters, like “.”, as separate tokens.

Example:

Hello O
, O
it O
’s O
Helm B-ORG
und I-ORG
Nagel I-ORG
. O

Parameters

document – Document to be converted into the bio scheme

Returns

list of tuples with each word in the text and the respective Label

konfuzio_sdk.utils.slugify(value)

Taken from https://github.com/django/django/blob/master/django/utils/text.py.

Convert to ASCII if ‘allow_unicode’ is False. Convert spaces or repeated dashes to single dashes. Remove characters that aren’t alphanumerics, underscores, or hyphens. Convert to lowercase. Also strip leading and trailing whitespace, dashes, and underscores.

konfuzio_sdk.utils.amend_file_name(file_name: str, append_text: str = '', append_separator: str = '_', new_extension: Optional[str] = None) str

Append text to a filename in front of extension.

example found here: https://stackoverflow.com/a/37487898

Parameters
  • new_extension – Change the file extension

  • file_name – Name of a file, e.g. file.pdf

  • append_text – Text you want to append between the file name and the extension

Returns

extended path to file

konfuzio_sdk.utils.amend_file_path(file_path: str, append_text: str = '', append_separator: str = '_', new_extension: Optional[str] = None)

Similar to amend_file_name however the file_name is interpreted as a full path.

Parameters
  • new_extension – Change the file extension

  • file_path – Name of a file, e.g. file.pdf

  • append_text – Text you want to append between the file name and the extension

Returns

extended path to file

konfuzio_sdk.utils.get_sentences(text: str, offsets_map: Optional[dict] = None, language: str = 'german') List[dict]

Split a text into sentences using the sentence tokenizer from the package nltk.

Parameters
  • text – Text to split into sentences

  • offsets_map – Mapping between the position of the character in the input text and the offset in the text of the Document

  • language – Language of the text

Returns

List with a dict per sentence, with its text and its start and end offsets in the text of the Document.

konfuzio_sdk.utils.map_offsets(characters_bboxes: list) dict

Map the position of the character to its offset.

E.g.: characters: x, y, z, w characters offsets: 2, 3, 20, 22

The first character (x) has the offset 2. The fourth character (w) has the offset 22. …

offsets_map: {0: 2, 1: 3, 2: 20, 3: 22}

Parameters

characters_bboxes – Bounding boxes information of the characters.

Returns

Mapping of the position of the characters and its offsets.

konfuzio_sdk.utils.detectron_get_paragraph_bboxes(detectron_document_results: List[List[Dict]], document) List[List[Bbox]]

Get the detectron Bboxes corresponding to each paragraph.

konfuzio_sdk.utils.iter_before_and_after(iterable, before=1, after=None, fill=None)

Iterate and provide before and after element. Generalized from http://stackoverflow.com/a/1012089.

konfuzio_sdk.utils.get_sdk_version()

Get the version of the Konfuzio SDK currently used.

konfuzio_sdk.utils.get_spans_from_bbox(selection_bbox: Bbox) List[Span]

Get a list of Spans for all the text contained within a Bbox.

Tokenizers


Generic tokenizer.

Abstract Tokenizer

class konfuzio_sdk.tokenizer.base.AbstractTokenizer

Abstract definition of a Tokenizer.

evaluate(document: konfuzio_sdk.data.Document) pandas.core.frame.DataFrame

Compare a Document with its tokenized version.

Parameters

document – Document to evaluate

Returns

Evaluation DataFrame

evaluate_dataset(dataset_documents: List[konfuzio_sdk.data.Document]) konfuzio_sdk.evaluate.ExtractionEvaluation

Evaluate the tokenizer on a dataset of documents.

Parameters

dataset_documents – Documents to evaluate

Returns

ExtractionEvaluation instance

abstract found_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]

Find all Spans in a Document that can be found by a Tokenizer.

get_runtime_info() pandas.core.frame.DataFrame

Get the processing runtime information as DataFrame.

Returns

DataFrame containing the processing duration of all steps of the tokenization.

lose_weight()

Delete processing steps.

missing_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]

Apply a Tokenizer on a Document and find all Spans that cannot be found.

Use this approach to sequentially work on remaining Spans after a Tokenizer ran on a List of Documents.

Parameters

document – A Document

Returns

A list containing all missing Spans.

abstract tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Create Annotations with 1 Span based on the result of the Tokenizer.

Parameters

document – Document to tokenize, can have been tokenized before

Returns

Document with Spans created by the Tokenizer.

List Tokenizer

class konfuzio_sdk.tokenizer.base.ListTokenizer(tokenizers: List[konfuzio_sdk.tokenizer.base.AbstractTokenizer])

Use multiple tokenizers.

found_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]

Run found_spans in the given order on a Document.

lose_weight()

Delete processing steps.

span_match(span: konfuzio_sdk.data.Span) bool

Run span_match in the given order.

tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Run tokenize in the given order on a Document.
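
Example (a sketch chaining two of the regex tokenizers documented below):

from konfuzio_sdk.tokenizer.base import ListTokenizer
from konfuzio_sdk.tokenizer.regex import CapitalizedTextTokenizer, ColonPrecededTokenizer

# The tokenizers are applied to a Document in the given order.
tokenizer = ListTokenizer(tokenizers=[ColonPrecededTokenizer(), CapitalizedTextTokenizer()])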

Rule Based Tokenizer

Regex tokenizers.

Regex Tokenizer

class konfuzio_sdk.tokenizer.regex.RegexTokenizer(regex: str)

Tokenizer based on a single regex.

found_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]

Find Spans found by the Tokenizer and add Tokenizer info to Span.

Parameters

document – Document with Annotation to find.

Returns

List of Spans found by the Tokenizer.

span_match(span: konfuzio_sdk.data.Span) bool

Check if Span is detected by Tokenizer.

tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Create Annotations with 1 Span based on the result of the Tokenizer.

Parameters

document – Document to tokenize, can have been tokenized before

Returns

Document with Spans created by the Tokenizer.
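
A minimal sketch, assuming document is an already initialized konfuzio_sdk.data.Document; the date-like regex is illustrative:

    from copy import deepcopy

    from konfuzio_sdk.tokenizer.regex import RegexTokenizer

    # Create one single-Span Annotation per match of the regex.
    tokenizer = RegexTokenizer(regex=r'\d{2}\.\d{2}\.\d{4}')
    virtual_document = tokenizer.tokenize(deepcopy(document))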

class konfuzio_sdk.tokenizer.regex.CapitalizedTextTokenizer

Tokenizer based on capitalized text.

Example:

“Company is Company A&B GmbH now” -> “Company A&B GmbH”

class konfuzio_sdk.tokenizer.regex.ColonOrWhitespacePrecededTokenizer

Tokenizer based on text preceded by a colon or a whitespace.

Example:

“write to: name” -> “name”

class konfuzio_sdk.tokenizer.regex.ColonPrecededTokenizer

Tokenizer based on text preceded by a colon.

Example:

“write to: name” -> “name”

class konfuzio_sdk.tokenizer.regex.ConnectedTextTokenizer

Tokenizer based on text connected by 1 whitespace.

Example:

r”This is\na description. Occupies a paragraph.” -> “This is”, “a description. Occupies a paragraph.”

class konfuzio_sdk.tokenizer.regex.LineUntilCommaTokenizer

Tokenizer based on the text of a line until the first comma.

Example:

“\n Company und A&B GmbH,\n” -> “Company und A&B GmbH”

class konfuzio_sdk.tokenizer.regex.NonTextTokenizer

Tokenizer based on non-text characters: numbers and separators.

Example:

“date 01. 01. 2022” -> “01. 01. 2022”

class konfuzio_sdk.tokenizer.regex.NumbersTokenizer

Tokenizer based on numbers.

Example:

“N. 1242022 123 ” -> “1242022 123”

class konfuzio_sdk.tokenizer.regex.WhitespaceNoPunctuationTokenizer

Tokenizer based on whitespaces without punctuation.

Example:

“street Name 1-2b,” -> “street”, “Name”, “1-2b”

class konfuzio_sdk.tokenizer.regex.WhitespaceTokenizer

Tokenizer based on whitespaces.

Example:

“street Name 1-2b,” -> “street”, “Name”, “1-2b,”
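
A usage sketch matching the example above, assuming document is an already initialized konfuzio_sdk.data.Document:

    from copy import deepcopy

    from konfuzio_sdk.tokenizer.regex import WhitespaceTokenizer

    # Each whitespace-separated chunk becomes one single-Span Annotation;
    # trailing punctuation stays attached, e.g. "1-2b,".
    tokenizer = WhitespaceTokenizer()
    virtual_document = tokenizer.tokenize(deepcopy(document))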

Sentence and Paragraph tokenizers.

Paragraph Tokenizer

class konfuzio_sdk.tokenizer.paragraph_and_sentence.ParagraphTokenizer(mode: str = 'detectron', line_height_ratio: float = 0.8, height: Optional[Union[int, float]] = None, create_detectron_labels: bool = False)

Tokenizer splitting Document into paragraphs.

found_spans(document: konfuzio_sdk.data.Document)

Find the Spans of the detected paragraphs.

tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Create one multiline Annotation per paragraph detected.
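
A usage sketch, assuming document is an already initialized konfuzio_sdk.data.Document; in the default 'detectron' mode the paragraph regions come from a segmentation model:

    from copy import deepcopy

    from konfuzio_sdk.tokenizer.paragraph_and_sentence import ParagraphTokenizer

    tokenizer = ParagraphTokenizer(mode='detectron')
    # One multiline Annotation is created per detected paragraph.
    virtual_document = tokenizer.tokenize(deepcopy(document))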

Sentence Tokenizer

class konfuzio_sdk.tokenizer.paragraph_and_sentence.SentenceTokenizer(mode: str = 'detectron', line_height_ratio: float = 0.8, height: Optional[Union[int, float]] = None, create_detectron_labels: bool = False)

Tokenizer splitting Document into Sentences.

found_spans(document: konfuzio_sdk.data.Document)

Find the Spans of the detected sentences.

tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Create one multiline Annotation per sentence detected.

Extraction AI

[source]

Extract information from Documents.

Conventional template-matching-based approaches fail to generalize well to document images of unseen templates and are not robust against text recognition errors.

We follow the approach proposed by Sun et al. (2021) to encode both the visual and textual features of detected text regions as nodes of a graph whose edges represent the spatial relations between neighboring text regions. Their experiments validate that all information, including visual features, textual features and spatial relations, can benefit key information extraction.

We reduce the hardware requirements from one NVIDIA Titan X GPU with 12 GB of memory to one CPU and 16 GB of memory by splitting the end-to-end pipeline into two parts.

Sun, H., Kuang, Z., Yue, X., Lin, C., & Zhang, W. (2021). Spatial Dual-Modality Graph Reasoning for Key Information Extraction. arXiv. https://doi.org/10.48550/ARXIV.2103.14470

Base Model

class konfuzio_sdk.trainer.information_extraction.BaseModel

Base model to define common methods for all AIs.

abstract check_is_ready()

Check if the Model is ready for inference.

ensure_model_memory_usage_within_limit(max_ram: Optional[str] = None)

Ensure that the model does not exceed the allowed max_ram.

Parameters

max_ram (str) – Specify maximum memory usage condition to save model.

abstract static has_compatible_interface(other)

Validate that an instance of an AI implements the same interface defined by this AI class.

Parameters

other – An instance of an AI to compare with.

static load_model(pickle_path: str, max_ram: Optional[str] = None)

Load a previously saved instance of the model.

Parameters

pickle_path (str) – Path to the pickled model.

Raises
  • FileNotFoundError – If the path is invalid.

  • OSError – When the data is corrupted or invalid and cannot be loaded.

  • TypeError – When the loaded pickle isn’t recognized as a Konfuzio AI model.

Returns

Extraction AI model.

property name

Model class name.

name_lower()

Convert class name to machine-readable name.

abstract property pkl_file_path

Generate a path for a resulting pickle file.

reduce_model_weight()

Remove all non-strictly necessary parameters before saving.

save(output_dir: Optional[str] = None, include_konfuzio=True, reduce_weight=True, compression: str = 'lz4', keep_documents=False, max_ram=None)

Save the label model as a compressed pickle object to the release directory.

Saving is done by: getting the serialized pickle object (via cloudpickle), “optimizing” the serialized object with the built-in pickletools.optimize function (see: https://docs.python.org/3/library/pickletools.html), and saving the optimized serialized object.

We then compress the pickle file using shutil.copyfileobj, which writes in chunks to avoid loading the entire pickle file in memory.

Finally, we delete the cloudpickle file and are left with the compressed pickle file which has a .pkl.lz4 or .pkl.bz2 extension.

For more info on pickle serialization and including dependencies read https://github.com/cloudpipe/cloudpickle#overriding-pickles-serialization-mechanism-for-importable-constructs

Parameters
  • output_dir – Folder to save AI model in. If None, the default Project folder is used.

  • include_konfuzio – Enables pickle serialization as a value, not as a reference.

  • reduce_weight – Remove all non-strictly necessary parameters before saving.

  • compression – Compression algorithm to use. Default is lz4, bz2 is also supported.

  • max_ram – Specify maximum memory usage condition to save model.

Raises

MemoryError – When the size of the model in memory is greater than the maximum value.

Returns

Path of the saved model file.

abstract property temp_pkl_file_path

Generate a path for temporary pickle file.
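
A sketch of the save/load round trip, assuming pipeline is an already trained AI instance (for example an RFExtractionAI, described below):

    # Save as a compressed pickle (.pkl.lz4) and load it again.
    pickle_path = pipeline.save(reduce_weight=True, compression='lz4')
    model = type(pipeline).load_model(pickle_path)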

AbstractExtractionAI

class konfuzio_sdk.trainer.information_extraction.AbstractExtractionAI(category: konfuzio_sdk.data.Category, *args, **kwargs)

Parent class for all Extraction AIs, to extract information from unstructured human-readable text.

static add_extractions_as_annotations(extractions: pandas.core.frame.DataFrame, document: konfuzio_sdk.data.Document, label: konfuzio_sdk.data.Label, label_set: konfuzio_sdk.data.LabelSet, annotation_set: konfuzio_sdk.data.AnnotationSet) None

Add the extraction of a model to the document.

check_is_ready()

Check if the ExtractionAI is ready for the inference.

It is assumed that the model is ready for extraction if a Category is set.

Raises

AttributeError – When no Category is specified.

evaluate()

Use as placeholder Function.

extract(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Perform preliminary extraction steps.

extraction_result_to_document(document: konfuzio_sdk.data.Document, extraction_result: dict) konfuzio_sdk.data.Document

Return a virtual Document annotated with AI Model output.

fit()

Use as placeholder Function because the Abstract AI does not train a classifier.

static flush_buffer(buffer: List[pandas.core.series.Series], doc_text: str) Dict

Merge a buffer of entities into a dictionary (which will eventually be turned into a DataFrame).

A buffer is a list of pandas.Series objects.

static has_compatible_interface(other) bool

Validate that an instance of an Extraction AI implements the same interface as AbstractExtractionAI.

An Extraction AI should implement methods with the same signature as:
  • AbstractExtractionAI.__init__

  • AbstractExtractionAI.fit

  • AbstractExtractionAI.extract

  • AbstractExtractionAI.check_is_ready

Parameters

other – An instance of an Extraction AI to compare with.

static is_valid_horizontal_merge(row: pandas.core.series.Series, buffer: List[pandas.core.series.Series], doc_text: str, max_offset_distance: int = 5) bool

Verify if the merging that we are trying to do is valid.

A merging is valid only if:
  • All spans have the same predicted Label

  • Confidence of predicted Label is above the Label threshold

  • All spans are on the same line

  • No extraneous characters in between spans

  • A maximum of 5 spaces in between spans

  • The Label type is not one of the following: ‘Number’, ‘Positive Number’, ‘Percentage’, ‘Date’; OR the resulting merge creates a Span normalizable to the same type

Parameters
  • row – Row candidate to be merged to what is already in the buffer.

  • buffer – Previous information.

  • doc_text – Text of the document.

  • max_offset_distance – Maximum distance between two entities that can be merged.

Returns

If the merge is valid or not.

static load_model(pickle_path: str, max_ram: Optional[str] = None)

Load the model and check if it has the interface compatible with the class.

Parameters

pickle_path (str) – Path to the pickled model.

Raises
  • FileNotFoundError – If the path is invalid.

  • OSError – When the data is corrupted or invalid and cannot be loaded.

  • TypeError – When the loaded pickle isn’t recognized as a Konfuzio AI model.

Returns

Extraction AI model.

classmethod merge_horizontal(res_dict: Dict, doc_text: str) Dict

Merge contiguous Spans with the same predicted Label.

See more details at https://dev.konfuzio.com/sdk/explanations.html#horizontal-merge

property pkl_file_path: str

Generate a path for a resulting pickle file.

property project

Get RFExtractionAI Project.

property temp_pkl_file_path: str

Generate a path for temporary pickle file.

Random Forest Extraction AI

class konfuzio_sdk.trainer.information_extraction.RFExtractionAI(n_nearest: int = 2, first_word: bool = True, n_estimators: int = 100, max_depth: int = 100, no_label_limit: Optional[Union[int, float]] = None, n_nearest_across_lines: bool = False, use_separate_labels: bool = True, category: Optional[konfuzio_sdk.data.Category] = None, tokenizer=None, *args, **kwargs)

Encode visual and textual features to extract text regions.

Fit an extraction pipeline to extract linked Annotations.

Both Label and Label Set classifiers are using a RandomForestClassifier from scikit-learn to run in a low memory and single CPU environment. A random forest classifier is a group of decision trees classifiers, see: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

The parameters of this class allow selecting the Tokenizer, configuring the Label and Label Set classifiers, and selecting the type of features used by the Label and Label Set classifiers.

They are divided into:
  • tokenizer selection

  • parametrization of the Label classifier

  • parametrization of the Label Set classifier

  • features for the Label classifier

  • features for the Label Set classifier

By default, the text of the Documents is split into smaller chunks of text based on whitespaces (‘WhitespaceTokenizer’), meaning that all words present in the text will be shown to the AI. The splitting can instead be based on regexes learned from the Spans of the Annotations of the Category (‘tokenizer_regex’), or on a model from the spaCy library for the German language (‘tokenizer_spacy’). Another option is to use a pre-defined list of tokenizers based on regexes (‘tokenizer_regex_list’) and, on top of the pre-defined list, to create tokenizers that match what those miss (‘tokenizer_regex_combination’).

Some parameters of the scikit-learn RandomForestClassifier used for the Label and/or Label Set classifier can be set directly in Konfuzio Server (‘label_n_estimators’, ‘label_max_depth’, ‘label_class_weight’, ‘label_random_state’, ‘label_set_n_estimators’, ‘label_set_max_depth’).

Features are measurable pieces of data of the Annotation. By default, a combination of features is used that includes features built from the text of the Annotation (‘string_features’), features built from the position of the Annotation in the Document (‘spatial_features’) and features from the Spans created by a WhitespaceTokenizer on the left or on the right of the Annotation (‘n_nearest_left’, ‘n_nearest_right’, ‘n_nearest_across_lines’). It is possible to exclude any of them (‘spatial_features’, ‘string_features’, ‘n_nearest_left’, ‘n_nearest_right’) or to specify the number of Spans created by a WhitespaceTokenizer to consider (‘n_nearest_left’, ‘n_nearest_right’).

While extracting, the Label Set classifier takes the predictions from the Label classifier as input. The Label Set classifier groups them into Annotation sets.
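
A condensed training-and-extraction sketch based on the description above; YOUR_PROJECT_ID and YOUR_CATEGORY_ID are placeholders, and the df_train/label_feature_list attribute names are assumptions consistent with feature_function below:

    from konfuzio_sdk.data import Project
    from konfuzio_sdk.tokenizer.regex import WhitespaceTokenizer
    from konfuzio_sdk.trainer.information_extraction import RFExtractionAI

    project = Project(id_=YOUR_PROJECT_ID)  # placeholder
    category = project.get_category_by_id(YOUR_CATEGORY_ID)  # placeholder
    pipeline = RFExtractionAI(category=category, tokenizer=WhitespaceTokenizer())
    pipeline.documents = category.documents()
    pipeline.test_documents = category.test_documents()
    # Build per-Span features, train the Label and Label Set classifiers,
    # then evaluate the full pipeline and extract from a test Document.
    pipeline.df_train, pipeline.label_feature_list = pipeline.feature_function(documents=pipeline.documents)
    pipeline.fit()
    evaluation = pipeline.evaluate_full()
    result = pipeline.extract(document=pipeline.test_documents[0])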

check_is_ready()

Check if the ExtractionAI is ready for the inference.

It is assumed that the model is ready if a Tokenizer and a Category are set, and the classifiers are set and trained.

Raises
  • AttributeError – When no Tokenizer is specified.

  • AttributeError – When no Category is specified.

  • AttributeError – When no Label Classifier has been provided.

evaluate_clf(use_training_docs: bool = False) konfuzio_sdk.evaluate.ExtractionEvaluation

Evaluate the Label classifier.

evaluate_full(strict: bool = True, use_training_docs: bool = False, use_view_annotations: bool = True) konfuzio_sdk.evaluate.ExtractionEvaluation

Evaluate the full pipeline on the pipeline’s Test Documents.

Parameters
  • strict – Evaluate on a Character exact level without any postprocessing.

  • use_training_docs – Bool for whether to evaluate on the training documents instead of testing documents.

Returns

Evaluation object.

evaluate_label_set_clf(use_training_docs: bool = False) konfuzio_sdk.evaluate.ExtractionEvaluation

Evaluate the LabelSet classifier.

evaluate_tokenizer(use_training_docs: bool = False) konfuzio_sdk.evaluate.ExtractionEvaluation

Evaluate the tokenizer.

extract(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Infer information from a given Document.

Parameters

document – Document object

Returns

Document with predicted labels

Raises

  • AttributeError – When a Tokenizer is missing.

  • NotFittedError – When the classifier (CLF) is not fitted.

extract_from_df(df: pandas.core.frame.DataFrame, inference_document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document

Predict Labels from features.

feature_function(documents: List[konfuzio_sdk.data.Document], no_label_limit: Optional[Union[int, float]] = None, retokenize: Optional[bool] = None, require_revised_annotations: bool = False) Tuple[List[pandas.core.frame.DataFrame], list]

Calculate features per Span of Annotations.

Parameters
  • documents – List of Documents to extract features from.

  • no_label_limit – Int or Float to limit number of new Annotations to create during tokenization.

  • retokenize – Bool for whether to recreate Annotations from scratch or use already existing Annotations.

  • require_revised_annotations – Only allow calculation of features if no unrevised Annotation present.

Returns

Dataframe of features and list of feature names.

features(document: konfuzio_sdk.data.Document)

Calculate features using the best working default values that can be overwritten with self values.

filter_dataframe(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame

Filter dataframe rows according to the confidence value.

Rows (extractions) where the confidence value is below the threshold defined for the Label are removed.

Parameters

df – Dataframe with extraction results

Returns

Filtered dataframe

filter_low_confidence_extractions(result: Dict) Dict

Remove extractions with confidence below the threshold defined for the respective label.

The input is a dictionary where the values can be:
  • a dataframe

  • a dictionary where the values are dataframes

  • a list of dictionaries where the values are dataframes

Parameters

result – Extraction results

Returns

Filtered dictionary.

fit() sklearn.ensemble._forest.RandomForestClassifier

Given training data and the feature list, this function returns the trained classifier.

label_train_document(virtual_document: konfuzio_sdk.data.Document, original_document: konfuzio_sdk.data.Document)

Assign Labels to Annotations in newly tokenized virtual training Document.

merge_vertical(document: konfuzio_sdk.data.Document, only_multiline_labels=True)

Merge Annotations with the same Label.

See more details at https://dev.konfuzio.com/sdk/explanations.html#vertical-merge

Parameters
  • document – Document whose Annotations should be merged vertically

  • only_multiline_labels – Only merge if a multiline Label Annotation is in the Category Training set

merge_vertical_like(document: konfuzio_sdk.data.Document, template_document: konfuzio_sdk.data.Document)

Merge Annotations the same way as in another copy of the same Document.

All single-Span Annotations in the current Document (self) are matched with corresponding multi-line Spans in the given Document and are merged in the same way. The Label of the new multi-line Annotations is taken to be the most common Label among the original single-line Annotations that are being merged.

Parameters

document – Document with multi-line Annotations

reduce_model_weight()

Remove all non-strictly necessary parameters before saving.

remove_empty_dataframes_from_extraction(result: Dict) Dict

Remove empty dataframes from the result of an Extraction AI.

The input is a dictionary where the values can be:
  • a dataframe

  • a dictionary where the values are dataframes

  • a list of dictionaries where the values are dataframes

property requires_segmentation: bool

Return True if the Extraction AI requires detectron segmentation results to process Documents.

separate_labels(res_dict: Dict) Dict

Undo the renaming of the labels.

In this way we have the output of the extraction in the correct format.

Categorization AI

[source]

Implements a Categorization Model.

Abstract Categorization AI

class konfuzio_sdk.trainer.document_categorization.AbstractCategorizationAI(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)

Abstract definition of a CategorizationAI.

categorize(document: konfuzio_sdk.data.Document, recategorize: bool = False, inplace: bool = False) konfuzio_sdk.data.Document

Run categorization on a Document.

Parameters
  • document – Input Document

  • recategorize – If the input Document is already categorized, the already present Category is used unless this flag is True

  • inplace – Option to categorize the provided Document in place, which would assign the Category attribute

Returns

Copy of the input Document with added CategoryAnnotation information

check_is_ready()

Check if Categorization AI instance is ready for inference.

It is assumed that the model is ready when there is at least one Category passed as the input.

Raises

AttributeError – When no Categories are passed into the model.

evaluate(use_training_docs: bool = False) konfuzio_sdk.evaluate.CategorizationEvaluation

Evaluate the full Categorization pipeline on the pipeline’s Test Documents.

Parameters

use_training_docs – Bool for whether to evaluate on the Training Documents instead of Test Documents.

Returns

Evaluation object.

abstract fit() None

Train the Categorization AI.

static has_compatible_interface(other)

Validate that an instance of a Categorization AI implements the same interface as AbstractCategorizationAI.

A Categorization AI should implement methods with the same signature as:
  • AbstractCategorizationAI.__init__

  • AbstractCategorizationAI.fit

  • AbstractCategorizationAI._categorize_page

  • AbstractCategorizationAI.check_is_ready

Parameters

other – An instance of a Categorization AI to compare with.

static load_model(pickle_path: str, device='cpu')

Load the model and check if it has the interface compatible with the class.

Parameters

pickle_path (str) – Path to the pickled model.

Raises
  • FileNotFoundError – If the path is invalid.

  • OSError – When the data is corrupted or invalid and cannot be loaded.

  • TypeError – When the loaded pickle isn’t recognized as a Konfuzio AI model.

Returns

Categorization AI model.

name_lower()

Convert class name to machine-readable name.

property pkl_file_path: str

Generate a path for a resulting pickle file.

Returns

A string with the path.

abstract save(output_dir: str, include_konfuzio=True)

Save the model to disk.

property temp_pkl_file_path: str

Generate a path for temporary pickle file.

Returns

A string with the path.

Name-based Categorization AI

class konfuzio_sdk.trainer.document_categorization.NameBasedCategorizationAI(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)

A simple, non-trainable model that predicts a Category for a given Document based on a predefined rule.

It checks whether the name of the Category is present in the input Document (case insensitive; also see Category.fallback_name). This can be an effective fallback logic to categorize Documents when no Categorization AI is available.

fit() None

Use as placeholder Function because there’s no classifier to be trained.

save(output_dir: str, include_konfuzio=True)

Use as placeholder Function.
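
A usage sketch, assuming project and test_document are already initialized:

    from konfuzio_sdk.trainer.document_categorization import NameBasedCategorizationAI

    categorization_model = NameBasedCategorizationAI(project.categories)
    result_doc = categorization_model.categorize(document=test_document)
    for page in result_doc.pages():
        print(f'Page {page.number} predicted as: {page.category}')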

Model-based Categorization AI

class konfuzio_sdk.trainer.document_categorization.CategorizationAI(categories: List[konfuzio_sdk.data.Category], use_cuda: bool = False, *args, **kwargs)

A trainable AI that predicts a Category for each Page of a given Document.

build_document_classifier_iterator(documents, transforms, use_image: bool, use_text: bool, shuffle: bool, batch_size: int, max_len: int, device='cpu') torch.utils.data.dataloader.DataLoader

Prepare the data necessary for the document classifier, and build the iterators for the data list.

Each Document is split into Pages, and from each Page we take:
  • the path to an image of the page

  • the tokenized and numericalized text on the page

  • the label (category) of the page

  • the id of the document

  • the page number

build_preprocessing_pipeline(use_image: bool, image_augmentation=None, image_preprocessing=None) None

Set up the pre-processing and data augmentation when necessary.

build_template_category_vocab() konfuzio_sdk.tokenizer.base.Vocab

Build a vocabulary over the Categories.

build_text_vocab(min_freq: int = 1, max_size: Optional[int] = None) konfuzio_sdk.tokenizer.base.Vocab

Build a vocabulary over the document text.

property compressed_file_path: str

Generate a path for a resulting compressed file in .lz4 format.

Returns

A string with the path.

fit(max_len: Optional[bool] = None, batch_size: int = 1, **kwargs) Dict[str, List[float]]

Fit the CategorizationAI classifier.

reduce_model_weight()

Reduce the size of the model by running lose_weight on the tokenizer.

save(output_dir: Optional[str] = None, reduce_weight: bool = True, **kwargs) str

Save only the necessary parts of the model for extraction/inference.

Saves:
  • tokenizer (needed to ensure we tokenize inference examples in the same way that they are trained)

  • transforms (to ensure we transform/pre-process images in the same way as training)

  • vocabs (to ensure the tokens/labels are mapped to the same integers as training)

  • configs (to ensure we load the same models used in training)

  • state_dicts (the classifier parameters achieved through training)

Note: “path” is a deprecated parameter, “output_dir” is used for the sake of uniformity across all AIs.

Parameters
  • output_dir (str) – A path to save the model to.

  • reduce_weight (bool) – Reduces the weight of a model by removing Documents and reducing weight of a Tokenizer.

property temp_pt_file_path: str

Generate a path for a temporary model file in .pt format.

Returns

A string with the path.

Build a Model-based Categorization AI

konfuzio_sdk.trainer.document_categorization.build_categorization_ai_pipeline(categories: List[konfuzio_sdk.data.Category], documents: List[konfuzio_sdk.data.Document], test_documents: List[konfuzio_sdk.data.Document], tokenizer: Optional[konfuzio_sdk.tokenizer.base.AbstractTokenizer] = None, image_model_name: Optional[konfuzio_sdk.trainer.document_categorization.ImageModel] = None, text_model_name: Optional[konfuzio_sdk.trainer.document_categorization.TextModel] = TextModel.NBOW, **kwargs) konfuzio_sdk.trainer.document_categorization.CategorizationAI

Build a Categorization AI neural network by choosing an ImageModel and a TextModel.

See an in-depth tutorial at https://dev.konfuzio.com/sdk/tutorials/data_validation/index.html
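
A sketch of building the pipeline, assuming project, training_documents and test_documents are already initialized; the enum member names (ImageModel.EfficientNetB0, TextModel.NBOWSelfAttention) are assumptions that may vary between SDK versions:

    from konfuzio_sdk.trainer.document_categorization import (
        ImageModel,
        TextModel,
        build_categorization_ai_pipeline,
    )

    categorization_pipeline = build_categorization_ai_pipeline(
        categories=project.categories,
        documents=training_documents,
        test_documents=test_documents,
        image_model_name=ImageModel.EfficientNetB0,   # assumed enum member
        text_model_name=TextModel.NBOWSelfAttention,  # assumed enum member
    )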

NBOW Model

class konfuzio_sdk.trainer.document_categorization.NBOW(input_dim: int, emb_dim: int = 64, dropout_rate: float = 0.0, **kwargs)

The neural bag-of-words (NBOW) model is the simplest of models: it passes each token through an embedding layer.

As shown in the fastText paper (https://arxiv.org/abs/1607.01759) this model is still able to achieve comparable performance to some deep learning models whilst being considerably faster.

One downside of this model is that tokens are embedded without regard to the surrounding context in which they appear; e.g. the embeddings for “May” in the two sentences “May I speak to you?” and “I am leaving on the 1st of May” are identical, even though they have different semantics.

Parameters
  • emb_dim – The dimensions of the embedding vector.

  • dropout_rate – The amount of dropout applied to the embedding vectors.

NBOW Self Attention Model

class konfuzio_sdk.trainer.document_categorization.NBOWSelfAttention(input_dim: int, emb_dim: int = 64, n_heads: int = 8, dropout_rate: float = 0.0, **kwargs)

This is an NBOW model with a multi-headed self-attention layer, which is added after the embedding layer.

See details at https://arxiv.org/abs/1706.03762. The self-attention layer effectively contextualizes the output as now each hidden state is calculated from the embedding vector of a token and the embedding vector of all other tokens within the sequence.

Parameters
  • emb_dim – The dimensions of the embedding vector.

  • dropout_rate – The amount of dropout applied to the embedding vectors.

  • n_heads – The number of attention heads to use in the multi-headed self-attention layer. Note that n_heads must be a factor of emb_dim, i.e. emb_dim % n_heads == 0.

LSTM Model

class konfuzio_sdk.trainer.document_categorization.LSTM(input_dim: int, emb_dim: int = 64, hid_dim: int = 256, n_layers: int = 2, bidirectional: bool = True, dropout_rate: float = 0.0, **kwargs)

The LSTM (long short-term memory) is a variant of an RNN (recurrent neural network).

It feeds the input tokens through an embedding layer and then processes them sequentially with the LSTM, outputting a hidden state for each token. If the LSTM is bidirectional then it trains a forward and backward LSTM per layer and concatenates the forward and backward hidden states for each token.

Parameters
  • emb_dim – The dimensions of the embedding vector.

  • hid_dim – The dimensions of the hidden states.

  • n_layers – How many LSTM layers to use.

  • bidirectional – If the LSTM should be bidirectional.

  • dropout_rate – The amount of dropout applied to the embedding vectors and between LSTM layers if n_layers > 1.

BERT Model

class konfuzio_sdk.trainer.document_categorization.BERT(name: str = 'bert-base-german-cased', freeze: bool = False, **kwargs)

Wraps around pre-trained BERT-type models from the HuggingFace library.

BERT (bidirectional encoder representations from Transformers) is a family of large Transformer models. The available BERT variants are all pre-trained models provided by the transformers library. It is usually infeasible to train a BERT model from scratch due to the significant amount of computation required. However, the pre-trained models can be easily fine-tuned on desired data.

The BERT variants, i.e. name arguments, that are covered by internal tests are:
  • bert-base-german-cased

  • bert-base-german-dbmdz-cased

  • bert-base-german-dbmdz-uncased

  • distilbert-base-german-cased

In theory, all variants beginning with bert-base-* and distilbert-* should work out of the box. Other BERT variants come with no guarantees.

Parameters
  • name – The name of the pre-trained BERT variant to use.

  • freeze – Should the BERT model be frozen, i.e. the pre-trained parameters are not updated.

get_max_length()

Get the maximum length of a sequence that can be passed to the BERT module.

VGG Model

class konfuzio_sdk.trainer.document_categorization.VGG(name: str = 'vgg11', pretrained: bool = True, freeze: bool = True, **kwargs)

The VGG family of models are image classification models designed for ImageNet.

They are usually used as a baseline in image classification tasks; however, they are considerably larger, in terms of the number of parameters, than modern architectures.

Available variants are: vgg11, vgg13, vgg16, vgg19, vgg11_bn, vgg13_bn, vgg16_bn, vgg19_bn. The number generally indicates the number of layers in the model; higher does not always mean better. The _bn suffix means that the VGG model uses Batch Normalization layers, which generally leads to better results.

The pre-trained weights are taken from the torchvision library (https://github.com/pytorch/vision) and are weights from a model that has been trained as an image classifier on ImageNet. Ideally, this means the images should be 3-channel color images that are at least 224x224 pixels and should be normalized.

Parameters
  • name – The name of the VGG variant to use

  • pretrained – If pre-trained weights for the VGG variant should be used

  • freeze – If the parameters of the VGG variant should be frozen

EfficientNet Model

class konfuzio_sdk.trainer.document_categorization.EfficientNet(name: str = 'efficientnet_b0', pretrained: bool = True, freeze: bool = True, **kwargs)

EfficientNet is a family of convolutional neural network based models that are designed to be more efficient.

The efficiency comes in terms of the number of parameters and FLOPS compared to previous computer vision models, whilst maintaining equivalent image classification performance.

Available variants are: efficientnet_b0, efficientnet_b1, …, efficientnet_b7, with b0 having the fewest parameters and b7 the most.

The pre-trained weights are taken from the timm library and have been trained on ImageNet, thus the same tips, i.e. normalization, that apply to the VGG models also apply here.

Parameters
  • name – The name of the EfficientNet variant to use

  • pretrained – If pre-trained weights for the EfficientNet variant should be used

  • freeze – If the parameters of the EfficientNet variant should be frozen

get_n_features() int

Calculate number of output features based on given model.

Multimodal Concatenation

class konfuzio_sdk.trainer.document_categorization.MultimodalConcatenate(n_image_features: int, n_text_features: int, hid_dim: int = 256, output_dim: Optional[int] = None, **kwargs)

Defines how the image and text features are combined in order to yield a categorization prediction.

File Splitting AI

[source]

Process Documents that consist of several files and propose splitting them into the Sub-Documents accordingly.

Abstract File Splitting Model

class konfuzio_sdk.trainer.file_splitting.AbstractFileSplittingModel(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)

Abstract class for the File Splitting model.

abstract fit(*args, **kwargs)

Fit the custom model on the training Documents.

static has_compatible_interface(other) bool

Validate that an instance of a File Splitting Model implements the same interface as AbstractFileSplittingModel.

A File Splitting Model should implement methods with the same signature as:
  • AbstractFileSplittingModel.__init__

  • AbstractFileSplittingModel.predict

  • AbstractFileSplittingModel.fit

  • AbstractFileSplittingModel.check_is_ready

Parameters

other – An instance of a File Splitting Model to compare with.

static load_model(pickle_path: str, max_ram: Optional[str] = None)

Load the model and check if it has the interface compatible with the class.

Parameters

pickle_path (str) – Path to the pickled model.

Raises
  • FileNotFoundError – If the path is invalid.

  • OSError – When the data is corrupted or invalid and cannot be loaded.

  • TypeError – When the loaded pickle isn’t recognized as a Konfuzio AI model.

Returns

File Splitting AI model.

property pkl_file_path: str

Generate a path for a resulting pickle file.

Returns

A string with the path.

abstract predict(page: konfuzio_sdk.data.Page) konfuzio_sdk.data.Page

Take a Page as input and reassign its is_first_page attribute’s value if necessary.

Parameters

page (Page) – A Page to label first or non-first.

Returns

Page.

property temp_pkl_file_path: str

Generate a path for temporary pickle file.

Returns

A string with the path.

Context Aware File Splitting Model

class konfuzio_sdk.trainer.file_splitting.ContextAwareFileSplittingModel(categories: List[konfuzio_sdk.data.Category], tokenizer, *args, **kwargs)

A File Splitting Model that uses a context-aware logic.

Context-aware logic implies a rule-based approach that looks for common strings among the first Pages of each Category’s Documents.

check_is_ready()

Check that the File Splitting Model is ready for inference.

Raises
  • AttributeError – When no Tokenizer or no Categories were passed.

  • ValueError – When no Categories have _exclusive_first_page_strings.

fit(allow_empty_categories: bool = False, *args, **kwargs)

Gather the strings exclusive for first Pages in a given stream of Documents.

Exclusive means that each of these strings appear only on first Pages of Documents within a Category.

Parameters

allow_empty_categories (bool) – To allow returning an empty list for a Category if no exclusive first-page strings were found during fitting (which means prediction would be impossible for that Category).

Raises

ValueError – When allow_empty_categories is False and no exclusive first-page strings were found for at least one Category.

predict(page: konfuzio_sdk.data.Page) konfuzio_sdk.data.Page

Predict a Page as first or non-first.

Parameters

page (Page) – A Page to receive first or non-first label.

Returns

A Page with a newly predicted is_first_page attribute.

Multimodal File Splitting Model

class konfuzio_sdk.trainer.file_splitting.MultimodalFileSplittingModel(categories: List[konfuzio_sdk.data.Category], text_processing_model: str = 'nlpaueb/legal-bert-small-uncased', scale: int = 2, *args, **kwargs)

Split a multi-Document file into a list of shorter Documents based on the model’s prediction.

We use an approach suggested by Guha et al. (2022) that accepts separate visual and textual inputs, processes them independently via the VGG19 architecture and the LegalBERT model (essentially a BERT-type architecture trained on domain-specific data), and passes the resulting outputs together to a Multi-Layered Perceptron.

Guha, A., Alahmadi, A., Samanta, D., Khan, M. Z., & Alahmadi, A. H. (2022). A Multi-Modal Approach to Digital Document Stream Segmentation for Title Insurance Domain. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9684474

check_is_ready()

Check if Multimodal File Splitting Model instance is ready for inference.

This method checks that the instance of the Model has at least one Category passed as input and that it is fitted to run predictions.

Raises
  • AttributeError – When no Categories are passed to the model.

  • AttributeError – When a model is not fitted to run a prediction.

fit(epochs: int = 10, use_gpu: bool = False, train_batch_size=8, *args, **kwargs)

Process the train and test data, initialize and fit the model.

Parameters
  • epochs (int) – A number of epochs to train a model on.

  • use_gpu (bool) – Run training on GPU if available.

predict(page: konfuzio_sdk.data.Page, use_gpu: bool = False) konfuzio_sdk.data.Page

Run prediction with the trained model.

Parameters
  • page (Page) – A Page to be predicted as first or non-first.

  • use_gpu (bool) – Run prediction on GPU if available.

Returns

A Page with possible changes in is_first_page attribute value.

reduce_model_weight()

Remove all non-strictly necessary parameters before saving.

remove_dependencies()

Remove dependencies before saving.

This is needed for proper saving of the model in lz4 compressed format – if the dependencies are not removed, the resulting pickle will be impossible to load.

restore_dependencies()

Restore removed dependencies after loading.

This is needed for proper functioning of a loaded model because we have previously removed these dependencies upon saving the model.

Textual File Splitting Model

class konfuzio_sdk.trainer.file_splitting.TextualFileSplittingModel(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)

This model takes a multi-Document file as input and utilizes the DistilBERT model to make predictions regarding the segmentation of this file. Specifically, it aims to identify boundaries within the text where one Document ends and another begins, effectively splitting the input into a list of shorter Documents.

DistilBERT serves as the backbone of this model. DistilBERT offers a computationally efficient alternative to BERT, achieved through knowledge distillation while preserving much of BERT’s language understanding capabilities.

Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter https://arxiv.org/abs/1910.01108.

check_is_ready()

Check if Textual File Splitting Model instance is ready for inference.

This method checks that the instance of the Model has at least one Category passed as input and that it is fitted to run predictions.

Raises
  • AttributeError – When no Categories are passed to the model.

  • AttributeError – When a model is not fitted to run a prediction.

fit(epochs: int = 5, eval_batch_size: int = 8, train_batch_size: int = 8, device: str = 'cpu', *args, **kwargs)

Process the train and test data, initialize and fit the model.

Parameters
  • epochs (int) – A number of epochs to train a model on.

  • eval_batch_size (int) – A batch size for evaluation.

  • train_batch_size (int) – A batch size for training.

  • device (str) – A device to run the prediction on. Possible values are ‘mps’, ‘cuda’, ‘cpu’.

Returns

A dictionary with evaluation results.

predict(page: konfuzio_sdk.data.Page, previous_page: Optional[konfuzio_sdk.data.Page] = None, device: str = 'cpu', *args, **kwargs) konfuzio_sdk.data.Page

Run prediction with the trained model.

Parameters
  • page (Page) – A Page to be predicted as first or non-first.

  • previous_page – The previous Page which would help give more context to the model

  • device (str) – A device to run the prediction on. Possible values are ‘mps’, ‘cuda’, ‘cpu’.

Returns

A Page with possible changes in is_first_page attribute value.

reduce_model_weight()

Remove all non-strictly necessary parameters before saving.

static remove_dependencies()

Remove dependencies before saving.

This is needed for proper saving of the model in lz4 compressed format – if the dependencies are not removed, the resulting pickle will be impossible to load.

static restore_dependencies()

Restore removed dependencies after loading.

This is needed for proper functioning of a loaded model because we have previously removed these dependencies upon saving the model.

Splitting AI

class konfuzio_sdk.trainer.file_splitting.SplittingAI(model)

Split a given Document and return a list of resulting shorter Documents.

evaluate_full(use_training_docs: bool = False, zero_division='warn') konfuzio_sdk.evaluate.FileSplittingEvaluation

Evaluate the Splitting AI’s performance.

Parameters
  • use_training_docs (bool) – If enabled, runs evaluation on the training data to define its quality; if disabled, runs evaluation on the test data.

  • zero_division – Defines how to handle situations when precision, recall or F1 measure calculations result in zero division. Possible values: ‘warn’ – log a warning and assign the calculated metric a value of 0; 0 – assign the calculated metric a value of 0; ‘error’ – raise a ZeroDivisionError; None – assign None to the calculated metric.

Returns

Evaluation information for the model.

propose_split_documents(document: konfuzio_sdk.data.Document, return_pages: bool = False, inplace: bool = False, split_on_blank_pages: bool = False, device: str = 'cpu') List[konfuzio_sdk.data.Document]

Propose a set of resulting Documents from a single Document.

Parameters
  • document (Document) – An input Document to be split.

  • inplace (bool) – Whether changes are applied to the input Document, changing it, or to a deepcopy of it.

  • return_pages (bool) – A flag to enable returning a copy of an old Document with Pages marked .is_first_page on splitting points instead of a set of Sub-Documents.

  • split_on_blank_pages (bool) – A flag to enable splitting on blank Pages.

Returns

A list of suggested new Sub-Documents built from the original Document, or a list with a Document with Pages marked .is_first_page on splitting points.
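
A sketch wiring a trained File Splitting model into the Splitting AI, assuming project and document are already initialized; the ConnectedTextTokenizer is one plausible tokenizer choice:

    from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer
    from konfuzio_sdk.trainer.file_splitting import ContextAwareFileSplittingModel, SplittingAI

    model = ContextAwareFileSplittingModel(
        categories=project.categories, tokenizer=ConnectedTextTokenizer()
    )
    model.fit(allow_empty_categories=True)
    splitting_ai = SplittingAI(model)
    proposed_documents = splitting_ai.propose_split_documents(document)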

AI Evaluation

[source]

Extraction AI Evaluation

class konfuzio_sdk.evaluate.ExtractionEvaluation(documents: List[Tuple[konfuzio_sdk.data.Document, konfuzio_sdk.data.Document]], strict: bool = True, use_view_annotations: bool = True, ignore_below_threshold: bool = True, zero_division='warn')

Calculate accuracy measures by using a detailed comparison on Span level.

calculate()

Calculate and update the data stored within this Evaluation.

clf_f1(search=None) Optional[float]

Calculate the F1 Score of the Label classifier.

Parameters

search – Parameter used to calculate the value for one Data object.

clf_fn(search=None) int

Return the Label classifier False Negatives of all Spans.

clf_fp(search=None) int

Return the Label classifier False Positives of all Spans.

clf_tp(search=None) int

Return the Label classifier True Positives of all Spans.

f1(search=None) Optional[float]

Calculate the F1 Score of one class.

Please note: as suggested by Opitz and Burst (2021), use the arithmetic mean over individual F1 scores.

“F1 is often used with the intention to assign equal weight to frequent and infrequent classes, we recommend evaluating classifiers with F1 (the arithmetic mean over individual F1 scores), which is significantly more robust towards the error type distribution.”

Opitz, Juri, and Sebastian Burst. “Macro F1 and Macro F1.” arXiv preprint arXiv:1911.03347 (2021). https://arxiv.org/pdf/1911.03347.pdf

Parameters

search – Parameter used to calculate the value for one class.

Example:
  1. If you have three Documents, calculate the F-1 Score per Document and use the arithmetic mean.

  2. If you have three Labels, calculate the F-1 Score per Label and use the arithmetic mean.

  3. If you have three Labels and three Documents, calculate six F-1 Scores and use the arithmetic mean.
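
A sketch of the search parameter in use, assuming evaluation is an ExtractionEvaluation (e.g. from evaluate_full) and my_label/my_document are initialized Data objects; passing such objects to search is an assumption consistent with the examples above:

    global_f1 = evaluation.f1(search=None)           # arithmetic mean over all classes
    label_f1 = evaluation.f1(search=my_label)        # F1 Score for one Label
    document_f1 = evaluation.f1(search=my_document)  # F1 Score for one Document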

fn(search=None) int

Return the False Negatives of all Spans.

fp(search=None) int

Return the False Positives of all Spans.

get_evaluation_data(search, allow_zero: bool = True) konfuzio_sdk.evaluate.EvaluationCalculator

Get precision, recall, f1, based on TP, FP, FN.

get_missing_vertical_merge()

Return Spans that should have been merged.

get_wrong_vertical_merge()

Return Spans that were wrongly merged vertically.

precision(search=None) Optional[float]

Calculate the Precision; see f1 for the handling of imbalanced classes.

recall(search=None) Optional[float]

Calculate the Recall; see f1 for the handling of imbalanced classes.

tn(search=None) int

Return the True Negatives of all Spans.

tokenizer_f1(search=None) Optional[float]

Calculate the F1 Score of the tokenizer.

Parameters

search – Parameter used to calculate the value for one Data object.

tokenizer_fn(search=None) int

Return the tokenizer False Negatives of all Spans.

tokenizer_fp(search=None) int

Return the tokenizer False Positives of all Spans.

tokenizer_tp(search=None) int

Return the tokenizer True Positives of all Spans.

tp(search=None) int

Return the True Positives of all Spans.

Categorization AI Evaluation

class konfuzio_sdk.evaluate.CategorizationEvaluation(categories: List[konfuzio_sdk.data.Category], documents: List[Tuple[konfuzio_sdk.data.Document, konfuzio_sdk.data.Document]], zero_division='warn')

Calculate evaluation measures for the classification task of Document categorization.

property actual_classes: List[int]

List of ground truth Category IDs.

calculate()

Calculate and update the data stored within this Evaluation.

property category_ids: List[int]

List of Category IDs as class labels.

property category_names: List[str]

List of Category names as class names.

confusion_matrix() pandas.core.frame.DataFrame

Confusion matrix.

f1(category: Optional[konfuzio_sdk.data.Category]) Optional[float]

Calculate the global F1 Score or filter it by one Category.

fn(category: Optional[konfuzio_sdk.data.Category] = None) int

Return the False Negatives of all Documents.

fp(category: Optional[konfuzio_sdk.data.Category] = None) int

Return the False Positives of all Documents.

get_evaluation_data(search: Optional[konfuzio_sdk.data.Category] = None, allow_zero: bool = True) konfuzio_sdk.evaluate.EvaluationCalculator

Get precision, recall, f1, based on TP, TN, FP, FN.

Parameters
  • search (Category) – A Category to filter for, or None for getting global evaluation results.

  • allow_zero (bool) – If true, will calculate None for precision and recall when the straightforward application of the formula would otherwise result in 0/0; raises ZeroDivisionError otherwise.

precision(category: Optional[konfuzio_sdk.data.Category]) Optional[float]

Calculate the global Precision or filter it by one Category.

property predicted_classes: List[int]

List of predicted Category IDs.

recall(category: Optional[konfuzio_sdk.data.Category]) Optional[float]

Calculate the global Recall or filter it by one Category.

tn(category: Optional[konfuzio_sdk.data.Category] = None) int

Return the True Negatives of all Documents.

tp(category: Optional[konfuzio_sdk.data.Category] = None) int

Return the True Positives of all Documents.

File Splitting AI Evaluation

class konfuzio_sdk.evaluate.FileSplittingEvaluation(ground_truth_documents: List[konfuzio_sdk.data.Document], prediction_documents: List[konfuzio_sdk.data.Document], zero_division='warn')

Evaluate the quality of the file splitting logic.

calculate()

Calculate metrics for the File Splitting logic.

calculate_metrics_by_category()

Calculate metrics by Category independently.

f1(search: Optional[konfuzio_sdk.data.Category] = None) float

Return F1-measure.

Parameters

search (Category) – display F1 measure within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.

fn(search: Optional[konfuzio_sdk.data.Category] = None) int

Return first Pages incorrectly predicted as non-first.

Parameters

search (Category) – display false negatives within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.

fp(search: Optional[konfuzio_sdk.data.Category] = None) int

Return non-first Pages incorrectly predicted as first.

Parameters

search (Category) – display false positives within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.

get_evaluation_data(search: Optional[konfuzio_sdk.data.Category] = None, allow_zero: bool = True) konfuzio_sdk.evaluate.EvaluationCalculator

Get precision, recall, f1, based on TP, TN, FP, FN.

Parameters
  • search (Category) – display true positives within a certain Category.

  • allow_zero (bool) – If true, will calculate None for precision and recall when the straightforward application of the formula would otherwise result in 0/0; raises ZeroDivisionError otherwise.

precision(search: Optional[konfuzio_sdk.data.Category] = None) float

Return precision.

Parameters

search (Category) – display precision within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.

recall(search: Optional[konfuzio_sdk.data.Category] = None) float

Return recall.

Parameters

search (Category) – display recall within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.

tn(search: Optional[konfuzio_sdk.data.Category] = None) int

Return non-first Pages predicted as non-first.

Parameters

search (Category) – display true negatives within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.

tp(search: Optional[konfuzio_sdk.data.Category] = None) int

Return correctly predicted first Pages.

Parameters

search (Category) – display true positives within a certain Category.

Raises

KeyError – When the Category in search is not present in the Project from which the Documents are.

Evaluation Calculator

class konfuzio_sdk.evaluate.EvaluationCalculator(tp: int = 0, fp: int = 0, fn: int = 0, tn: int = 0, zero_division='warn')

Calculate precision, recall, f1, based on TP, FP, FN.
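
A minimal example following the signature above:

    from konfuzio_sdk.evaluate import EvaluationCalculator

    calculator = EvaluationCalculator(tp=10, fp=2, fn=3, tn=100)
    print(calculator.precision)  # TP / (TP + FP) = 10 / 12
    print(calculator.recall)     # TP / (TP + FN) = 10 / 13
    print(calculator.f1)         # harmonic mean of precision and recall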

property f1: Optional[float]

Apply F1-score formula.

Raises

ZeroDivisionError – When precision and recall are 0 and zero_division is set to ‘error’

metrics_logging()

Log metrics.

property precision: Optional[float]

Apply precision formula.

Raises

ZeroDivisionError – When TP and FP are 0 and zero_division is set to ‘error’

property recall: Optional[float]

Apply recall formula.

Raises

ZeroDivisionError – When TP and FN are 0 and zero_division is set to ‘error’

konfuzio_sdk.evaluate.grouped(group, target: str)

Define which of the correct elements in the predicted group defines the “correct” group id_.

konfuzio_sdk.evaluate.compare(doc_a, doc_b, only_use_correct=False, use_view_annotations=False, ignore_below_threshold=False, strict=True) pandas.core.frame.DataFrame

Compare the Annotations of two potentially empty Documents with respect to all Annotations.

Parameters
  • doc_a – Document which is assumed to be correct

  • doc_b – Document which needs to be evaluated

  • only_use_correct – Unrevised feedback in doc_a is assumed to be correct.

  • use_view_annotations – Will filter for top confidence annotations. Only available when strict=True. When use_view_annotations=True, it will compare only the highest confidence extractions to the ground truth Annotations. When False (default), it compares all extractions to the ground truth Annotations. This setting is ignored when strict=False, as the Non-Strict Evaluation needs to compare all extractions. For more details see https://help.konfuzio.com/modules/extractions/index.html#evaluation

  • ignore_below_threshold – Ignore Annotations below detection threshold of the Label (only affects TNs)

  • strict – Evaluate on a character-exact level without any postprocessing, e.g. an amount Span “5,55 ” will not exactly match “5,55”

Raises

ValueError – When the Category differs.

Returns

Evaluation DataFrame
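
A usage sketch, assuming two already initialized Documents of the same Category:

    from konfuzio_sdk.evaluate import compare

    # doc_a holds the ground truth, doc_b the predictions to evaluate.
    evaluation_df = compare(doc_a=ground_truth_document, doc_b=predicted_document, strict=True)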

Trainer utils

[source]

Common utility functions and classes to be used for AI training.

LoggerCallback

class konfuzio_sdk.trainer.utils.LoggerCallback

Custom callback for logger.info to be used in Trainer.

This callback is called by Trainer at the end of every epoch to log metrics. It replaces calling print and tqdm and calls logger.info instead.

on_log(args, state, control, logs=None, **kwargs)

Log losses and metrics when training or evaluating using Trainer.

BalancedLossTrainer

class konfuzio_sdk.trainer.utils.BalancedLossTrainer(model: Optional[Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module]] = None, args: Optional[transformers.training_args.TrainingArguments] = None, data_collator: Optional[DataCollator] = None, train_dataset: Optional[torch.utils.data.dataset.Dataset] = None, eval_dataset: Optional[Union[torch.utils.data.dataset.Dataset, Dict[str, torch.utils.data.dataset.Dataset]]] = None, tokenizer: Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None, model_init: Optional[Callable[[], transformers.modeling_utils.PreTrainedModel]] = None, compute_metrics: Optional[Callable[[transformers.trainer_utils.EvalPrediction], Dict]] = None, callbacks: Optional[List[transformers.trainer_callback.TrainerCallback]] = None, optimizers: Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None), preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None)

Custom trainer with custom loss to leverage class weights.

compute_loss(model, inputs, return_outputs=False)

Compute weighted cross-entropy loss to compensate for unbalanced datasets.

log(logs: Dict[str, float]) None

Log logs on the various objects watching training.

Subclass and override this method to inject custom behavior.

Parameters

logs (Dict[str, float]) – The values to log.