API Reference¶
Reference guides are technical descriptions of the machinery and how to operate it. Reference material is information-oriented.
Data¶
Handle data from the API.
Span¶
- class konfuzio_sdk.data.Span(start_offset: int, end_offset: int, annotation: Optional[konfuzio_sdk.data.Annotation] = None, document: Optional[konfuzio_sdk.data.Document] = None, strict_validation: bool = True)
A Span is a sequence of characters or whitespaces without line break.
For more details see https://dev.konfuzio.com/sdk/explanations.html#span-concept
- bbox() konfuzio_sdk.data.Bbox
Calculate the bounding box of a text sequence.
- bbox_dict() Dict
Return Span Bbox info as a serializable Dict format for external integration with the Konfuzio Server.
- eval_dict()
Return any information needed to evaluate the Span.
- static get_sentence_from_spans(spans: Iterable[konfuzio_sdk.data.Span], punctuation=None) List[List[konfuzio_sdk.data.Span]]
Return a list of Spans corresponding to Sentences separated by Punctuation.
- property line_index: int
Return index of the line of the Span.
- property normalized
Normalize the offset string.
- property offset_string: Optional[str]
Calculate the offset string of a Span.
- property page: konfuzio_sdk.data.Page
Return Page of Span.
- regex()
Suggest a Regex for the offset string.
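The sentence grouping performed by get_sentence_from_spans can be illustrated with a minimal plain-Python sketch. This is not the SDK implementation; the helper name and the (start, end) tuple representation of a Span are assumptions for illustration only.

```python
# Illustrative sketch (not the SDK source): group (start, end) offset pairs
# into sentences, closing a sentence after any span whose text ends with
# sentence punctuation.
def group_spans_into_sentences(text, spans, punctuation=(".", "!", "?")):
    sentences, current = [], []
    for start, end in spans:
        current.append((start, end))
        if text[start:end].endswith(punctuation):
            sentences.append(current)
            current = []
    if current:  # trailing spans without closing punctuation
        sentences.append(current)
    return sentences

text = "Hello world. How are you?"
spans = [(0, 5), (6, 12), (13, 16), (17, 20), (21, 25)]
print(group_spans_into_sentences(text, spans))
# → [[(0, 5), (6, 12)], [(13, 16), (17, 20), (21, 25)]]
```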
Bbox¶
- class konfuzio_sdk.data.Bbox(x0: int, x1: int, y0: int, y1: int, page: konfuzio_sdk.data.Page, validation=BboxValidationTypes.ALLOW_ZERO_SIZE)
A bounding box relates to an area of a Document Page.
For more details see https://dev.konfuzio.com/sdk/explanations.html#bbox-concept
What constitutes a valid Bbox changes depending on the value of the validation param. If ALLOW_ZERO_SIZE (default), it allows bounding boxes to have zero width or height. This option is available for compatibility reasons, since some OCR engines can return character-level bboxes with zero width or height. If STRICT, it doesn’t allow zero-size bboxes. If DISABLED, it allows bboxes that have negative size, or coordinates beyond the Page bounds. For the default behaviour see https://dev.konfuzio.com/sdk/tutorials/data_validation/index.html
- Parameters
validation – One of ALLOW_ZERO_SIZE (default), STRICT, or DISABLED.
- property area
Return area covered by the Bbox.
- property bottom
Calculate the distance to the bottom of the Page.
- check_overlap(bbox: Union[konfuzio_sdk.data.Bbox, Dict]) bool
Verify if there’s overlap between two Bboxes.
- property dict_format: Dict
Obtain Bbox data as a dictionary.
- property document: konfuzio_sdk.data.Document
Get the Document the Bbox belongs to.
- classmethod from_image_size(x0, x1, y0, y1, page: konfuzio_sdk.data.Page) konfuzio_sdk.data.Bbox
Create a Bbox from image dimensions, based on the scaling of character Bboxes within the Document.
This method computes the coordinates of the bottom-left and top-right corners in a coordinate system where the y-axis is oriented from bottom to top, the x-axis is from left to right, and the scale is based on the page.
- Parameters
x0 – The x-coordinate of the top-left corner in an image-scaled system.
x1 – The x-coordinate of the bottom-right corner in an image-scaled system.
y0 – The y-coordinate of the top-left corner in an image-scaled system.
y1 – The y-coordinate of the bottom-right corner in an image-scaled system.
page – The Page object for reference in scaling.
- Returns
A Bbox object with rescaled dimensions.
- property top
Calculate the distance to the top of the Page.
- property x0_image
Get the x0 coordinate in the context of the Page image.
- property x1_image
Get the x1 coordinate in the context of the Page image.
- property y0_image
Get the y0 coordinate in the context of the Page image, in a top-down coordinate system.
- property y1_image
Get the y1 coordinate in the context of the Page image, in a top-down coordinate system.
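The check_overlap and from_image_size entries above involve two coordinate systems: per the from_image_size description, character Bboxes use a bottom-up y-axis scaled to the Page, while the *_image properties use a top-down y-axis scaled to the Page image. The sketch below illustrates both ideas in plain Python; the function names and tuple layout are illustrative assumptions, not the SDK source.

```python
# Illustrative sketch (not the SDK implementation) of Bbox overlap and the
# bottom-up Page <-> top-down image coordinate conversion.
def overlaps(a, b):
    """Axis-aligned overlap check for (x0, y0, x1, y1) boxes."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def to_image_coords(x0, y0, x1, y1, page_size, image_size):
    """Map a bottom-up Page bbox to a top-down image bbox."""
    page_w, page_h = page_size
    img_w, img_h = image_size
    sx, sy = img_w / page_w, img_h / page_h
    # Flip the y-axis: the Page origin is bottom-left, the image's is top-left.
    return (x0 * sx, (page_h - y1) * sy, x1 * sx, (page_h - y0) * sy)

print(overlaps((0, 0, 10, 10), (5, 5, 15, 15)))  # → True
print(to_image_coords(10, 20, 30, 40, (100, 200), (200, 400)))
```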
Annotation¶
- class konfuzio_sdk.data.Annotation(document: konfuzio_sdk.data.Document, annotation_set_id: Optional[int] = None, annotation_set: Optional[konfuzio_sdk.data.AnnotationSet] = None, label: Optional[Union[int, konfuzio_sdk.data.Label]] = None, label_set_id: Optional[int] = None, label_set: Union[None, konfuzio_sdk.data.LabelSet] = None, is_correct: bool = False, revised: bool = False, normalized=None, id_: Optional[int] = None, spans=None, accuracy: Optional[float] = None, confidence: Optional[float] = None, created_by: Optional[int] = None, revised_by: Optional[int] = None, translated_string: Optional[str] = None, custom_offset_string: bool = False, offset_string: Optional[str] = None, *args, **kwargs)
Combine one or more Spans and hold the information about the Label, Label Set and Annotation Set assigned to them.
For more details see https://dev.konfuzio.com/sdk/explanations.html#annotation-concept
- add_span(span: konfuzio_sdk.data.Span)
Add a Span to an Annotation incl. a duplicate check per Annotation.
- bbox() konfuzio_sdk.data.Bbox
Get Bbox encompassing all Annotation Spans.
- property bboxes: List[Dict]
Return the Bbox information for all Spans in serialized format.
This is useful for external integration (e.g. Konfuzio Server).
- delete(delete_online: bool = True) None
Delete Annotation.
- Parameters
delete_online – Whether the Annotation is deleted online or only locally.
- property end_offset: int
Legacy: One Annotation can have multiple end offsets.
- property eval_dict: List[dict]
Calculate the Span information to evaluate the Annotation.
- get_link()
Get link to the Annotation in the SmartView.
- property is_multiline: int
Calculate if Annotation spans multiple lines of text.
- property label_set: konfuzio_sdk.data.LabelSet
Return Label Set of Annotation.
- lose_weight()
Delete data of the instance.
- property normalize: str
Provide one normalized offset string due to legacy.
- property offset_string: List[str]
View the string representation of the Annotation.
- property page: konfuzio_sdk.data.Page
Return Page of Annotation.
- regex()
Return regex of this Annotation.
- regex_annotation_generator(regex_list) List[konfuzio_sdk.data.Span]
Build Spans without Labels by regexes.
- Returns
Return sorted list of Spans by start_offset
- save(label_set_id=None, annotation_set_id=None, document_annotations: Optional[list] = None) bool
Save Annotation online.
If there is already an Annotation in the same place as the current one, the current Annotation cannot be saved.
In that case, we get the id_ of the original one to be able to track it. Duplicates are detected by checking whether the offsets and Label match any Annotation online. To be sure that we are comparing with the information online, the Document needs to be up to date. The update can be done after the request (per Annotation), or the updated Annotations can be passed as input to the function (advisable when dealing with big Documents or Documents with many Annotations).
Specify label_set_id if you want to create an Annotation belonging to a new Annotation Set. Specify annotation_set_id if you want to add an Annotation to an existing Annotation Set. Do not specify both of them.
- Parameters
document_annotations – Annotations in the Document (list)
- Returns
True if new Annotation was created
- property spans: List[konfuzio_sdk.data.Span]
Return default entry to get all Spans of the Annotation.
- property start_offset: int
Legacy: One Annotation can have multiple start offsets.
- token_append(new_regex, regex_quality: int)
Append token if it is not a duplicate.
- tokens() List[str]
Create a list of potential tokens based on Spans of this Annotation.
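The multi-Span nature of an Annotation (see offset_string, is_multiline and spans above) can be illustrated with a small plain-Python analogue. The helper names and the (start, end) tuple representation are hypothetical, for illustration only.

```python
# Illustrative analogue (not the SDK class): an Annotation groups several
# (start, end) Spans; its offset string is one string per Span, and it is
# multi-line when its Spans sit on more than one text line.
def offset_strings(text, spans):
    return [text[s:e] for s, e in spans]

def is_multiline(text, spans):
    line_of = lambda offset: text.count("\n", 0, offset)  # line index of an offset
    lines = {line_of(s) for s, e in spans}
    return len(lines) > 1

text = "Name: Jane\nSurname: Doe"
spans = [(6, 10), (20, 23)]
print(offset_strings(text, spans))  # → ['Jane', 'Doe']
print(is_multiline(text, spans))    # → True
```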
Annotation Set¶
- class konfuzio_sdk.data.AnnotationSet(document, label_set: konfuzio_sdk.data.LabelSet, id_: Optional[int] = None, **kwargs)
An Annotation Set is a group of Annotations. The Labels of those Annotations refer to the same Label Set.
For more details see https://dev.konfuzio.com/sdk/explanations.html#annotation-set-concept
- annotations(use_correct: bool = True, ignore_below_threshold: bool = False)
All Annotations currently in this Annotation Set.
- property end_line_index: Optional[int]
Calculate ending line of this Annotation Set.
- property end_offset: Optional[int]
Calculate the end based on all Annotations above detection threshold currently in this AnnotationSet.
- property is_default: bool
Check if AnnotationSet is the default AnnotationSet of the Document.
- property start_line_index: Optional[int]
Calculate starting line of this Annotation Set.
- property start_offset: Optional[int]
Calculate the earliest start based on all Annotations above detection threshold in this AnnotationSet.
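The start_offset and end_offset properties above reduce to a min/max over the Annotations in the set that pass their detection threshold. A minimal sketch, assuming a simple dict representation of Annotations (not the SDK's):

```python
# Sketch of the Annotation Set offset logic described above: take the
# earliest start and the latest end over Annotations above threshold.
def set_offsets(annotations):
    """annotations: list of dicts with 'start', 'end', 'confidence', 'threshold'."""
    kept = [a for a in annotations if a["confidence"] >= a["threshold"]]
    if not kept:
        return None, None
    return min(a["start"] for a in kept), max(a["end"] for a in kept)

anns = [
    {"start": 5, "end": 9, "confidence": 0.9, "threshold": 0.1},
    {"start": 20, "end": 30, "confidence": 0.05, "threshold": 0.1},  # below threshold
    {"start": 12, "end": 15, "confidence": 0.4, "threshold": 0.1},
]
print(set_offsets(anns))  # → (5, 15)
```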
Label¶
- class konfuzio_sdk.data.Label(project: konfuzio_sdk.data.Project, id_: Optional[int] = None, text: Optional[str] = None, get_data_type_display: str = 'Text', text_clean: Optional[str] = None, description: Optional[str] = None, label_sets=None, has_multiple_top_candidates: bool = False, threshold: float = 0.1, *initial_data, **kwargs)
Group Annotations across Label Sets.
For more details see https://dev.konfuzio.com/sdk/explanations.html#label-concept
- add_label_set(label_set: konfuzio_sdk.data.LabelSet)
Add Label Set to label, if it does not exist.
- Parameters
label_set – Label Set to add
- annotations(categories: List[konfuzio_sdk.data.Category], use_correct=True, ignore_below_threshold=False) List[konfuzio_sdk.data.Annotation]
Return related Annotations. Consider that one Label can be used across Label Sets in multiple Categories.
- base_regex(category: konfuzio_sdk.data.Category, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None) str
Find the best combination of regex in the list of all regex proposed by Annotations.
- evaluate_regex(regex, category: konfuzio_sdk.data.Category, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None, regex_quality=0)
Evaluate a regex on Categories.
Type of regex allows you to group regex by generality
- Example:
Three Annotations about the birthdate in two Documents, and one regex to be evaluated:
- 1.doc: “My was born on the 12th of December 1980, you could also say 12.12.1980.” (2 Annotations)
- 2.doc: “My was born on 12.06.1997.” (1 Annotation)
- regex: dd.dd.dddd (without escaped characters, for easier reading)
Resulting stats:
- total_correct_findings: 2
- correct_label_annotations: 3
- total_findings: 2 → precision 100%
- num_docs_matched: 2
- Project.documents: 2 → Document recall 100%
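The arithmetic of this example can be reproduced in a few lines. This is an illustrative reconstruction of the stats, not the SDK's evaluate_regex implementation:

```python
import re

# Reproduce the worked example above: count regex findings across the two
# example Documents and derive precision and Document recall.
docs = [
    "My was born on the 12th of December 1980, you could also say 12.12.1980.",
    "My was born on 12.06.1997.",
]
correct_label_annotations = 3  # total birthdate Annotations across both Documents
pattern = re.compile(r"\d\d\.\d\d\.\d\d\d\d")

findings = [pattern.findall(doc) for doc in docs]
total_findings = sum(len(f) for f in findings)
total_correct_findings = total_findings  # here, every match is a true Annotation
num_docs_matched = sum(1 for f in findings if f)

precision = total_correct_findings / total_findings       # → 1.0 (100%)
document_recall = num_docs_matched / len(docs)            # → 1.0 (100%)
annotation_recall = total_correct_findings / correct_label_annotations  # → 2/3
print(precision, document_recall, annotation_recall)
```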
- find_regex(category: konfuzio_sdk.data.Category, max_findings_per_page=100) List[str]
Find the best combination of regex for Label with before and after context.
- get_probable_outliers(categories: List[konfuzio_sdk.data.Category], regex_search: bool = True, regex_worst_percentage: float = 0.1, confidence_search: bool = True, evaluation_data=None, normalization_search: bool = True) List[konfuzio_sdk.data.Annotation]
Get a list of Annotations that are outliers.
Outliers are determined by any of three checks, or a combination of them: they are found by the worst regexes, have the lowest confidence, and/or cannot be normalized by the data type of the given Label.
- Parameters
categories (List[Category]) – Categories under which the search is done.
regex_search (bool) – Enable search by top worst regexes.
regex_worst_percentage (float) – A % of Annotations returned by the regexes.
confidence_search (bool) – Enable search by the lowest-confidence Annotations.
normalization_search (bool) – Enable search by normalizing Annotations by the Label’s data type.
- Raises
ValueError – When all search options are disabled.
- get_probable_outliers_by_confidence(evaluation_data, confidence: float = 0.5) List[konfuzio_sdk.data.Annotation]
Get a list of Annotations with the lowest confidence.
This method iterates over the list of Categories, returning the top N Annotations with the lowest confidence score.
- Parameters
evaluation_data (ExtractionEvaluation instance) – An instance of the ExtractionEvaluation class that contains predicted confidence scores.
confidence (float) – A level of confidence below which the Annotations are returned.
- get_probable_outliers_by_normalization(categories: List[konfuzio_sdk.data.Category]) List[konfuzio_sdk.data.Annotation]
Get a list of Annotations that do not pass normalization by the data type.
This method iterates over the list of Categories, returning the Annotations that do not fit the data type of the Label (i.e. an attempt to normalize them by the Label’s data type returns None).
- Parameters
categories (List[Category]) – Categories under which the search is done.
- get_probable_outliers_by_regex(categories: List[konfuzio_sdk.data.Category], use_test_docs: bool = False, top_worst_percentage: float = 0.1) List[konfuzio_sdk.data.Annotation]
Get a list of Annotations that are identified by the least precise regular expressions.
This method iterates over the list of Categories and Annotations within each Category, collecting all the regexes associated with them. It then evaluates these regexes and collects the top worst ones (i.e., those with the least True Positives). For each of these top worst regexes, it returns the Annotations found by them but not by the best regex for that label, potentially identifying them as outliers.
To detect outlier Annotations with multi-Spans, the method iterates over all the multi-Span Annotations under the Label and checks each Span that was not detected by the aforementioned worst regexes. If it is not found by any other regex in the Project, the entire Annotation is considered a potential outlier.
- Parameters
categories (List[Category]) – A list of Category objects under which the search is conducted.
use_test_docs (bool) – Indicates whether the evaluation of the regular expressions occurs on test Documents or training Documents.
top_worst_percentage (float) – A threshold for determining what percentage of the worst regexes’ output to return.
- Returns
A list of Annotation objects identified by the least precise regular expressions.
- Return type
List[Annotation]
- has_multiline_annotations(categories: Optional[List[konfuzio_sdk.data.Category]] = None) bool
Return if any Label annotations are multi-line.
- lose_weight()
Delete data of the instance.
- regex(categories: List[konfuzio_sdk.data.Category], update=False) Dict
Calculate regex to be used in the Extraction AI.
- spans(categories: List[konfuzio_sdk.data.Category], use_correct=True, ignore_below_threshold=False) List[konfuzio_sdk.data.Span]
Return all Spans belonging to an Annotation of this Label.
- spans_not_found_by_tokenizer(tokenizer, categories: List[konfuzio_sdk.data.Category], use_correct=False) List[konfuzio_sdk.data.Span]
Find Label Spans that are not found by a tokenizer.
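The idea behind spans_not_found_by_tokenizer can be sketched without the SDK: run a tokenizer over the text and report the labeled offset ranges it does not reproduce exactly. The whitespace tokenizer and tuple representation below are illustrative assumptions, not the SDK's tokenizers.

```python
import re

# Illustrative analogue (hypothetical whitespace tokenizer, not the SDK's):
# report labeled (start, end) Spans the tokenizer does not produce exactly.
def whitespace_token_offsets(text):
    return {(m.start(), m.end()) for m in re.finditer(r"\S+", text)}

def spans_not_found(text, labeled_spans):
    found = whitespace_token_offsets(text)
    return [span for span in labeled_spans if span not in found]

text = "Invoice No. 2023-001 due 2023-12-24"
labeled = [(12, 20), (25, 35), (0, 11)]  # "2023-001", "2023-12-24", "Invoice No."
print(spans_not_found(text, labeled))  # → [(0, 11)]  ("Invoice No." spans two tokens)
```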
Label Set¶
- class konfuzio_sdk.data.LabelSet(project, labels=None, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, is_default=False, categories=None, has_multiple_annotation_sets=False, **kwargs)
A Label Set is a group of Labels.
For more details see https://dev.konfuzio.com/sdk/explanations.html#label-set-concept
- add_category(category: konfuzio_sdk.data.Category)
Add Category to the Label Set, if it does not exist.
- Parameters
category – Category to add to the Label Set
- add_label(label)
Add Label to Label Set, if it does not exist.
- Parameters
label – Label ID to be added
- get_target_names(use_separate_labels: bool)
Get target string name for Annotation Label classification.
Category¶
- class konfuzio_sdk.data.Category(project, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, *args, **kwargs)
Group Documents in a Project.
For more details see https://dev.konfuzio.com/sdk/explanations.html#category-concept
- add_label_set(label_set)
Add Label Set to Category.
- property default_label_set
Get the default Label Set of the Category.
- documents()
Filter for Documents of this Category.
- exclusive_first_page_strings(tokenizer) set
Return a set of strings exclusive for first Pages of Documents within the Category.
- Parameters
tokenizer – A tokenizer to process Documents before gathering strings.
- property fallback_name: str
Turn the Category name to lowercase, remove parentheses along with their contents, and trim spaces.
- property labels
Return the Labels that belong to the Category and its Label Sets.
- test_documents()
Filter for test Documents of this Category.
Category Annotation¶
- class konfuzio_sdk.data.CategoryAnnotation(category: konfuzio_sdk.data.Category, confidence: Optional[float] = None, page: Optional[konfuzio_sdk.data.Page] = None, document: Optional[konfuzio_sdk.data.Document] = None, id_: Optional[int] = None)
Annotate the Category of a Page.
For more details see https://dev.konfuzio.com/sdk/explanations.html#category-annotation-concept
- property confidence: float
Get the confidence of this Category Annotation.
If the confidence was not set, it means it was never predicted by an AI. Thus, the returned value will be 0, unless it was set by a human, in which case it defaults to 1.
- Returns
Confidence between 0.0 and 1.0 included.
- set_revised() None
Set this Category Annotation as revised by human, and thus the correct one for the linked Page.
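The confidence fallback documented above (0 when never predicted, 1 when set by a human) can be sketched as follows. The class and attribute names are hypothetical; only the documented fallback behaviour is taken from the reference.

```python
# Sketch of the confidence fallback described above (hypothetical class
# mirroring the documented behaviour, not the SDK source).
class CategoryAnnotationSketch:
    def __init__(self, confidence=None, set_by_human=False):
        self._confidence = confidence
        self.set_by_human = set_by_human

    @property
    def confidence(self):
        if self._confidence is not None:
            return self._confidence
        # Never predicted by an AI: 0.0, unless a human set it, then 1.0.
        return 1.0 if self.set_by_human else 0.0

print(CategoryAnnotationSketch().confidence)                   # → 0.0
print(CategoryAnnotationSketch(set_by_human=True).confidence)  # → 1.0
print(CategoryAnnotationSketch(confidence=0.73).confidence)    # → 0.73
```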
Document¶
- class konfuzio_sdk.data.Document(project: konfuzio_sdk.data.Project, id_: Optional[int] = None, file_url: Optional[str] = None, status: Optional[List[Union[int, str]]] = None, data_file_name: Optional[str] = None, is_dataset: Optional[bool] = None, dataset_status: Optional[int] = None, updated_at: Optional[str] = None, assignee: Optional[int] = None, category_template: Optional[int] = None, category: Optional[konfuzio_sdk.data.Category] = None, category_confidence: Optional[float] = None, category_is_revised: bool = False, text: Optional[str] = None, bbox: Optional[dict] = None, bbox_validation_type=None, pages: Optional[list] = None, update: bool = False, copy_of_id: Optional[int] = None, *args, **kwargs)
Access the information about one Document, which is available online.
For more details see https://dev.konfuzio.com/sdk/explanations.html#document-concept
- add_annotation(annotation: konfuzio_sdk.data.Annotation)
Add an Annotation to a Document.
The Annotation is only added to the Document if the data validation tests are passing for this Annotation. See https://dev.konfuzio.com/sdk/tutorials/data_validation/index.html
- Parameters
annotation – Annotation to add in the Document
- Returns
Input Annotation.
- add_annotation_set(annotation_set: konfuzio_sdk.data.AnnotationSet)
Add the Annotation Sets to the Document.
- add_page(page: konfuzio_sdk.data.Page)
Add a Page to a Document.
- annotation_sets(label_set: Optional[konfuzio_sdk.data.LabelSet] = None) List[konfuzio_sdk.data.AnnotationSet]
Return Annotation Sets of the Document.
- Parameters
label_set – Label Set for which to filter the Annotation Sets.
- Returns
Annotation Sets of the Document.
- annotations(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = True, ignore_below_threshold: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Annotation]
Filter available annotations.
- Parameters
label – Label for which to filter the Annotations.
use_correct – Whether to filter for correct Annotations.
ignore_below_threshold – Whether to filter out Annotations with confidence below the Label prediction threshold.
- Returns
Annotations in the document.
- property bbox_dict: Dict
Get a dictionary of Document’s character-level Bboxes.
- property bboxes: Dict[int, konfuzio_sdk.data.Bbox]
Use the cached bbox version.
- property category: konfuzio_sdk.data.Category
Return the Category of the Document.
The Category of a Document is only defined as long as all Pages have the same Category. Otherwise, the Document should probably be split into multiple Documents with a consistent Category assignment within their Pages, or the Category for each Page should be manually revised.
- property category_annotations: List[konfuzio_sdk.data.CategoryAnnotation]
Collect Category Annotations and average confidence across all Pages.
- Returns
List of Category Annotations, one for each Category.
- check_annotations(update_document: bool = False) bool
Check if Annotations are valid - no duplicates and correct Category.
- check_bbox() None
Run validation checks on the Document text and bboxes.
This is run when the Document is initialized, and usually it’s not needed to be run again because a Document’s text and bboxes are not expected to change within the Konfuzio Server.
You can run this manually instead if your pipeline allows changing the text or the bbox during the lifetime of a document. Will raise ValueError if the bboxes don’t match with the text of the document, or if bboxes have invalid coordinates (outside page borders) or invalid size (negative width or height).
This check is usually slow, and it can be made faster by calling Document.set_text_bbox_hashes() right after initializing the Document, which will enable running a hash comparison during this check.
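The hash-based shortcut mentioned above can be sketched in plain Python: store digests of the text and bboxes once, then skip the expensive validation whenever the digests are unchanged. The digest helper is an illustrative assumption, not the SDK's set_text_bbox_hashes implementation.

```python
import hashlib

# Sketch of the hash-comparison speedup described above (assumed mechanics).
def digest(*parts):
    h = hashlib.sha256()
    for part in parts:
        h.update(repr(part).encode())
    return h.hexdigest()

text = "Hello"
bboxes = {0: (1.0, 2.0, 3.0, 4.0)}
stored = digest(text, bboxes)  # analogous to Document.set_text_bbox_hashes()

# Later, before re-running the slow checks:
needs_full_check = digest(text, bboxes) != stored
print(needs_full_check)  # → False (nothing changed, full validation can be skipped)
```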
- create_subdocument_from_page_range(start_page: konfuzio_sdk.data.Page, end_page: konfuzio_sdk.data.Page, include=False)
Create a shorter Document from a Page range of an initial Document.
- Parameters
start_page (Page) – A Page that the new sub-Document starts with.
end_page (Page) – A Page that the new sub-Document ends with, if include is True.
include (bool) – Whether end_page is included into the new sub-Document.
- Returns
A new sub-Document.
- property default_annotation_set: konfuzio_sdk.data.AnnotationSet
Return the default Annotation Set of the Document.
- delete(delete_online: bool = False)
Delete Document.
- delete_document_details()
Delete all local content information for the Document.
- property document_folder
Get the path to the folder where all the Document information is cached locally.
- download_document_details()
Retrieve data from a Document online in case Document has finished processing.
Data includes the Document’s status, the URL of its file, its file name, the date of last update, its text and pagination, Annotations and Annotation Sets and, optionally, Category information.
- eval_dict(use_view_annotations=False, use_correct=False, ignore_below_threshold=False) List[dict]
Use this dict to evaluate Documents. The speciality: For every Span of an Annotation create one entry.
- evaluate_regex(regex, label: konfuzio_sdk.data.Label, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None)
Evaluate a regex based on the Document.
- property file_path
Return path to file.
- classmethod from_file(path: str, project: konfuzio_sdk.data.Project, dataset_status: int = 0, category_id: Optional[int] = None, callback_url: str = '', timeout: Optional[int] = None, sync: bool = True) konfuzio_sdk.data.Document
Initialize Document from file with synchronous API call.
This class method will wait for the Document to be processed by the server and then return the new Document, which may take some time. When uploading many Documents, it is advised to set the sync option to False.
- Parameters
path – Path to file to be uploaded
project – Project the Document is uploaded to
dataset_status – Dataset status of the Document (None: 0 Preparation: 1 Training: 2 Test: 3 Excluded: 4)
category_id – Category the Document belongs to (if unset, it will be assigned one by the server)
callback_url – Callback URL receiving POST call once extraction is done
timeout – Number of seconds to wait for response from the server
sync – Whether to wait for the file to be processed by the server
- Returns
New Document
- get_annotation_by_id(annotation_id: int) konfuzio_sdk.data.Annotation
Return an Annotation by ID, searching within the Document.
- Parameters
annotation_id – ID of the Annotation to get.
- get_annotation_set_by_id(id_: int) konfuzio_sdk.data.AnnotationSet
Return an Annotation Set by ID.
- Parameters
id_ – ID of the Annotation Set to get.
- get_annotations() List[konfuzio_sdk.data.Annotation]
Get Annotations of the Document.
- get_bbox() Dict
Get bbox information per character of file. We don’t store bbox as an attribute to save memory.
- Returns
Bounding box information per character in the Document.
- get_bbox_by_page(page_index: int) Dict[str, Dict]
Return a dictionary of all character bboxes on a Page.
- get_file(ocr_version: bool = True, update: bool = False)
Get OCR version of the original file.
- Parameters
ocr_version – Whether to get the OCR version of the original file
update – Update the downloaded file even if it is already available
- Returns
Path to the selected file.
- get_images(update: bool = False)
Get Document Pages as PNG images.
- Parameters
update – Update the downloaded images even if they are already available
- Returns
Path to PNG files.
- get_page_by_id(page_id: int, original: bool = False) konfuzio_sdk.data.Page
Get a Page by its ID.
- Parameters
page_id (int) – An ID of the Page to fetch.
- get_page_by_index(page_index: int)
Return the Page by index.
- get_segmentation(timeout: Optional[int] = None, num_retries: Optional[int] = None) List
Retrieve the segmentation results for the Document.
- Parameters
timeout – Number of seconds to wait for response from the server.
num_retries – Number of retries if the request fails.
- Returns
A list of segmentation results for each Page in the Document.
- get_text_in_bio_scheme(update=False) List[Tuple[str, str]]
Get the text of the Document in the BIO scheme.
- Parameters
update – Update the BIO annotations even if they are already available
- Returns
list of tuples with each word in the text and its respective label
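A BIO ("begin, inside, outside") tagging like the one returned above can be sketched in plain Python. The helper below is hypothetical, not the SDK source: words inside a labeled span get B-<label> for the first word and I-<label> afterwards, all other words get O.

```python
# Illustrative BIO tagging sketch (not the SDK implementation).
def to_bio(words, labeled_spans):
    """words: list of (word, start, end); labeled_spans: list of (start, end, label)."""
    tags = []
    for word, w_start, w_end in words:
        tag = "O"
        for s, e, label in labeled_spans:
            if w_start >= s and w_end <= e:
                tag = ("B-" if w_start == s else "I-") + label
                break
        tags.append((word, tag))
    return tags

words = [("John", 0, 4), ("Doe", 5, 8), ("pays", 9, 13), ("100", 14, 17)]
spans = [(0, 8, "Name"), (14, 17, "Amount")]
print(to_bio(words, spans))
# → [('John', 'B-Name'), ('Doe', 'I-Name'), ('pays', 'O'), ('100', 'B-Amount')]
```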
- lose_weight()
Remove NO_LABEL, wrong and below threshold Annotations.
- property maximum_confidence_category: Optional[konfuzio_sdk.data.Category]
Get the human revised Category of this Document, or the highest confidence one if not revised.
- Returns
The found Category, or None if not present.
- property maximum_confidence_category_annotation: Optional[konfuzio_sdk.data.CategoryAnnotation]
Get the human revised Category Annotation of this Document, or the highest confidence one if not revised.
- Returns
The found Category Annotation, or None if not present.
- property no_label_annotation_set: konfuzio_sdk.data.AnnotationSet
Return the Annotation Set for project.no_label Annotations.
We need to load the Annotation Sets from the Server first (call self.annotation_sets()). If we create the no_label_annotation_set first, the data from the Server will not be loaded anymore, because _annotation_sets will no longer be None.
- property number_of_lines: int
Calculate the number of lines.
- property number_of_pages: int
Calculate the number of Pages.
- property ocr_file_path
Return path to OCR PDF file.
- property ocr_ready
Check if Document OCR is ready.
- pages() List[konfuzio_sdk.data.Page]
Get Pages of Document.
- propose_splitting(splitting_ai) List
Propose splitting for a multi-file Document.
- Parameters
splitting_ai – An initialized SplittingAI class
- save()
Save all local changes to Document to server.
- save_meta_data()
Save local changes to Document metadata to server.
- set_bboxes(characters: Dict[int, konfuzio_sdk.data.Bbox])
Set character Bbox dictionary.
- set_category(category: konfuzio_sdk.data.Category) None
Set the Category of the Document and the Category of all of its Pages as revised.
- set_text_bbox_hashes() None
Update hashes of Document text and bboxes. Can be used for checking later on if any changes happened.
- spans(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Span]
Return all Spans of the Document.
- property text
Get Document text. Once loaded stored in memory.
- update()
Update Document information.
- update_meta_data(assignee: Optional[int] = None, category_template: Optional[int] = None, category: Optional[konfuzio_sdk.data.Category] = None, data_file_name: Optional[str] = None, dataset_status: Optional[int] = None, status: Optional[List[Union[int, str]]] = None, **kwargs)
Update document metadata information.
- view_annotations(start_offset: int = 0, end_offset: Optional[int] = None) List[konfuzio_sdk.data.Annotation]
Get the best Annotations, where the Spans are not overlapping.
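One way to select the "best" non-overlapping Annotations, as view_annotations describes, is a greedy pass in descending confidence. The greedy strategy and the tuple representation below are assumptions for illustration; the SDK may resolve overlaps differently.

```python
# Sketch of picking non-overlapping Annotations greedily by confidence
# (an assumed strategy, not necessarily the SDK's algorithm).
def best_non_overlapping(annotations):
    """annotations: list of (start, end, confidence); highest confidence wins."""
    chosen = []
    for start, end, conf in sorted(annotations, key=lambda a: -a[2]):
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, conf))
    return sorted(chosen)

anns = [(0, 5, 0.9), (3, 8, 0.95), (10, 12, 0.5)]
print(best_non_overlapping(anns))  # → [(3, 8, 0.95), (10, 12, 0.5)]
```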
Page¶
- class konfuzio_sdk.data.Page(id_: Optional[int], document: konfuzio_sdk.data.Document, number: int, original_size: Tuple[float, float], image_size: Tuple[int, int] = (None, None), start_offset: Optional[int] = None, end_offset: Optional[int] = None, copy_of_id: Optional[int] = None)
Access the information about one Page of a Document.
For more details see https://dev.konfuzio.com/sdk/explanations.html#page-concept
- add_category_annotation(category_annotation: konfuzio_sdk.data.CategoryAnnotation)
Annotate a Page with a Category and confidence information.
- annotation_sets() List[konfuzio_sdk.data.AnnotationSet]
Show all Annotation Sets related to Annotations of the Page.
- annotations(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = True, ignore_below_threshold: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Annotation]
Get Page Annotations.
- property category: Optional[konfuzio_sdk.data.Category]
Get the Category of the Page, based on human revised Category Annotation, or on highest confidence.
- get_annotations_image(display_all: bool = False) PIL.Image.Image
Get Document Page as PNG with Annotations shown.
- get_bbox()
Get bbox information per character of Page.
- get_category_annotation(category, add_if_not_present: bool = False) konfuzio_sdk.data.CategoryAnnotation
Retrieve the Category Annotation associated with a specific Category within this Page.
If no Category Annotation is found for the provided Category, one can be created based on the add_if_not_present argument.
- Parameters
category (Category) – The Category for which to retrieve the Category Annotation.
add_if_not_present (bool) – If True, a Category Annotation will be added to the current Page if none is found. If False, a dummy Category Annotation will be created, not linked to any Document or Page.
- Returns
The located or newly created Category Annotation.
- Return type
CategoryAnnotation
- get_image(update: bool = False) PIL.Image.Image
Get Page as a Pillow Image object.
The Page image is loaded from a PNG file at Page.image_path. If the file is not present, or if update is True, it will be downloaded from the Konfuzio Host. Alternatively, if you don’t want to use a file, you can provide the image as bytes to Page.image_bytes. Then call this method to convert the bytes into a Pillow Image. In every case, the return value of this method and the attribute Page.image will be a Pillow Image.
- Parameters
update – Whether to force download the Page PNG file.
- Returns
A Pillow Image object for this Page’s image.
- get_original_page() konfuzio_sdk.data.Page
Return an “original” Page in case the current Page is a copy without an ID.
An “original” Page is a Page from the Document that is not a copy and not a Virtual Document. This Page has an ID.
The method is used in the File Splitting pipeline to retain the original Document’s information in the Sub-Documents that were created from its splitting. The original Document is a Document that has an ID and is not a deepcopy.
- lines() List[konfuzio_sdk.data.Span]
Return sorted list of Spans for each line in the Page.
- property maximum_confidence_category_annotation: Optional[konfuzio_sdk.data.CategoryAnnotation]
Get the human revised Category Annotation of this Page, or the highest confidence one if not revised.
- Returns
The found Category Annotation, or None if not present.
- property number_of_lines: int
Calculate the number of lines in Page.
- set_category(category: konfuzio_sdk.data.Category) None
Set the Category of the Page.
- Parameters
category – The Category to set for the Page.
- spans(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Span]
Return all Spans of the Page.
- property text
Get Document text corresponding to the Page.
- view_annotations() List[konfuzio_sdk.data.Annotation]
Get the best Annotations, where the Spans are not overlapping in Page.
Project¶
- class konfuzio_sdk.data.Project(id_: Optional[int], project_folder=None, update=False, max_ram=None, strict_data_validation: bool = True, credentials: dict = {}, **kwargs)
Access the information of a Project.
For more details see https://dev.konfuzio.com/sdk/explanations.html#project-concept
- add_category(category: konfuzio_sdk.data.Category)
Add Category to Project if it does not exist.
- Parameters
category – Category to add in the Project
- add_document(document: konfuzio_sdk.data.Document)
Add Document to Project, if it does not exist.
- add_label(label: konfuzio_sdk.data.Label)
Add Label to Project, if it does not exist.
- Parameters
label – Label to add in the Project
- add_label_set(label_set: konfuzio_sdk.data.LabelSet)
Add Label Set to Project, if it does not exist.
- Parameters
label_set – Label Set to add in the Project
- property ai_models
Return all AIs.
- create_project_metadata_dict() Dict
Create a dictionary that mimics the file categories_label_sets.json5 saved in the Project folder for restoring Projects within Bento containers.
- del_document_by_id(document_id: int, delete_online: bool = False) konfuzio_sdk.data.Document
Delete Document by its ID.
- delete()
Delete the Project folder.
- property documents
Return Documents with status training.
- property documents_folder: str
Calculate the documents folder of the Project.
- download_project_data(training_test_only=True, category_id=None) None
Migrate your Project to another HOST.
See https://dev.konfuzio.com/web/migration-between-konfuzio-server-instances/index.html
- property excluded_documents
Return Documents which have been excluded.
- export_project_data(include_ais=False, training_and_test_documents=True, documents_with_status=False, *args, **kwargs) None
Export the Project data including Training, Test Documents and AI models.
- Parameters
include_ais – Whether to include AI models in the export.
training_and_test_documents – Whether to include Training and Test Documents in the export.
- get(update=False, **kwargs)
Access meta information of the Project.
- Parameters
update – Update the downloaded information even if it is already available
- get_categories(reload: bool = True)
Load Categories for all Label Sets in the Project.
- get_category_by_id(id_: int) konfuzio_sdk.data.Category
Return a Category by ID.
- Parameters
id – ID of the Category to get.
- get_category_by_name(category_name: Optional[str] = None) List[konfuzio_sdk.data.Category]
Get Category by the match to its name or clean_name.
- Parameters
category_name – A string to search the Category.
- Returns
A Category that has a matching name or a clean name.
- get_credentials(key)
Return the value of the key in the credentials dict or in the config file.
Returns None and emits a warning if the key is not found.
- Parameters
key – Key of the credential to get.
- get_document_by_id(document_id: int) konfuzio_sdk.data.Document
Return Document by its ID.
- get_label_by_id(id_: int) konfuzio_sdk.data.Label
Return a Label by ID.
- Parameters
id – ID of the Label to get.
- get_label_by_name(name: str) konfuzio_sdk.data.Label
Return Label by its name.
- get_label_set_by_id(id_: int) konfuzio_sdk.data.LabelSet
Return a Label Set by ID.
- Parameters
id – ID of the Label Set to get.
- get_label_set_by_name(name: str) konfuzio_sdk.data.LabelSet
Return a Label Set by name.
- get_label_sets(reload=False)
Get LabelSets in the Project.
- get_labels(reload=False) List[konfuzio_sdk.data.Label]
Get ID and name of any Label in the Project.
- get_meta(reload=False)
Get the list of all Documents in the Project and their information.
- Returns
Information of the Documents in the Project.
- init_or_update_document(from_online=False, category_id=None)
Initialize or update Documents from local files to then decide about full, incremental or no update.
- Parameters
from_online – If True, all Document metadata info is first reloaded with latest changes in the server
- property label_sets
Return Project LabelSets.
- property labels
Return Project Labels.
- lose_weight()
Delete data of the instance.
- property max_ram
Return maximum memory used by AI models.
- property meta_data
Return Project meta data.
- property model_folder: str
Calculate the model folder of the Project.
- property no_status_documents
Return Documents with no status.
- property online_documents_dict: Dict
Return a dictionary of online documents using their id as key.
- property preparation_documents
Return Documents with status preparation.
- property project_folder: str
Calculate the data folder of the Project.
- property regex_folder: str
Calculate the regex folder of the Project.
- property test_documents
Return Documents with status test.
- property virtual_documents
Return Documents created virtually.
- write_meta_of_files(*args, **kwargs)
Overwrite meta-data of Documents in Project.
- write_project_files(*args, **kwargs)
Overwrite files with Project, Label, Label Set information.
API call wrappers¶
Connect to the Konfuzio Server to receive or send data.
TimeoutHTTPAdapter¶
- class konfuzio_sdk.api.TimeoutHTTPAdapter(timeout, *args, **kwargs)
Combine a retry strategy with a timeout strategy.
- build_response(req, resp)
Throw error for any HTTPError that is not part of the retry strategy.
- send(request, *args, **kwargs)
Use timeout policy if not otherwise declared.
- konfuzio_sdk.api.init_env(user: str, password: str, host: str = 'https://app.konfuzio.com', working_directory='/builds/konfuzio/dev', file_ending: str = '.env')¶
Add the .env file to the working directory.
- Parameters
user – Username to log in to the host
password – Password to log in to the host
host – URL of host.
working_directory – Directory where file should be added
file_ending – Ending of file.
- konfuzio_sdk.api.konfuzio_session(token: Optional[str] = None, timeout: Optional[int] = None, num_retries: Optional[int] = None, host: Optional[str] = None)¶
Create a session incl. Token to the KONFUZIO_HOST.
- Parameters
token – Konfuzio Token to connect to the host.
timeout – Timeout in seconds.
num_retries – Number of retries if the request fails.
host – Host to connect to.
- Returns
Request session.
- konfuzio_sdk.api.get_project_list(session=None)¶
Get the list of all Projects for the user.
- Parameters
session – Konfuzio session with Retry and Timeout policy
- Returns
Response object
- konfuzio_sdk.api.get_project_details(project_id: int, session=None) dict ¶
Get Project’s metadata.
- Parameters
project_id – ID of the Project
session – Konfuzio session with Retry and Timeout policy
- Returns
Project metadata
- konfuzio_sdk.api.get_project_labels(project_id: int, session=None) dict ¶
Get Project’s Labels.
- Parameters
project_id – An ID of a Project to get Labels from.
session – Konfuzio session with Retry and Timeout policy
- konfuzio_sdk.api.get_project_label_sets(project_id: int, session=None) dict ¶
Get Project’s Label Sets.
- Parameters
project_id – An ID of a Project to get Label Sets from.
session – Konfuzio session with Retry and Timeout policy
- konfuzio_sdk.api.create_new_project(project_name, session=None)¶
Create a new Project for the user.
- Parameters
project_name – name of the project you want to create
session – Konfuzio session with Retry and Timeout policy
- Returns
Response object
- konfuzio_sdk.api.get_document_details(document_id: int, session=None)¶
Use the text-extraction server to retrieve the data from a document.
- Parameters
document_id – ID of the document
session – Konfuzio session with Retry and Timeout policy
- Returns
Data of the document.
- konfuzio_sdk.api.get_document_annotations(document_id: int, session=None)¶
Get Annotations of a Document.
- Parameters
document_id – ID of the Document.
session – Konfuzio session with Retry and Timeout policy
- Returns
List of the Annotations of the Document.
- konfuzio_sdk.api.get_document_bbox(document_id: int, session=None)¶
Get Bboxes for a Document.
- Parameters
document_id – ID of the Document.
session – Konfuzio session with Retry and Timeout policy
- Returns
List of Bboxes of characters in the Document
- konfuzio_sdk.api.get_page_image(document_id: int, page_number: int, session=None, thumbnail: bool = False)¶
Load image of a Page as Bytes.
- Parameters
document_id – ID of the Document
page_number – Number of the Page
thumbnail – Download Page image as thumbnail
session – Konfuzio session with Retry and Timeout policy
- Returns
Bytes of the Image.
- konfuzio_sdk.api.post_document_annotation(document_id: int, spans: List, label_id: int, confidence: Optional[float] = None, revised: bool = False, is_correct: bool = False, session=None, **kwargs)¶
Add an Annotation to an existing document.
You must specify either annotation_set_id or label_set_id.
Use annotation_set_id if an Annotation Set already exists. You can find the list of existing Annotation Sets by using the GET endpoint of the Document.
Using label_set_id will create a new Annotation Set associated with that Label Set. You can only do this if the Label Set has has_multiple_sections set to True.
- Parameters
document_id – ID of the file
spans – Spans that constitute the Annotation
label_id – ID of the Label
confidence – Confidence of the Annotation (still called Accuracy by the text-annotation server)
revised – If the Annotation is revised or not (bool)
is_correct – If the Annotation is corrected or not (bool)
session – Konfuzio session with Retry and Timeout policy
- Returns
Response status.
- konfuzio_sdk.api.change_document_annotation(annotation_id: int, session=None, **kwargs)¶
Change something about an Annotation.
- Parameters
annotation_id – ID of an Annotation to be changed
session – Konfuzio session with Retry and Timeout policy
- Returns
Response status.
- konfuzio_sdk.api.delete_document_annotation(annotation_id: int, session=None, delete_from_database: bool = False, **kwargs)¶
Delete a given Annotation of the given document.
For AI training purposes, we recommend setting delete_from_database to False if you don’t want to remove the Annotation permanently. This creates a negative feedback Annotation instead of removing it from the database.
- Parameters
annotation_id – ID of the annotation
session – Konfuzio session with Retry and Timeout policy
- Returns
Response status.
- konfuzio_sdk.api.update_document_konfuzio_api(document_id: int, session=None, **kwargs)¶
Update an existing Document via Konfuzio API.
- Parameters
document_id – ID of the document
session – Konfuzio session with Retry and Timeout policy
- Returns
Response status.
- konfuzio_sdk.api.download_file_konfuzio_api(document_id: int, ocr: bool = True, session=None)¶
Download file from the Konfuzio server using the Document id_.
Django authentication is form-based, whereas DRF uses BasicAuth.
- Parameters
document_id – ID of the document
ocr – Bool to get the ocr version of the document
session – Konfuzio session with Retry and Timeout policy
- Returns
The downloaded file.
- konfuzio_sdk.api.get_results_from_segmentation(doc_id: int, project_id: int, session=None) List[List[dict]] ¶
Get bbox results from segmentation endpoint.
- Parameters
doc_id – ID of the document
project_id – ID of the Project.
session – Konfuzio session with Retry and Timeout policy
- konfuzio_sdk.api.get_project_categories(project_id: Optional[int] = None, session=None) List[Dict] ¶
Get a list of Categories of a Project.
- Parameters
project_id – ID of the Project.
session – Konfuzio session with Retry and Timeout policy
- konfuzio_sdk.api.upload_ai_model(ai_model_path: str, project_id: Optional[int] = None, category_id: Optional[int] = None, session=None)¶
Upload an ai_model to the text-annotation server.
- Parameters
ai_model_path – Path to the ai_model
project_id – An ID of a Project to which the AI is uploaded. Needed for the File Splitting and Categorization AIs because they function on a Project level.
category_id – An ID of a Category on which the AI is trained. Needed for the Extraction AI because it functions on a Category level and requires a single Category.
session – session to connect to the server
- Raises
ValueError when neither project_id nor category_id is specified.
- Raises
HTTPError when a request is unsuccessful.
- konfuzio_sdk.api.delete_ai_model(ai_model_id: int, ai_type: str, session=None)¶
Delete an AI model from the server.
- Parameters
ai_model_id – an ID of the model to be deleted.
ai_type – Should be one of the following: ‘filesplitting’, ‘extraction’, ‘categorization’.
session – session to connect to the server.
- Raises
ValueError if ai_type is not correctly specified.
- Raises
ConnectionError when a request is unsuccessful.
- konfuzio_sdk.api.update_ai_model(ai_model_id: int, ai_type: str, patch: bool = True, session=None, **kwargs)¶
Update an AI model from the server.
- Parameters
ai_model_id – an ID of the model to be updated.
ai_type – Should be one of the following: ‘filesplitting’, ‘extraction’, ‘categorization’.
patch – If true, adds info instead of replacing it.
session – session to connect to the server.
- Raises
ValueError if ai_type is not correctly specified.
- Raises
HTTPError when a request is unsuccessful.
- konfuzio_sdk.api.get_all_project_ais(project_id: int, session=None) dict ¶
Fetch all types of AIs for a specific project.
- Parameters
project_id – ID of the Project
session – Konfuzio session with Retry and Timeout policy
- Returns
Dictionary with lists of all AIs for a specific project
- konfuzio_sdk.api.export_ai_models(project, session=None, category_id=None) int ¶
Export all AI Model files for a specific Project.
- Parameters
project – Konfuzio Project
category_id – Only select AIs for a specific Category of a Project
- Returns
Number of exported AIs
CLI tools¶
Command Line interface to the konfuzio_sdk package.
- konfuzio_sdk.cli.parse_args(parser)¶
Parse command line arguments using sub-parsers for each command.
- konfuzio_sdk.cli.credentials(args)¶
Retrieve user input or use CLI arguments.
Extras¶
Initialize AI-related dependencies safely.
PackageWrapper¶
- class konfuzio_sdk.extras.PackageWrapper(package_name: str, required_for_modules: Optional[List[str]] = None)
Heavy dependencies are encapsulated and handled if they are not part of the lightweight SDK installation.
ModuleWrapper¶
- class konfuzio_sdk.extras.ModuleWrapper(module: str)
Handle missing dependencies’ classes to avoid metaclass conflict.
Normalization¶
Convert the Span according to the data_type of the Annotation.
- konfuzio_sdk.normalize.normalize_to_float(offset_string: str) Optional[float] ¶
Given an offset_string: str this function tries to translate the offset-string to a number.
- konfuzio_sdk.normalize.normalize_to_positive_float(offset_string: str) Optional[float] ¶
Given an offset_string this function tries to translate the offset-string to an absolute number (ignores +/-).
- konfuzio_sdk.normalize.normalize_to_percentage(offset_string: str) Optional[float] ¶
Given an offset_string, this function tries to translate the offset-string to a percentage: a float between 0 and 1.
- konfuzio_sdk.normalize.normalize_to_date(offset_string: str) Optional[str] ¶
Given an offset_string, this function tries to translate the offset-string to a date in the format ‘DD.MM.YYYY’.
- konfuzio_sdk.normalize.normalize_to_bool(offset_string: str)¶
Given an offset_string this function tries to translate the offset-string to a bool.
- konfuzio_sdk.normalize.roman_to_float(offset_string: str) Optional[float] ¶
Convert a Roman numeral to a number.
- konfuzio_sdk.normalize.normalize(offset_string, data_type)¶
Wrap all normalize functionality.
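As an illustration of what these normalizers do, the date case can be approximated with stdlib parsing. This is a minimal sketch, not the real implementation: the set of input formats konfuzio_sdk.normalize.normalize_to_date actually supports is an assumption here.

```python
from datetime import datetime
from typing import Optional


def normalize_to_date_sketch(offset_string: str) -> Optional[str]:
    """Translate a raw offset string into 'DD.MM.YYYY', or None if unparseable.

    Simplified stand-in for konfuzio_sdk.normalize.normalize_to_date; the
    real normalizer handles many more formats and locales.
    """
    candidate_formats = ['%d.%m.%Y', '%d.%m.%y', '%Y-%m-%d', '%d/%m/%Y']
    for fmt in candidate_formats:
        try:
            # Re-emit any successfully parsed date in the documented target format.
            return datetime.strptime(offset_string.strip(), fmt).strftime('%d.%m.%Y')
        except ValueError:
            continue
    return None
```

A string that matches none of the candidate formats yields None, mirroring the Optional[str] return type documented above.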
Utils¶
Utils for the konfuzio sdk package.
- konfuzio_sdk.utils.sdk_isinstance(instance, klass)¶
Implement a custom isinstance which is compatible with cloudpickle saving by value.
When using cloudpickle with “register_pickle_by_value” the classes of “konfuzio.data” will be loaded in the “types” module. For this case the builtin method “isinstance” will return False because it tries to compare “types.Document” with “konfuzio_sdk.data.Document”.
- konfuzio_sdk.utils.exception_or_log_error(msg: str, handler: str = 'sdk', fail_loudly: typing.Optional[bool] = True, exception_type: typing.Optional[typing.Type[Exception]] = <class 'ValueError'>) None ¶
Log error or raise an exception.
This function is needed to control error handling in production. If fail_loudly is set to True, the function raises an exception of type exception_type with a message in the format `{"message": msg, "handler": handler}`. If fail_loudly is set to False, the function logs an error with msg using the logger.
- Parameters
msg – (str): The error message to be logged or raised.
handler – (str): The handler associated with the error. Defaults to “sdk”
fail_loudly – A flag indicating whether to raise an exception or log the error. Defaults to True.
exception_type – The type of exception to be raised. Defaults to ValueError.
- Returns
None
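The documented contract of exception_or_log_error can be re-implemented in a few lines; this sketch follows the message format described above, while the real function lives in konfuzio_sdk.utils.

```python
import logging
from typing import Optional, Type

logger = logging.getLogger(__name__)


def exception_or_log_error(
    msg: str,
    handler: str = 'sdk',
    fail_loudly: Optional[bool] = True,
    exception_type: Optional[Type[Exception]] = ValueError,
) -> None:
    """Raise exception_type with a structured message, or log the error."""
    if fail_loudly:
        # The payload format matches the one documented above.
        raise exception_type({'message': msg, 'handler': handler})
    logger.error(msg)
```

In production code a caller would set fail_loudly=False to degrade gracefully while keeping the error visible in the logs.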
- konfuzio_sdk.utils.get_id(include_time: bool = False) str ¶
Generate a unique ID.
- Parameters
include_time – Bool to include the time in the unique ID
- Returns
Unique ID
- konfuzio_sdk.utils.is_file(file_path, raise_exception=True, maximum_size=100000000, allow_empty=False) bool ¶
Check if file is available or raise error if it does not exist.
- Parameters
file_path – Path to the file to be checked
raise_exception – Will raise an exception if file is not available
maximum_size – Maximum size of the expected file, default < 100 mb
allow_empty – Bool to allow empty files
- Returns
True or false depending on the existence of the file
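The checks is_file performs can be sketched with os.path; exact edge-case behavior (e.g. which exception type is raised) is an assumption in this stand-in.

```python
import os


def is_file_sketch(file_path: str, raise_exception: bool = True,
                   maximum_size: int = 100_000_000, allow_empty: bool = False) -> bool:
    """Check that a file exists, is non-empty (unless allowed) and below maximum_size."""
    if os.path.isfile(file_path):
        size = os.path.getsize(file_path)
        if (size > 0 or allow_empty) and size < maximum_size:
            return True
    if raise_exception:
        raise FileNotFoundError(f'File invalid or not available: {file_path}')
    return False
```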
- konfuzio_sdk.utils.memory_size_of(obj) int ¶
Return memory size of object in bytes.
- konfuzio_sdk.utils.normalize_memory(memory: Union[None, str]) Optional[int] ¶
Return memory size in human-readable form to int of number of bytes.
- Parameters
memory – Memory size in human readable form (e.g. “50MB”).
- Returns
int of bytes if valid, else None
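A minimal sketch of the conversion described above, assuming binary (1024-based) units and the KB/MB/GB suffixes; the real function's accepted units and base are not specified here, so treat both as assumptions.

```python
import re
from typing import Optional, Union

UNIT_MULTIPLIERS = {'KB': 1024, 'MB': 1024 ** 2, 'GB': 1024 ** 3}


def normalize_memory_sketch(memory: Union[None, str, int]) -> Optional[int]:
    """Convert a human-readable memory size such as '50MB' to bytes, else None."""
    if isinstance(memory, int):
        return memory
    if isinstance(memory, str):
        match = re.fullmatch(r'\s*(\d+)\s*([KMG]B)\s*', memory, flags=re.IGNORECASE)
        if match:
            return int(match.group(1)) * UNIT_MULTIPLIERS[match.group(2).upper()]
    return None
```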
- konfuzio_sdk.utils.get_timestamp(konfuzio_format='%Y-%m-%d-%H-%M-%S') str ¶
Return formatted timestamp.
- Parameters
konfuzio_format – Format of the timestamp (e.g. year-month-day-hour-min-sec)
- Returns
Timestamp
- konfuzio_sdk.utils.load_image(input_file: Union[str, _io.BytesIO])¶
Load an image by path or via io.Bytes, e.g. via download by URL.
- Parameters
input_file – Path to image or image in bytes format
- Returns
Loaded image
- konfuzio_sdk.utils.get_file_type(input_file: Optional[Union[str, _io.BytesIO, bytes]] = None) str ¶
Get the type of a file.
- Parameters
input_file – Path to the file or file in bytes format
- Returns
Name of file type
- konfuzio_sdk.utils.get_file_type_and_extension(input_file: Optional[Union[str, _io.BytesIO, bytes]] = None) Tuple[str, str] ¶
Get the type of a file via the filetype library, which checks the magic bytes to see the internal format.
- Parameters
input_file – Path to the file or file in bytes format
- Returns
Name of file type
- konfuzio_sdk.utils.does_not_raise()¶
Serve as a complement to raise: a no-op context manager, does_not_raise.
See https://docs.pytest.org/en/latest/example/parametrize.html#parametrizing-conditional-raising
- konfuzio_sdk.utils.convert_to_bio_scheme(document) List[Tuple[str, str]] ¶
Mark all the entities in the text as per the BIO scheme.
The splitting uses the sequence of words, treating characters like “.” as separate tokens.
“Hello” -> O, “,” -> O, “it” -> O, “‘s” -> O, “Helm” -> B-ORG, “und” -> I-ORG, “Nagel” -> I-ORG, “.” -> O
- Parameters
document – Document to be converted into the bio scheme
- Returns
list of tuples with each word in the text and its respective Label
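The BIO tagging shown in the example above can be sketched over plain token lists. The span representation (start index, exclusive end index, label) is a simplification: the real convert_to_bio_scheme derives tokens and entity spans from a Document.

```python
from typing import List, Tuple


def to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Tag each token with O, or B-<label>/I-<label> inside an annotated span."""
    tags = ['O'] * len(tokens)
    for start, end, label in spans:
        tags[start] = f'B-{label}'           # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f'I-{label}'           # continuation tokens
    return list(zip(tokens, tags))


# Reproducing the documented example, with "Helm und Nagel" annotated as ORG:
tokens = ['Hello', ',', 'it', "'s", 'Helm', 'und', 'Nagel', '.']
tagged = to_bio(tokens, [(4, 7, 'ORG')])
```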
- konfuzio_sdk.utils.slugify(value)¶
Taken from https://github.com/django/django/blob/master/django/utils/text.py.
Convert to ASCII if ‘allow_unicode’ is False. Convert spaces or repeated dashes to single dashes. Remove characters that aren’t alphanumerics, underscores, or hyphens. Convert to lowercase. Also strip leading and trailing whitespace, dashes, and underscores.
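Following the cited Django implementation, the behavior with allow_unicode=False fixed can be sketched as:

```python
import re
import unicodedata


def slugify(value: str) -> str:
    """ASCII-fold, lowercase, drop non-alphanumerics, collapse whitespace/dashes to '-'."""
    # Decompose accented characters and drop anything outside ASCII.
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    # Remove characters that aren't alphanumerics, underscores, hyphens or whitespace.
    value = re.sub(r'[^\w\s-]', '', value.lower())
    # Collapse runs of whitespace/dashes, then strip leading/trailing '-' and '_'.
    return re.sub(r'[-\s]+', '-', value).strip('-_')
```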
- konfuzio_sdk.utils.amend_file_name(file_name: str, append_text: str = '', append_separator: str = '_', new_extension: Optional[str] = None) str ¶
Append text to a filename in front of extension.
example found here: https://stackoverflow.com/a/37487898
- Parameters
file_name – Name of a file, e.g. file.pdf
append_text – Text to append between the file name and the extension
new_extension – Change the file extension
- Returns
extended path to file
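The behavior described can be sketched with os.path.splitext; edge cases such as files without an extension are handled here by assumption, not per the real helper.

```python
import os
from typing import Optional


def amend_file_name(file_name: str, append_text: str = '',
                    append_separator: str = '_',
                    new_extension: Optional[str] = None) -> str:
    """Insert append_text between the file name and its extension."""
    stem, extension = os.path.splitext(file_name)
    if new_extension is not None:
        extension = new_extension
    if append_text:
        stem = f'{stem}{append_separator}{append_text}'
    return f'{stem}{extension}'
```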
- konfuzio_sdk.utils.amend_file_path(file_path: str, append_text: str = '', append_separator: str = '_', new_extension: Optional[str] = None)¶
Similar to amend_file_name however the file_name is interpreted as a full path.
- Parameters
file_path – Path to a file, e.g. file.pdf
append_text – Text to append between the file name and the extension
new_extension – Change the file extension
- Returns
extended path to file
- konfuzio_sdk.utils.get_sentences(text: str, offsets_map: Optional[dict] = None, language: str = 'german') List[dict] ¶
Split a text into sentences using the sentence tokenizer from the package nltk.
- Parameters
text – Text to split into sentences
offsets_map – mapping between the position of the character in the input text and the offset in the text of the document
language – language of the text
- Returns
List with a dict per sentence, containing its text and its start and end offsets in the text of the document.
- konfuzio_sdk.utils.map_offsets(characters_bboxes: list) dict ¶
Map the position of the character to its offset.
E.g. characters: x, y, z, w; character offsets: 2, 3, 20, 22.
The first character (x) has the offset 2, the fourth character (w) has the offset 22, and so on.
offsets_map: {0: 2, 1: 3, 2: 20, 3: 22}
- Parameters
characters_bboxes – Bounding boxes information of the characters.
- Returns
Mapping of the position of the characters and its offsets.
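The mapping in the example above can be sketched as follows; the key name 'char_index' for a character Bbox's text offset is an assumption about the Bbox dict format.

```python
def map_offsets(characters_bboxes: list) -> dict:
    """Map each character's position (0, 1, ...) to its offset in the Document text.

    Assumes each character Bbox dict carries its text offset under 'char_index'
    (a hypothetical key name for this sketch).
    """
    offsets = sorted(int(bbox['char_index']) for bbox in characters_bboxes)
    return dict(enumerate(offsets))
```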
- konfuzio_sdk.utils.detectron_get_paragraph_bboxes(detectron_document_results: List[List[Dict]], document) List[List[Bbox]] ¶
Get the detectron Bboxes corresponding to each paragraph.
- konfuzio_sdk.utils.iter_before_and_after(iterable, before=1, after=None, fill=None)¶
Iterate and provide before and after element. Generalized from http://stackoverflow.com/a/1012089.
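One plausible reading of this helper, generalized from the cited Stack Overflow recipe, yields (previous, item, next) triples with configurable look-back and look-ahead distances; how the real function shapes its output beyond the default case is an assumption.

```python
from itertools import chain, islice, tee


def iter_before_and_after(iterable, before=1, after=None, fill=None):
    """Yield (previous, item, next) triples, padding the edges with fill."""
    if after is None:
        after = before
    prev_it, cur_it, next_it = tee(iterable, 3)
    # Shift one copy back by `before` and one forward by `after`.
    prev_it = chain([fill] * before, prev_it)
    next_it = chain(islice(next_it, after, None), [fill] * after)
    return zip(prev_it, cur_it, next_it)
```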
- konfuzio_sdk.utils.get_sdk_version()¶
Get the version of the Konfuzio SDK currently in use.
- konfuzio_sdk.utils.get_spans_from_bbox(selection_bbox: Bbox) List[Span] ¶
Get a list of Spans for all the text contained within a Bbox.
- konfuzio_sdk.utils.normalize_name(value: str) str ¶
Normalize names for different Konfuzio concepts by removing slashes and checking for non-ascii symbols.
- Parameters
value – A name to be normalized.
Tokenizers¶
Generic tokenizer.
Abstract Tokenizer¶
- class konfuzio_sdk.tokenizer.base.AbstractTokenizer
Abstract definition of a Tokenizer.
- evaluate(document: konfuzio_sdk.data.Document) pandas.core.frame.DataFrame
Compare a Document with its tokenized version.
- Parameters
document – Document to evaluate
- Returns
Evaluation DataFrame
- evaluate_dataset(dataset_documents: List[konfuzio_sdk.data.Document]) konfuzio_sdk.evaluate.ExtractionEvaluation
Evaluate the tokenizer on a dataset of documents.
- Parameters
dataset_documents – Documents to evaluate
- Returns
ExtractionEvaluation instance
- abstract found_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]
Find all Spans in a Document that can be found by a Tokenizer.
- get_runtime_info() pandas.core.frame.DataFrame
Get the processing runtime information as DataFrame.
- Returns
processing time Dataframe containing the processing duration of all steps of the tokenization.
- lose_weight()
Delete processing steps.
- missing_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]
Apply a Tokenizer on a Document and find all Spans that cannot be found.
Use this approach to sequentially work on remaining Spans after a Tokenizer ran on a List of Documents.
- Parameters
document – A Document
- Returns
A list containing all missing Spans.
- abstract tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Create Annotations with 1 Span based on the result of the Tokenizer.
- Parameters
document – Document to tokenize, can have been tokenized before
- Returns
Document with Spans created by the Tokenizer.
List Tokenizer¶
- class konfuzio_sdk.tokenizer.base.ListTokenizer(tokenizers: List[konfuzio_sdk.tokenizer.base.AbstractTokenizer])
Use multiple tokenizers.
- found_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]
Run found_spans in the given order on a Document.
- lose_weight()
Delete processing steps.
- span_match(span: konfuzio_sdk.data.Span) bool
Run span_match in the given order.
- tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Run tokenize in the given order on a Document.
Rule Based Tokenizer¶
Regex tokenizers.
Regex Tokenizer¶
- class konfuzio_sdk.tokenizer.regex.RegexTokenizer(regex: str)
Tokenizer based on a single regex.
- found_spans(document: konfuzio_sdk.data.Document) List[konfuzio_sdk.data.Span]
Find Spans found by the Tokenizer and add Tokenizer info to Span.
- Parameters
document – Document with Annotation to find.
- Returns
List of Spans found by the Tokenizer.
- span_match(span: konfuzio_sdk.data.Span) bool
Check if Span is detected by Tokenizer.
- tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Create Annotations with 1 Span based on the result of the Tokenizer.
- Parameters
document – Document to tokenize, can have been tokenized before
- Returns
Document with Spans created by the Tokenizer.
- class konfuzio_sdk.tokenizer.regex.CapitalizedTextTokenizer¶
Tokenizer based on capitalized text.
- Example:
“Company is Company A&B GmbH now” -> “Company A&B GmbH”
- class konfuzio_sdk.tokenizer.regex.ColonOrWhitespacePrecededTokenizer¶
Tokenizer based on text preceded by a colon or a whitespace.
- Example:
“write to: name” -> “name”
- class konfuzio_sdk.tokenizer.regex.ColonPrecededTokenizer¶
Tokenizer based on text preceded by colon.
- Example:
“write to: name” -> “name”
- class konfuzio_sdk.tokenizer.regex.ConnectedTextTokenizer¶
Tokenizer based on text connected by 1 whitespace.
- Example:
r”This is \na description. Occupies a paragraph.” -> “This is”, “a description. Occupies a paragraph.”
- class konfuzio_sdk.tokenizer.regex.LineUntilCommaTokenizer¶
Tokenizer matching each line of text up to the first comma.
- Example:
“\n Company und A&B GmbH,\n” -> “Company und A&B GmbH”
- class konfuzio_sdk.tokenizer.regex.NonTextTokenizer¶
Tokenizer based on non-text elements: numbers and separators.
- Example:
“date 01. 01. 2022” -> “01. 01. 2022”
- class konfuzio_sdk.tokenizer.regex.NumbersTokenizer¶
Tokenizer based on numbers.
- Example:
“N. 1242022 123 ” -> “1242022 123”
- class konfuzio_sdk.tokenizer.regex.WhitespaceNoPunctuationTokenizer¶
Tokenizer based on whitespaces without punctuation.
- Example:
“street Name 1-2b,” -> “street”, “Name”, “1-2b”
- class konfuzio_sdk.tokenizer.regex.WhitespaceTokenizer¶
Tokenizer based on whitespaces.
- Example:
“street Name 1-2b,” -> “street”, “Name”, “1-2b,”
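Conceptually, WhitespaceTokenizer is a RegexTokenizer built around a whitespace-splitting pattern; the exact regex below is an assumption used to reproduce the documented example, not the SDK's actual pattern.

```python
import re

# Hypothetical pattern: one or more characters that are not whitespace separators.
WHITESPACE_REGEX = r'[^ \n\t\f]+'


def whitespace_token_spans(text: str):
    """Return (start_offset, end_offset, matched_text) per whitespace-separated token."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(WHITESPACE_REGEX, text)]
```

The offsets correspond to Span start and end offsets in the Document text.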
Sentence and Paragraph tokenizers.
Paragraph Tokenizer¶
- class konfuzio_sdk.tokenizer.paragraph_and_sentence.ParagraphTokenizer(mode: str = 'detectron', line_height_ratio: float = 0.8, height: Optional[Union[int, float]] = None, create_detectron_labels: bool = False)
Tokenizer splitting Document into paragraphs.
- found_spans(document: konfuzio_sdk.data.Document)
Find the Spans of the detected paragraphs in a Document.
- tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Create one multiline Annotation per paragraph detected.
Sentence Tokenizer¶
- class konfuzio_sdk.tokenizer.paragraph_and_sentence.SentenceTokenizer(mode: str = 'detectron', line_height_ratio: float = 0.8, height: Optional[Union[int, float]] = None, create_detectron_labels: bool = False)
Tokenizer splitting Document into Sentences.
- found_spans(document: konfuzio_sdk.data.Document)
Find the Spans of the detected sentences in a Document.
- tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Create one multiline Annotation per sentence detected.
Extraction AI¶
Extract information from Documents.
Conventional template matching based approaches fail to generalize well to document images of unseen templates, and are not robust against text recognition errors.
We follow the approach proposed by Sun et al. (2021), which encodes both the visual and textual features of detected text regions as graph nodes, with edges representing the spatial relations between neighboring text regions. Their experiments validate that all information, including visual features, textual features and spatial relations, can benefit key information extraction.
We reduce the hardware requirements from one NVIDIA Titan X GPU with 12 GB memory to one CPU and 16 GB memory by splitting the end-to-end pipeline into two parts.
Sun, H., Kuang, Z., Yue, X., Lin, C., & Zhang, W. (2021). Spatial Dual-Modality Graph Reasoning for Key Information Extraction. arXiv. https://doi.org/10.48550/ARXIV.2103.14470
Base Model¶
- class konfuzio_sdk.trainer.information_extraction.BaseModel
Base model to define common methods for all AIs.
- abstract check_is_ready()
Check if the Model is ready for inference.
- ensure_model_memory_usage_within_limit(max_ram: Optional[str] = None)
Ensure that a model is not exceeding allowed max_ram.
- Parameters
max_ram (str) – Specify maximum memory usage condition to save model.
- abstract property entrypoint_methods: dict
Create a dict of methods for this class that are exposed via API.
- abstract static has_compatible_interface(other)
Validate that an instance of an AI implements the same interface defined by this AI class.
- Parameters
other – An instance of an AI to compare with.
- static load_model(pickle_path: str, max_ram: Optional[str] = None)
Load a previously saved instance of the model.
- Parameters
pickle_path (str) – Path to the pickled model.
- Raises
FileNotFoundError – If the path is invalid.
OSError – When the data is corrupted or invalid and cannot be loaded.
TypeError – When the loaded pickle isn’t recognized as a Konfuzio AI model.
- Returns
Extraction AI model.
- property name
Model class name.
- name_lower()
Convert class name to machine-readable name.
- abstract property pkl_file_path
Generate a path for a resulting pickle file.
- abstract property pkl_name
Generate a unique extension-less name for a resulting pickle file.
- reduce_model_weight()
Remove all non-strictly necessary parameters before saving.
- save(output_dir: Optional[str] = None, include_konfuzio=True, reduce_weight=True, compression: str = 'lz4', keep_documents=False, max_ram=None)
Save the label model as a compressed pickle object to the release directory.
Saving is done by: getting the serialized pickle object (via cloudpickle), “optimizing” the serialized object with the built-in pickletools.optimize function (see: https://docs.python.org/3/library/pickletools.html), saving the optimized serialized object.
We then compress the pickle file using shutil.copyfileobject which writes in chunks to avoid loading the entire pickle file in memory.
Finally, we delete the cloudpickle file and are left with the compressed pickle file which has a .pkl.lz4 or .pkl.bz2 extension.
For more info on pickle serialization and including dependencies read https://github.com/cloudpipe/cloudpickle#overriding-pickles-serialization-mechanism-for-importable-constructs
- Parameters
output_dir – Folder to save AI model in. If None, the default Project folder is used.
include_konfuzio – Enables pickle serialization as a value, not as a reference.
reduce_weight – Remove all non-strictly necessary parameters before saving.
compression – Compression algorithm to use. Default is lz4, bz2 is also supported.
max_ram – Specify maximum memory usage condition to save model.
- Raises
MemoryError – When the size of the model in memory is greater than the maximum value.
- Returns
Path of the saved model file.
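The save flow described above (serialize, optimize, then compress in chunks) can be sketched with stdlib stand-ins; plain pickle replaces cloudpickle and bz2 replaces lz4, and the function name is illustrative:

```python
import bz2
import os
import pickle
import pickletools
import shutil


def save_compressed_sketch(model, output_path: str) -> str:
    """Serialize, optimize, then compress in chunks, as described above."""
    # 1. Get the serialized pickle object (the SDK uses cloudpickle).
    raw = pickle.dumps(model)
    # 2. "Optimize" the serialized object with pickletools.optimize.
    optimized = pickletools.optimize(raw)
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(optimized)
    # 3. Compress with shutil.copyfileobj, which streams in chunks to avoid
    #    loading the entire pickle file into memory (bz2 stands in for lz4).
    with open(tmp_path, "rb") as src, bz2.open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # 4. Delete the intermediate pickle; only the compressed file remains.
    os.remove(tmp_path)
    return output_path
```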
- save_bento(build=True, output_dir=None) Union[None, tuple]
Save AI as a BentoML model in the local store.
- Parameters
build – Bundle the model into a BentoML service and store it in the local store.
output_dir – If present, a .bento archive will also be saved to this directory.
- Returns
None if build=False, otherwise a tuple of (saved_bento, archive_path).
- abstract property temp_pkl_file_path
Generate a path for temporary pickle file.
AbstractExtractionAI¶
- class konfuzio_sdk.trainer.information_extraction.AbstractExtractionAI(category: konfuzio_sdk.data.Category, *args, **kwargs)
Parent class for all Extraction AIs, to extract information from unstructured human-readable text.
- static add_extractions_as_annotations(extractions: pandas.core.frame.DataFrame, document: konfuzio_sdk.data.Document, label: konfuzio_sdk.data.Label, label_set: konfuzio_sdk.data.LabelSet, annotation_set: konfuzio_sdk.data.AnnotationSet) None
Add the extraction of a model to the document.
- property bento_metadata: dict
Metadata to include into the bento-saved instance of a model.
- build_bento(bento_model)
Build BentoML service for the model.
- check_is_ready()
Check if the ExtractionAI is ready for inference.
The model is assumed to be ready for extraction if a Category is set.
- Raises
AttributeError – When no Category is specified.
- property entrypoint_methods: dict
Methods that will be exposed in a bento-saved instance of a model.
- evaluate()
Use as a placeholder function.
- extract(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Perform preliminary extraction steps.
- extraction_result_to_document(document: konfuzio_sdk.data.Document, extraction_result: dict) konfuzio_sdk.data.Document
Return a virtual Document annotated with AI Model output.
- fit()
Use as a placeholder function because the Abstract AI does not train a classifier.
- static flush_buffer(buffer: List[pandas.core.series.Series], doc_text: str) Dict
Merge a buffer of entities into a dictionary (which will eventually be turned into a DataFrame).
A buffer is a list of pandas.Series objects.
- static has_compatible_interface(other) bool
Validate that an instance of an Extraction AI implements the same interface as AbstractExtractionAI.
An Extraction AI should implement methods with the same signature as:
- AbstractExtractionAI.__init__
- AbstractExtractionAI.fit
- AbstractExtractionAI.extract
- AbstractExtractionAI.check_is_ready
- Parameters
other – An instance of an Extraction AI to compare with.
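Such an interface check can be sketched with the stdlib inspect module (the function name, the compared method names, and the comparison logic are illustrative, not the SDK's actual implementation):

```python
import inspect


def has_compatible_interface_sketch(other, reference_cls, method_names=("fit", "extract")) -> bool:
    """Check that other's class defines the reference methods with matching signatures."""
    other_cls = type(other)
    for name in method_names:
        if not hasattr(other_cls, name):
            return False
        reference = inspect.signature(getattr(reference_cls, name))
        candidate = inspect.signature(getattr(other_cls, name))
        # compare parameter names in order, e.g. ('self', 'document')
        if list(reference.parameters) != list(candidate.parameters):
            return False
    return True
```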
- static is_valid_horizontal_merge(row: pandas.core.series.Series, buffer: List[pandas.core.series.Series], doc_text: str, max_offset_distance: int = 5) bool
Verify if the merging that we are trying to do is valid.
- A merging is valid only if:
All spans have the same predicted Label
Confidence of predicted Label is above the Label threshold
All spans are on the same line
No extraneous characters in between spans
A maximum of 5 spaces in between spans
The Label type is not one of the following: ‘Number’, ‘Positive Number’, ‘Percentage’, ‘Date’ OR the resulting merge creates a span normalizable to the same type
- Parameters
row – Row candidate to be merged to what is already in the buffer.
buffer – Previous information.
doc_text – Text of the document.
max_offset_distance – Maximum distance between two entities that can be merged.
- Returns
Whether the merge is valid.
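The validity rules can be sketched over plain dict-based spans (the dict keys are illustrative; the SDK operates on pandas rows, and the normalizability rule for typed Labels is omitted here):

```python
def is_valid_merge_sketch(row: dict, buffer: list, doc_text: str, max_offset_distance: int = 5) -> bool:
    """Apply the horizontal-merge rules to dict-based spans (keys are illustrative)."""
    if not buffer:
        return True
    previous = buffer[-1]
    if row["label"] != previous["label"]:
        return False  # all spans must share the same predicted Label
    if row["confidence"] < row["threshold"]:
        return False  # confidence must be above the Label threshold
    gap = doc_text[previous["end_offset"]:row["start_offset"]]
    if "\n" in gap:
        return False  # all spans must be on the same line
    if gap.strip():
        return False  # no extraneous characters in between spans
    if len(gap) > max_offset_distance:
        return False  # a maximum of 5 spaces in between spans
    return True
```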
- static load_model(pickle_path: str, max_ram: Optional[str] = None)
Load the model and check if it has the interface compatible with the class.
- Parameters
pickle_path (str) – Path to the pickled model.
- Raises
FileNotFoundError – If the path is invalid.
OSError – When the data is corrupted or invalid and cannot be loaded.
TypeError – When the loaded pickle isn’t recognized as a Konfuzio AI model.
- Returns
Extraction AI model.
- classmethod merge_horizontal(res_dict: Dict, doc_text: str) Dict
Merge contiguous spans with same predicted label.
See more details at https://dev.konfuzio.com/sdk/explanations.html#horizontal-merge
- property pkl_file_path: str
Generate a path for a resulting pickle file.
- property pkl_name: str
Generate a name for the pickle file.
- property project
Get RFExtractionAI Project.
- property temp_pkl_file_path: str
Generate a path for temporary pickle file.
Random Forest Extraction AI¶
- class konfuzio_sdk.trainer.information_extraction.RFExtractionAI(n_nearest: int = 2, first_word: bool = True, n_estimators: int = 100, max_depth: int = 100, no_label_limit: Optional[Union[int, float]] = None, n_nearest_across_lines: bool = False, use_separate_labels: bool = True, category: Optional[konfuzio_sdk.data.Category] = None, tokenizer=None, *args, **kwargs)
Encode visual and textual features to extract text regions.
Fit an extraction pipeline to extract linked Annotations.
Both the Label and Label Set classifiers use a RandomForestClassifier from scikit-learn to run in a low-memory and single-CPU environment. A random forest classifier is a group of decision tree classifiers, see: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
The parameters of this class allow you to select the Tokenizer, configure the Label and Label Set classifiers, and select the type of features used by the Label and Label Set classifiers.
They are divided into:
- tokenizer selection
- parametrization of the Label classifier
- parametrization of the Label Set classifier
- features for the Label classifier
- features for the Label Set classifier
By default, the text of the Documents is split into smaller chunks based on whitespaces (‘WhitespaceTokenizer’), which means that every word present in the text is shown to the AI. Alternatively, the splitting of the text into smaller chunks can be done based on regexes learned from the Spans of the Annotations of the Category (‘tokenizer_regex’), or with a model for the German language from the spaCy library (‘tokenizer_spacy’). Another option is to use a pre-defined list of regex-based tokenizers (‘tokenizer_regex_list’) and, on top of that list, to create tokenizers that match what those miss (‘tokenizer_regex_combination’).
Some parameters of the scikit-learn RandomForestClassifier used for the Label and/or Label Set classifier can be set directly in Konfuzio Server (‘label_n_estimators’, ‘label_max_depth’, ‘label_class_weight’, ‘label_random_state’, ‘label_set_n_estimators’, ‘label_set_max_depth’).
Features are measurable pieces of data of the Annotation. By default, a combination of features is used that includes features built from the text of the Annotation (‘string_features’), features built from the position of the Annotation in the Document (‘spatial_features’), and features from the Spans created by a WhitespaceTokenizer to the left or right of the Annotation (‘n_nearest_left’, ‘n_nearest_right’, ‘n_nearest_across_lines’). It is possible to exclude any of them (‘spatial_features’, ‘string_features’, ‘n_nearest_left’, ‘n_nearest_right’) or to specify the number of Spans created by a WhitespaceTokenizer to consider (‘n_nearest_left’, ‘n_nearest_right’).
While extracting, the Label Set classifier takes the predictions from the Label classifier as input. The Label Set classifier groups them into Annotation sets.
- check_is_ready()
Check if the ExtractionAI is ready for inference.
The model is assumed to be ready if a Tokenizer and a Category are set and the classifiers have been set and trained.
- Raises
AttributeError – When no Tokenizer is specified.
AttributeError – When no Category is specified.
AttributeError – When no Label Classifier has been provided.
- evaluate_clf(use_training_docs: bool = False) konfuzio_sdk.evaluate.ExtractionEvaluation
Evaluate the Label classifier.
- evaluate_full(strict: bool = True, use_training_docs: bool = False, use_view_annotations: bool = True) konfuzio_sdk.evaluate.ExtractionEvaluation
Evaluate the full pipeline on the pipeline’s Test Documents.
- Parameters
strict – Evaluate on a Character exact level without any postprocessing.
use_training_docs – Bool for whether to evaluate on the training documents instead of testing documents.
- Returns
Evaluation object.
- evaluate_label_set_clf(use_training_docs: bool = False) konfuzio_sdk.evaluate.ExtractionEvaluation
Evaluate the LabelSet classifier.
- evaluate_tokenizer(use_training_docs: bool = False) konfuzio_sdk.evaluate.ExtractionEvaluation
Evaluate the tokenizer.
- extract(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Infer information from a given Document.
- Parameters
document – Document object
- Returns
Document with predicted labels
- Raises
AttributeError – When no Tokenizer is specified.
NotFittedError – When the classifier is not fitted.
- extract_from_df(df: pandas.core.frame.DataFrame, inference_document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Predict Labels from features.
- feature_function(documents: List[konfuzio_sdk.data.Document], no_label_limit: Optional[Union[int, float]] = None, retokenize: Optional[bool] = None, require_revised_annotations: bool = False) Tuple[List[pandas.core.frame.DataFrame], list]
Calculate features per Span of Annotations.
- Parameters
documents – List of Documents to extract features from.
no_label_limit – Int or Float to limit number of new Annotations to create during tokenization.
retokenize – Bool for whether to recreate Annotations from scratch or use already existing Annotations.
require_revised_annotations – Only allow calculation of features if no unrevised Annotation present.
- Returns
Dataframe of features and list of feature names.
- features(document: konfuzio_sdk.data.Document)
Calculate features using the best working default values that can be overwritten with self values.
- filter_dataframe(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame
Filter dataframe rows according to their confidence value.
Rows (extractions) whose confidence value is below the threshold defined for the Label are removed.
- Parameters
df – Dataframe with extraction results
- Returns
Filtered dataframe
- filter_low_confidence_extractions(result: Dict) Dict
Remove extractions with confidence below the threshold defined for the respective label.
The input is a dictionary where the values can be:
- a dataframe
- a dictionary where the values are dataframes
- a list of dictionaries where the values are dataframes
- Parameters
result – Extraction results
- Returns
Filtered dictionary.
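The recursive filtering over the nested result structure can be sketched with plain lists of row dicts standing in for DataFrames (the function name and dict keys are illustrative):

```python
def filter_low_confidence_sketch(result, thresholds: dict):
    """Recursively drop extractions whose confidence is below the Label threshold."""
    if isinstance(result, list):
        if result and isinstance(result[0], dict) and "confidence" in result[0]:
            # a list of row dicts stands in for a DataFrame of extractions
            return [r for r in result if r["confidence"] >= thresholds.get(r["label"], 0.0)]
        return [filter_low_confidence_sketch(item, thresholds) for item in result]
    if isinstance(result, dict):
        return {k: filter_low_confidence_sketch(v, thresholds) for k, v in result.items()}
    return result
```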
- fit() sklearn.ensemble._forest.RandomForestClassifier
Given the training data and the feature list, return the trained classification model.
- label_train_document(virtual_document: konfuzio_sdk.data.Document, original_document: konfuzio_sdk.data.Document)
Assign Labels to Annotations in newly tokenized virtual training Document.
- merge_vertical(document: konfuzio_sdk.data.Document, only_multiline_labels=True)
Merge Annotations with the same Label.
See more details at https://dev.konfuzio.com/sdk/explanations.html#vertical-merge
- Parameters
document – Document whose Annotations should be merged vertically
only_multiline_labels – Only merge if a multiline Label Annotation is in the Category Training set
- merge_vertical_like(document: konfuzio_sdk.data.Document, template_document: konfuzio_sdk.data.Document)
Merge Annotations the same way as in another copy of the same Document.
All single-Span Annotations in the current Document (self) are matched with corresponding multi-line Spans in the given Document and are merged in the same way. The Label of the new multi-line Annotations is taken to be the most common Label among the original single-line Annotations that are being merged.
- Parameters
document – Document with multi-line Annotations
- reduce_model_weight()
Remove all non-strictly necessary parameters before saving.
- remove_empty_dataframes_from_extraction(result: Dict) Dict
Remove empty dataframes from the result of an Extraction AI.
The input is a dictionary where the values can be:
- a dataframe
- a dictionary where the values are dataframes
- a list of dictionaries where the values are dataframes
- property requires_segmentation: bool
Return True if the Extraction AI requires detectron segmentation results to process Documents.
- separate_labels(res_dict: Dict) Dict
Undo the renaming of the labels.
In this way we have the output of the extraction in the correct format.
Categorization AI¶
Implements a Categorization Model.
Abstract Categorization AI¶
- class konfuzio_sdk.trainer.document_categorization.AbstractCategorizationAI(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)
Abstract definition of a CategorizationAI.
- categorize(document: konfuzio_sdk.data.Document, recategorize: bool = False, inplace: bool = False) konfuzio_sdk.data.Document
Run categorization on a Document.
- Parameters
document – Input Document
recategorize – If the input Document is already categorized, the already present Category is used unless this flag is True
inplace – Option to categorize the provided Document in place, which would assign the Category attribute
- Returns
Copy of the input Document with added CategoryAnnotation information
- check_is_ready()
Check if Categorization AI instance is ready for inference.
It is assumed that the model is ready when there is at least one Category passed as the input.
- Raises
AttributeError – When no Categories are passed into the model.
- property entrypoint_methods: dict
Methods that will be exposed in a bento-saved instance of a model.
- evaluate(use_training_docs: bool = False) konfuzio_sdk.evaluate.CategorizationEvaluation
Evaluate the full Categorization pipeline on the pipeline’s Test Documents.
- Parameters
use_training_docs – Bool for whether to evaluate on the Training Documents instead of Test Documents.
- Returns
Evaluation object.
- abstract fit() None
Train the Categorization AI.
- static has_compatible_interface(other)
Validate that an instance of a Categorization AI implements the same interface as AbstractCategorizationAI.
A Categorization AI should implement methods with the same signature as:
- AbstractCategorizationAI.__init__
- AbstractCategorizationAI.fit
- AbstractCategorizationAI._categorize_page
- AbstractCategorizationAI.check_is_ready
- Parameters
other – An instance of a Categorization AI to compare with.
- static load_model(pickle_path: str, device='cpu')
Load the model and check if it has the interface compatible with the class.
- Parameters
pickle_path (str) – Path to the pickled model.
- Raises
FileNotFoundError – If the path is invalid.
OSError – When the data is corrupted or invalid and cannot be loaded.
TypeError – When the loaded pickle isn’t recognized as a Konfuzio AI model.
- Returns
Categorization AI model.
- name_lower()
Convert class name to machine-readable name.
- property pkl_file_path: str
Generate a path for a resulting pickle file.
- Returns
A string with the path.
- property pkl_name
Generate a unique extension-less name for a resulting pickle file.
- abstract save(output_dir: str, include_konfuzio=True)
Save the model to disk.
- property temp_pkl_file_path: str
Generate a path for temporary pickle file.
- Returns
A string with the path.
Name-based Categorization AI¶
- class konfuzio_sdk.trainer.document_categorization.NameBasedCategorizationAI(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)
A simple, non-trainable model that predicts a Category for a given Document based on a predefined rule.
It checks whether the name of the Category is present in the input Document (case insensitive; also see Category.fallback_name). This can be an effective fallback logic to categorize Documents when no Categorization AI is available.
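The rule can be sketched as a case-insensitive substring check (the function name, the text, and the category names are illustrative, and fallback names are omitted):

```python
from typing import List, Optional


def categorize_by_name_sketch(document_text: str, category_names: List[str]) -> Optional[str]:
    """Return the first Category whose name appears in the Document text, case-insensitively."""
    text = document_text.lower()
    for name in category_names:
        if name.lower() in text:
            return name
    return None  # no Category name found in the text
```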
- fit() None
Use as a placeholder function because there’s no classifier to be trained.
- save(output_dir: str, include_konfuzio=True)
Use as a placeholder function.
Model-based Categorization AI¶
- class konfuzio_sdk.trainer.document_categorization.CategorizationAI(categories: List[konfuzio_sdk.data.Category], use_cuda: bool = False, *args, **kwargs)
A trainable AI that predicts a Category for each Page of a given Document.
- build_document_classifier_iterator(documents, transforms, use_image: bool, use_text: bool, shuffle: bool, batch_size: int, max_len: int, device='cpu') torch.utils.data.dataloader.DataLoader
Prepare the data necessary for the document classifier, and build the iterators for the data list.
- For each document, we split it into pages, and from each page we take:
the path to an image of the page
the tokenized and numericalized text on the page
the label (category) of the page
the id of the document
the page number
- build_preprocessing_pipeline(use_image: bool, image_augmentation=None, image_preprocessing=None) None
Set up the pre-processing and data augmentation when necessary.
- build_template_category_vocab() konfuzio_sdk.tokenizer.base.Vocab
Build a vocabulary over the Categories.
- build_text_vocab(min_freq: int = 1, max_size: Optional[int] = None) konfuzio_sdk.tokenizer.base.Vocab
Build a vocabulary over the document text.
- property compressed_file_path: str
Generate a path for a resulting compressed file in .lz4 format.
- Returns
A string with the path.
- fit(max_len: Optional[bool] = None, batch_size: int = 1, **kwargs) Dict[str, List[float]]
Fit the CategorizationAI classifier.
- reduce_model_weight()
Reduce the size of the model by running lose_weight on the tokenizer.
- save(output_dir: Optional[str] = None, reduce_weight: bool = True, **kwargs) str
Save only the necessary parts of the model for extraction/inference.
Saves:
- tokenizer (needed to ensure we tokenize inference examples in the same way that they are trained)
- transforms (to ensure we transform/pre-process images in the same way as in training)
- vocabs (to ensure the tokens/labels are mapped to the same integers as in training)
- configs (to ensure we load the same models used in training)
- state_dicts (the classifier parameters achieved through training)
Note: “path” is a deprecated parameter, “output_dir” is used for the sake of uniformity across all AIs.
- Parameters
output_dir (str) – A path to save the model to.
reduce_weight (bool) – Reduces the weight of a model by removing Documents and reducing weight of a Tokenizer.
- property temp_pt_file_path: str
Generate a path for a temporary model file in .pt format.
- Returns
A string with the path.
Build a Model-based Categorization AI¶
- konfuzio_sdk.trainer.document_categorization.build_categorization_ai_pipeline(categories: List[konfuzio_sdk.data.Category], documents: List[konfuzio_sdk.data.Document], test_documents: List[konfuzio_sdk.data.Document], tokenizer: Optional[konfuzio_sdk.tokenizer.base.AbstractTokenizer] = None, image_model_name: Optional[konfuzio_sdk.trainer.document_categorization.ImageModel] = None, text_model_name: Optional[konfuzio_sdk.trainer.document_categorization.TextModel] = TextModel.NBOW, **kwargs) konfuzio_sdk.trainer.document_categorization.CategorizationAI
Build a Categorization AI neural network by choosing an ImageModel and a TextModel.
See an in-depth tutorial at https://dev.konfuzio.com/sdk/tutorials/data_validation/index.html
NBOW Model¶
- class konfuzio_sdk.trainer.document_categorization.NBOW(input_dim: int, emb_dim: int = 64, dropout_rate: float = 0.0, **kwargs)
The neural bag-of-words (NBOW) model is the simplest of models: it passes each token through an embedding layer.
As shown in the fastText paper (https://arxiv.org/abs/1607.01759) this model is still able to achieve comparable performance to some deep learning models whilst being considerably faster.
One downside of this model is that tokens are embedded without regard to the surrounding context in which they appear; e.g. the embeddings for “May” in the two sentences “May I speak to you?” and “I am leaving on the 1st of May” are identical, even though the word has different semantics in each.
- Parameters
emb_dim – The dimensions of the embedding vector.
dropout_rate – The amount of dropout applied to the embedding vectors.
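A minimal pure-Python sketch of the NBOW idea: look up one vector per token and average them into a single document representation (real models use torch with learned weights; the fixed random vectors here are a stand-in for a trained embedding layer):

```python
import random
from typing import List


def nbow_sketch(tokens: List[str], emb_dim: int = 4, seed: int = 0) -> List[float]:
    """Embed every token independently, then average into one document vector."""
    rng = random.Random(seed)
    table = {}
    for token in tokens:
        # each distinct token gets one fixed vector, regardless of context
        if token not in table:
            table[token] = [rng.uniform(-1.0, 1.0) for _ in range(emb_dim)]
    vectors = [table[token] for token in tokens]
    # element-wise mean over all token vectors
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(emb_dim)]
```

Because each token's vector ignores context, repeating a token never changes the averaged representation, which illustrates the context-insensitivity noted above.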
NBOW Self Attention Model¶
- class konfuzio_sdk.trainer.document_categorization.NBOWSelfAttention(input_dim: int, emb_dim: int = 64, n_heads: int = 8, dropout_rate: float = 0.0, **kwargs)
This is an NBOW model with a multi-headed self-attention layer, which is added after the embedding layer.
See details at https://arxiv.org/abs/1706.03762. The self-attention layer effectively contextualizes the output as now each hidden state is calculated from the embedding vector of a token and the embedding vector of all other tokens within the sequence.
- Parameters
emb_dim – The dimensions of the embedding vector.
dropout_rate – The amount of dropout applied to the embedding vectors.
n_heads – The number of attention heads to use in the multi-headed self-attention layer. Note that n_heads must be a factor of emb_dim, i.e. emb_dim % n_heads == 0.
LSTM Model¶
- class konfuzio_sdk.trainer.document_categorization.LSTM(input_dim: int, emb_dim: int = 64, hid_dim: int = 256, n_layers: int = 2, bidirectional: bool = True, dropout_rate: float = 0.0, **kwargs)
The LSTM (long short-term memory) is a variant of an RNN (recurrent neural network).
It feeds the input tokens through an embedding layer and then processes them sequentially with the LSTM, outputting a hidden state for each token. If the LSTM is bidirectional then it trains a forward and backward LSTM per layer and concatenates the forward and backward hidden states for each token.
- Parameters
emb_dim – The dimensions of the embedding vector.
hid_dim – The dimensions of the hidden states.
n_layers – How many LSTM layers to use.
bidirectional – If the LSTM should be bidirectional.
dropout_rate – The amount of dropout applied to the embedding vectors and between LSTM layers if n_layers > 1.
BERT Model¶
- class konfuzio_sdk.trainer.document_categorization.BERT(name: str = 'bert-base-german-cased', freeze: bool = False, **kwargs)
Wraps around pre-trained BERT-type models from the HuggingFace library.
BERT (bidirectional encoder representations from Transformers) is a family of large Transformer models. The available BERT variants are all pre-trained models provided by the transformers library. It is usually infeasible to train a BERT model from scratch due to the significant amount of computation required. However, the pre-trained models can be easily fine-tuned on desired data.
- The BERT variants, i.e. name arguments, that are covered by internal tests are:
bert-base-german-cased
bert-base-german-dbmdz-cased
bert-base-german-dbmdz-uncased
distilbert-base-german-cased
In theory, all variants beginning with bert-base-* and distilbert-* should work out of the box. Other BERT variants come with no guarantees.
- Parameters
name – The name of the pre-trained BERT variant to use.
freeze – Should the BERT model be frozen, i.e. the pre-trained parameters are not updated.
- get_max_length()
Get the maximum length of a sequence that can be passed to the BERT module.
VGG Model¶
- class konfuzio_sdk.trainer.document_categorization.VGG(name: str = 'vgg11', pretrained: bool = True, freeze: bool = True, **kwargs)
The VGG family of models are image classification models designed for the ImageNet dataset.
They are usually used as a baseline in image classification tasks; however, they are considerably larger, in terms of the number of parameters, than modern architectures.
Available variants are: vgg11, vgg13, vgg16, vgg19, vgg11_bn, vgg13_bn, vgg16_bn, vgg19_bn. The number generally indicates the number of layers in the model; higher does not always mean better. The _bn suffix means that the VGG model uses Batch Normalization layers, which generally leads to better results.
The pre-trained weights are taken from the torchvision library (https://github.com/pytorch/vision) and come from a model trained as an image classifier on ImageNet. Ideally, this means the images should be 3-channel color images that are at least 224x224 pixels and should be normalized.
- Parameters
name – The name of the VGG variant to use
pretrained – If pre-trained weights for the VGG variant should be used
freeze – If the parameters of the VGG variant should be frozen
EfficientNet Model¶
- class konfuzio_sdk.trainer.document_categorization.EfficientNet(name: str = 'efficientnet_b0', pretrained: bool = True, freeze: bool = True, **kwargs)
EfficientNet is a family of convolutional neural network based models that are designed to be more efficient.
The efficiency comes in terms of the number of parameters and FLOPS compared to previous computer vision models, whilst maintaining equivalent image classification performance.
Available variants are: efficientnet_b0, efficientnet_b1, …, efficientnet_b7, with b0 having the fewest parameters and b7 having the most.
The pre-trained weights are taken from the timm library and have been trained on ImageNet, thus the same tips, i.e. normalization, that apply to the VGG models also apply here.
- Parameters
name – The name of the EfficientNet variant to use
pretrained – If pre-trained weights for the EfficientNet variant should be used
freeze – If the parameters of the EfficientNet variant should be frozen
- get_n_features() int
Calculate number of output features based on given model.
Multimodal Concatenation¶
- class konfuzio_sdk.trainer.document_categorization.MultimodalConcatenate(n_image_features: int, n_text_features: int, hid_dim: int = 256, output_dim: Optional[int] = None, **kwargs)
Defines how the image and text features are combined in order to yield a categorization prediction.
File Splitting AI¶
Process Documents that consist of several files and propose splitting them into the Sub-Documents accordingly.
Abstract File Splitting Model¶
- class konfuzio_sdk.trainer.file_splitting.AbstractFileSplittingModel(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)
Abstract class for the File Splitting model.
- property entrypoint_methods: dict
Methods that will be exposed in a bento-saved instance of a model.
- abstract fit(*args, **kwargs)
Fit the custom model on the training Documents.
- static has_compatible_interface(other) bool
Validate that an instance of a File Splitting Model implements the same interface as AbstractFileSplittingModel.
A File Splitting Model should implement methods with the same signature as:
- AbstractFileSplittingModel.__init__
- AbstractFileSplittingModel.predict
- AbstractFileSplittingModel.fit
- AbstractFileSplittingModel.check_is_ready
- Parameters
other – An instance of a File Splitting Model to compare with.
- static load_model(pickle_path: str, max_ram: Optional[str] = None)
Load the model and check if it has the interface compatible with the class.
- Parameters
pickle_path (str) – Path to the pickled model.
- Raises
FileNotFoundError – If the path is invalid.
OSError – When the data is corrupted or invalid and cannot be loaded.
TypeError – When the loaded pickle isn’t recognized as a Konfuzio AI model.
- Returns
File Splitting AI model.
- property pkl_file_path: str
Generate a path for a resulting pickle file.
- Returns
A string with the path.
- property pkl_name
Generate a unique extension-less name for a resulting pickle file.
- abstract predict(page: konfuzio_sdk.data.Page) konfuzio_sdk.data.Page
Take a Page as an input and reassign is_first_page attribute’s value if necessary.
- Parameters
page (Page) – A Page to label first or non-first.
- Returns
Page.
- property temp_pkl_file_path: str
Generate a path for temporary pickle file.
- Returns
A string with the path.
Context Aware File Splitting Model¶
- class konfuzio_sdk.trainer.file_splitting.ContextAwareFileSplittingModel(categories: List[konfuzio_sdk.data.Category], tokenizer, *args, **kwargs)
A File Splitting Model that uses a context-aware logic.
Context-aware logic implies a rule-based approach that looks for strings common to the first Pages of all of a Category’s Documents.
- check_is_ready()
Check if the File Splitting Model is ready for inference.
- Raises
AttributeError – When no Tokenizer or no Categories were passed.
ValueError – When no Categories have _exclusive_first_page_strings.
- fit(allow_empty_categories: bool = False, *args, **kwargs)
Gather the strings exclusive for first Pages in a given stream of Documents.
Exclusive means that each of these strings appear only on first Pages of Documents within a Category.
- Parameters
allow_empty_categories (bool) – To allow returning an empty list for a Category if no exclusive first-page strings were found during fitting (which means prediction would be impossible for that Category).
- Raises
ValueError – When allow_empty_categories is False and no exclusive first-page strings were found for at least one Category.
- predict(page: konfuzio_sdk.data.Page) konfuzio_sdk.data.Page
Predict a Page as first or non-first.
- Parameters
page (Page) – A Page to receive first or non-first label.
- Returns
A Page with a newly predicted is_first_page attribute.
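The fit/predict logic can be sketched with set operations, representing each Document as a list of page strings and reducing tokenization to whitespace splitting (function names and data shapes are illustrative, not the SDK's):

```python
from typing import List, Set


def exclusive_first_page_strings_sketch(documents: List[List[str]]) -> Set[str]:
    """Strings shared by all first pages of a Category's Documents but absent from later pages."""
    on_all_first_pages = set.intersection(*(set(doc[0].split()) for doc in documents))
    on_later_pages = set().union(*(set(page.split()) for doc in documents for page in doc[1:]))
    return on_all_first_pages - on_later_pages


def predict_first_page_sketch(page_text: str, exclusive_strings: Set[str]) -> bool:
    """A Page is predicted as first if it contains any exclusive first-page string."""
    return bool(set(page_text.split()) & exclusive_strings)
```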
Multimodal File Splitting Model¶
- class konfuzio_sdk.trainer.file_splitting.MultimodalFileSplittingModel(categories: List[konfuzio_sdk.data.Category], text_processing_model: str = 'nlpaueb/legal-bert-small-uncased', scale: int = 2, *args, **kwargs)
Split a multi-Document file into a list of shorter Documents based on model’s prediction.
We use an approach suggested by Guha et al. (2022) that accepts separate visual and textual inputs, processes them independently via the VGG19 architecture and the LegalBERT model (essentially a BERT-type architecture trained on domain-specific data), and passes the resulting outputs together to a Multi-Layered Perceptron.
Guha, A., Alahmadi, A., Samanta, D., Khan, M. Z., & Alahmadi, A. H. (2022). A Multi-Modal Approach to Digital Document Stream Segmentation for Title Insurance Domain. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9684474
- check_is_ready()
Check if Multimodal File Splitting Model instance is ready for inference.
Checks that the instance of the Model has at least one Category passed as input and that it is fitted to run prediction.
- Raises
AttributeError – When no Categories are passed to the model.
AttributeError – When a model is not fitted to run a prediction.
- fit(epochs: int = 10, use_gpu: bool = False, train_batch_size=8, *args, **kwargs)
Process the train and test data, initialize and fit the model.
- Parameters
epochs (int) – A number of epochs to train a model on.
use_gpu (bool) – Run training on GPU if available.
- predict(page: konfuzio_sdk.data.Page, use_gpu: bool = False) konfuzio_sdk.data.Page
Run prediction with the trained model.
- Parameters
page (Page) – A Page to be predicted as first or non-first.
use_gpu (bool) – Run prediction on GPU if available.
- Returns
A Page with possible changes in is_first_page attribute value.
- reduce_model_weight()
Remove all non-strictly necessary parameters before saving.
- remove_dependencies()
Remove dependencies before saving.
This is needed for proper saving of the model in lz4 compressed format – if the dependencies are not removed, the resulting pickle will be impossible to load.
- restore_dependencies()
Restore removed dependencies after loading.
This is needed for proper functioning of a loaded model because we have previously removed these dependencies upon saving the model.
Textual File Splitting Model¶
- class konfuzio_sdk.trainer.file_splitting.TextualFileSplittingModel(categories: List[konfuzio_sdk.data.Category], *args, **kwargs)
This model operates by taking as input a multi-Document file and utilizing the DistilBERT model to make predictions regarding the segmentation of this document. Specifically, it aims to identify boundaries within the text where one document ends and another begins, effectively splitting the input into a list of shorter documents.
DistilBERT serves as the backbone of this model. DistilBERT offers a computationally efficient alternative to BERT, achieved through knowledge distillation while preserving much of BERT’s language understanding capabilities.
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter https://arxiv.org/abs/1910.01108.
- check_is_ready()
Check if Textual File Splitting Model instance is ready for inference.
This method checks that the model instance has at least one Category passed as input and that it has been fitted, so that it can run predictions.
- Raises
AttributeError – When no Categories are passed to the model.
AttributeError – When a model is not fitted to run a prediction.
- fit(epochs: int = 5, eval_batch_size: int = 8, train_batch_size: int = 8, device: str = 'cpu', *args, **kwargs)
Process the train and test data, initialize and fit the model.
- Parameters
epochs (int) – The number of epochs to train the model for.
eval_batch_size (int) – A batch size for evaluation.
train_batch_size (int) – A batch size for training.
device (str) – A device to run the prediction on. Possible values are ‘mps’, ‘cuda’, ‘cpu’.
- Returns
A dictionary with evaluation results.
- predict(page: konfuzio_sdk.data.Page, previous_page: Optional[konfuzio_sdk.data.Page] = None, device: str = 'cpu', *args, **kwargs) konfuzio_sdk.data.Page
Run prediction with the trained model.
- Parameters
page (Page) – A Page to be predicted as first or non-first.
previous_page (Page) – The previous Page, which gives the model more context.
device (str) – A device to run the prediction on. Possible values are ‘mps’, ‘cuda’, ‘cpu’.
- Returns
A Page with possible changes in is_first_page attribute value.
- reduce_model_weight()
Remove all non-strictly necessary parameters before saving.
- static remove_dependencies()
Remove dependencies before saving.
This is needed for proper saving of the model in lz4 compressed format – if the dependencies are not removed, the resulting pickle will be impossible to load.
- static restore_dependencies()
Restore removed dependencies after loading.
This is needed for proper functioning of a loaded model because we have previously removed these dependencies upon saving the model.
Splitting AI¶
- class konfuzio_sdk.trainer.file_splitting.SplittingAI(model)
Split a given Document and return a list of resulting shorter Documents.
- evaluate_full(use_training_docs: bool = False, zero_division='warn') konfuzio_sdk.evaluate.FileSplittingEvaluation
Evaluate the Splitting AI’s performance.
- Parameters
use_training_docs (bool) – If enabled, runs evaluation on the training data to define its quality; if disabled, runs evaluation on the test data.
zero_division – Defines how to handle situations when precision, recall or F1 measure calculations result in zero division. Possible values: ‘warn’ – log a warning and assign the metric a value of 0; 0 – assign the metric a value of 0; ‘error’ – raise a ZeroDivisionError; None – assign None to the metric.
- Returns
Evaluation information for the model.
- propose_split_documents(document: konfuzio_sdk.data.Document, return_pages: bool = False, inplace: bool = False, split_on_blank_pages: bool = False, device: str = 'cpu') List[konfuzio_sdk.data.Document]
Propose a set of resulting Documents from a single Document.
- Parameters
document (Document) – An input Document to be split.
inplace (bool) – Whether changes are applied to the input Document, changing it, or to a deepcopy of it.
return_pages (bool) – A flag to enable returning a copy of an old Document with Pages marked .is_first_page on splitting points instead of a set of Sub-Documents.
split_on_blank_pages (bool) – A flag to enable splitting on blank Pages.
- Returns
A list of suggested new Sub-Documents built from the original Document, or a list with a Document whose Pages are marked .is_first_page on splitting points.
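The splitting step of propose_split_documents can be pictured as grouping consecutive Pages, opening a new Sub-Document at every Page flagged .is_first_page. A standalone sketch, with plain lists standing in for Document and Page objects:

```python
def split_by_first_page(pages, is_first_page):
    """Group consecutive pages into sub-documents, starting a new
    sub-document at every page flagged as a first page."""
    sub_documents = []
    for page, first in zip(pages, is_first_page):
        if first or not sub_documents:
            sub_documents.append([])  # open a new sub-document
        sub_documents[-1].append(page)
    return sub_documents
```

For example, four pages with flags `[True, False, True, False]` yield two two-page sub-documents.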
AI Evaluation¶
Extraction AI Evaluation¶
- class konfuzio_sdk.evaluate.ExtractionEvaluation(documents: List[Tuple[konfuzio_sdk.data.Document, konfuzio_sdk.data.Document]], strict: bool = True, use_view_annotations: bool = True, ignore_below_threshold: bool = True, zero_division='warn')
Calculate accuracy measures by using a detailed comparison on Span level.
- calculate()
Calculate and update the data stored within this Evaluation.
- calculate_thresholds()
Calculate optimal thresholds for each Label in the Document set that achieve the highest F1 score, precision and recall.
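The threshold optimization can be illustrated as a brute-force search over candidate confidence thresholds, keeping the one with the best F1. This is a hypothetical standalone sketch with made-up data, not the SDK's implementation:

```python
def best_threshold(confidences, is_correct, candidates):
    """Pick the confidence threshold that maximizes F1.
    `confidences` are predicted scores, `is_correct` marks true matches."""
    best = (None, -1.0)
    for t in candidates:
        tp = sum(1 for c, ok in zip(confidences, is_correct) if c >= t and ok)
        fp = sum(1 for c, ok in zip(confidences, is_correct) if c >= t and not ok)
        fn = sum(1 for c, ok in zip(confidences, is_correct) if c < t and ok)
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        if f1 > best[1]:
            best = (t, f1)
    return best
```

Lowering the threshold trades false negatives for false positives; the search simply keeps the candidate with the highest resulting F1.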
- clf_f1(search=None) Optional[float]
Calculate the F1 Score of the Label classifier.
- Parameters
search – Parameter used to calculate the value for one Data object.
- clf_fn(search=None) int
Return the Label classifier False Negatives of all Spans.
- clf_fp(search=None) int
Return the Label classifier False Positives of all Spans.
- clf_tp(search=None) int
Return the Label classifier True Positives of all Spans.
- f1(search=None) Optional[float]
Calculate the F1 Score of one class.
Please note: as suggested by Opitz et al. (2021), we use the arithmetic mean over individual F1 scores.
“F1 is often used with the intention to assign equal weight to frequent and infrequent classes, we recommend evaluating classifiers with F1 (the arithmetic mean over individual F1 scores), which is significantly more robust towards the error type distribution.”
Opitz, Juri, and Sebastian Burst. “Macro F1 and Macro F1.” arXiv preprint arXiv:1911.03347 (2021). https://arxiv.org/pdf/1911.03347.pdf
- Parameters
search – Parameter used to calculate the value for one class.
- Example:
If you have three Documents, calculate the F-1 Score per Document and use the arithmetic mean.
If you have three Labels, calculate the F-1 Score per Label and use the arithmetic mean.
If you have three Labels and three documents, calculate six F-1 Scores and use the arithmetic mean.
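The recommendation above (macro F1 as the arithmetic mean over per-class F1 scores) can be sketched as:

```python
def f1_score(tp, fp, fn):
    """F1 from TP/FP/FN counts of a single class; 0.0 when undefined."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def macro_f1(per_class_counts):
    """Arithmetic mean over individual per-class F1 scores,
    as recommended by Opitz and Burst (2021)."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)
```

Because each class contributes equally regardless of its frequency, a rare class with poor F1 pulls the macro average down as much as a frequent one would.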
- fn(search=None) int
Return the False Negatives of all Spans.
- fp(search=None) int
Return the False Positives of all Spans.
- get_evaluation_data(search, allow_zero: bool = True) konfuzio_sdk.evaluate.EvaluationCalculator
Get precision, recall, f1, based on TP, FP, FN.
- get_missing_vertical_merge()
Return Spans that should have been merged.
- get_wrong_vertical_merge()
Return Spans that were wrongly merged vertically.
- gt(search=None) int
Return the number of ground-truth Annotations for a given Label.
- precision(search=None) Optional[float]
Calculate the Precision; see f1 for handling imbalanced classes.
- recall(search=None) Optional[float]
Calculate the Recall; see f1 for handling imbalanced classes.
- tn(search=None) int
Return the True Negatives of all Spans.
- tokenizer_f1(search=None) Optional[float]
Calculate the F1 Score of the tokenizer.
- Parameters
search – Parameter used to calculate the value for one Data object.
- tokenizer_fn(search=None) int
Return the tokenizer False Negatives of all Spans.
- tokenizer_fp(search=None) int
Return the tokenizer False Positives of all Spans.
- tokenizer_tp(search=None) int
Return the tokenizer True Positives of all Spans.
- tp(search=None) int
Return the True Positives of all Spans.
Categorization AI Evaluation¶
- class konfuzio_sdk.evaluate.CategorizationEvaluation(categories: List[konfuzio_sdk.data.Category], documents: List[Tuple[konfuzio_sdk.data.Document, konfuzio_sdk.data.Document]], zero_division='warn')
Calculate evaluation measures for the classification task of Document categorization.
- property actual_classes: List[int]
List of ground truth Category IDs.
- calculate()
Calculate and update the data stored within this Evaluation.
- property category_ids: List[int]
List of Category IDs as class labels.
- property category_names: List[str]
List of Category names as class names.
- confusion_matrix() pandas.core.frame.DataFrame
Confusion matrix.
- f1(category: Optional[konfuzio_sdk.data.Category]) Optional[float]
Calculate the global F1 Score or filter it by one Category.
- fn(category: Optional[konfuzio_sdk.data.Category] = None) int
Return the False Negatives of all Documents.
- fp(category: Optional[konfuzio_sdk.data.Category] = None) int
Return the False Positives of all Documents.
- get_evaluation_data(search: Optional[konfuzio_sdk.data.Category] = None, allow_zero: bool = True) konfuzio_sdk.evaluate.EvaluationCalculator
Get precision, recall, f1, based on TP, TN, FP, FN.
- Parameters
search (Category) – A Category to filter for, or None for getting global evaluation results.
allow_zero (bool) – If true, will calculate None for precision and recall when the straightforward application of the formula would otherwise result in 0/0; raises ZeroDivisionError otherwise.
- gt(category: Optional[konfuzio_sdk.data.Category] = None) int
Placeholder for compatibility with Server.
- precision(category: Optional[konfuzio_sdk.data.Category]) Optional[float]
Calculate the global Precision or filter it by one Category.
- property predicted_classes: List[int]
List of predicted Category IDs.
- recall(category: Optional[konfuzio_sdk.data.Category]) Optional[float]
Calculate the global Recall or filter it by one Category.
- tn(category: Optional[konfuzio_sdk.data.Category] = None) int
Return the True Negatives of all Documents.
- tp(category: Optional[konfuzio_sdk.data.Category] = None) int
Return the True Positives of all Documents.
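The confusion_matrix() cells can be illustrated by counting (actual, predicted) Category ID pairs. A standalone sketch using plain lists instead of the pandas DataFrame the SDK returns:

```python
from collections import Counter

def confusion_counts(actual, predicted, category_ids):
    """Build confusion-matrix rows (actual) x columns (predicted)
    by counting (actual, predicted) Category ID pairs."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in category_ids] for a in category_ids]
```

Diagonal cells are correctly categorized Documents; off-diagonal cells show which Category each misclassified Document was assigned instead.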
File Splitting AI Evaluation¶
- class konfuzio_sdk.evaluate.FileSplittingEvaluation(ground_truth_documents: List[konfuzio_sdk.data.Document], prediction_documents: List[konfuzio_sdk.data.Document], zero_division='warn')
Evaluate the quality of the file splitting logic.
- calculate()
Calculate metrics for the File Splitting logic.
- calculate_metrics_by_category()
Calculate metrics by Category independently.
- f1(search: Optional[konfuzio_sdk.data.Category] = None) float
Return F1-measure.
- Parameters
search (Category) – display F1 measure within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
- fn(search: Optional[konfuzio_sdk.data.Category] = None) int
Return first Pages incorrectly predicted as non-first.
- Parameters
search (Category) – display false negatives within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
- fp(search: Optional[konfuzio_sdk.data.Category] = None) int
Return non-first Pages incorrectly predicted as first.
- Parameters
search (Category) – display false positives within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
- get_evaluation_data(search: Optional[konfuzio_sdk.data.Category] = None, allow_zero: bool = True) konfuzio_sdk.evaluate.EvaluationCalculator
Get precision, recall, f1, based on TP, TN, FP, FN.
- Parameters
search (Category) – A Category to filter for, or None for getting global evaluation results.
allow_zero (bool) – If true, will calculate None for precision and recall when the straightforward application of the formula would otherwise result in 0/0; raises ZeroDivisionError otherwise.
- gt(search: Optional[konfuzio_sdk.data.Category] = None) int
Placeholder for compatibility with Server.
- precision(search: Optional[konfuzio_sdk.data.Category] = None) float
Return precision.
- Parameters
search (Category) – display precision within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
- recall(search: Optional[konfuzio_sdk.data.Category] = None) float
Return recall.
- Parameters
search (Category) – display recall within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
- tn(search: Optional[konfuzio_sdk.data.Category] = None) int
Return non-first Pages predicted as non-first.
- Parameters
search (Category) – display true negatives within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
- tp(search: Optional[konfuzio_sdk.data.Category] = None) int
Return correctly predicted first Pages.
- Parameters
search (Category) – display true positives within a certain Category.
- Raises
KeyError – When the Category in search is not present in the Project from which the Documents are.
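The TP/FP/FN/TN definitions used by FileSplittingEvaluation (counted over per-Page first-page flags) can be sketched as follows; boolean lists stand in for the ground-truth and predicted Documents:

```python
def splitting_counts(ground_truth, predicted):
    """Count TP/FP/FN/TN over per-Page is_first_page flags:
    TP = first Pages predicted as first, FN = first Pages missed,
    FP = non-first Pages predicted as first, TN = the rest."""
    tp = sum(1 for g, p in zip(ground_truth, predicted) if g and p)
    fp = sum(1 for g, p in zip(ground_truth, predicted) if not g and p)
    fn = sum(1 for g, p in zip(ground_truth, predicted) if g and not p)
    tn = sum(1 for g, p in zip(ground_truth, predicted) if not g and not p)
    return tp, fp, fn, tn
```

Precision, recall and F1 then follow from these counts as in any binary classification setting.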
Evaluation Calculator¶
- class konfuzio_sdk.evaluate.EvaluationCalculator(tp: int = 0, fp: int = 0, fn: int = 0, tn: int = 0, zero_division='warn')
Calculate precision, recall, f1, based on TP, FP, FN.
- property f1: Optional[float]
Apply F1-score formula.
- Raises
ZeroDivisionError – When precision and recall are 0 and zero_division is set to ‘error’
- metrics_logging()
Log metrics.
- property precision: Optional[float]
Apply precision formula.
- Raises
ZeroDivisionError – When TP and FP are 0 and zero_division is set to ‘error’
- property recall: Optional[float]
Apply recall formula.
- Raises
ZeroDivisionError – When TP and FN are 0 and zero_division is set to ‘error’
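The zero_division handling described above can be sketched for precision; recall and F1 behave analogously. A standalone illustration, not the SDK implementation:

```python
def precision(tp, fp, zero_division='warn'):
    """Precision = TP / (TP + FP), with configurable zero-division handling:
    'warn' or 0 -> 0.0, None -> None, 'error' -> raise ZeroDivisionError."""
    if tp + fp == 0:
        if zero_division == 'error':
            raise ZeroDivisionError('TP and FP are 0')
        return None if zero_division is None else 0.0
    return tp / (tp + fp)
```

With the default 'warn' setting the SDK additionally logs a warning before returning 0, which this sketch omits.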
- konfuzio_sdk.evaluate.grouped(group, target: str)¶
Define which correct element in the predicted group defines the “correct” group id_.
- konfuzio_sdk.evaluate.compare(doc_a, doc_b, only_use_correct=False, use_view_annotations=False, ignore_below_threshold=False, strict=True, id_counter: int = 1, custom_threshold=None) pandas.core.frame.DataFrame ¶
Compare the Annotations of two potentially empty Documents with respect to all Annotations.
- Parameters
doc_a – Document which is assumed to be correct
doc_b – Document which needs to be evaluated
only_use_correct – Unrevised feedback in doc_a is assumed to be correct.
use_view_annotations – Will filter for top confidence annotations. Only available when strict=True. When use_view_annotations=True, it will compare only the highest confidence extractions to the ground truth Annotations. When False (default), it compares all extractions to the ground truth Annotations. This setting is ignored when strict=False, as the Non-Strict Evaluation needs to compare all extractions. For more details see https://help.konfuzio.com/modules/extractions/index.html#evaluation
ignore_below_threshold – Ignore Annotations below detection threshold of the Label (only affects TNs)
strict – Evaluate on a character-exact level without any postprocessing; e.g. an amount Span "5,55 " will not match "5,55" exactly.
- Raises
ValueError – When the Category differs.
- Returns
Evaluation DataFrame
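The strict parameter's effect on the "5,55 " example above can be pictured with a minimal sketch. The SDK's non-strict evaluation applies richer normalization than the whitespace stripping shown here, which is only an illustration:

```python
def strict_match(span_a: str, span_b: str) -> bool:
    """Strict evaluation: character-exact comparison, no postprocessing."""
    return span_a == span_b

def non_strict_match(span_a: str, span_b: str) -> bool:
    """Non-strict sketch: compare after trimming surrounding whitespace
    (the SDK's actual normalization is more elaborate)."""
    return span_a.strip() == span_b.strip()
```

Under strict evaluation a trailing space makes the extraction count as wrong, while a non-strict comparison accepts it.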
Trainer utils¶
Add utility common functions and classes to be used for AI Training.
LoggerCallback¶
- class konfuzio_sdk.trainer.utils.LoggerCallback
Custom callback for logger.info to be used in Trainer.
This callback is called by Trainer at the end of every epoch to log metrics. It replaces calling print and tqdm and calls logger.info instead.
- on_log(args, state, control, logs=None, **kwargs)
Log losses and metrics when training or evaluating using Trainer.
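The callback's behavior (routing metric dicts to logger.info instead of print/tqdm) can be sketched as follows; the `format_metrics` helper is a hypothetical name used for illustration:

```python
import logging

logger = logging.getLogger(__name__)

def format_metrics(logs):
    """Render a metrics dict as a single log line."""
    return ', '.join(f'{key}: {value}' for key, value in sorted(logs.items()))

class LoggerCallbackSketch:
    """Sketch of a Trainer callback that logs epoch metrics via logger.info."""

    def on_log(self, args=None, state=None, control=None, logs=None, **kwargs):
        # Called by Trainer at the end of every epoch with a metrics dict.
        if logs:
            logger.info(format_metrics(logs))
```

Routing through the logging module lets the host application control verbosity and destinations, which plain print/tqdm output does not.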
BalancedLossTrainer¶
- class konfuzio_sdk.trainer.utils.BalancedLossTrainer(model: Optional[Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module]] = None, args: Optional[transformers.training_args.TrainingArguments] = None, data_collator: Optional[DataCollator] = None, train_dataset: Optional[torch.utils.data.dataset.Dataset] = None, eval_dataset: Optional[Union[torch.utils.data.dataset.Dataset, Dict[str, torch.utils.data.dataset.Dataset]]] = None, tokenizer: Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None, model_init: Optional[Callable[[], transformers.modeling_utils.PreTrainedModel]] = None, compute_metrics: Optional[Callable[[transformers.trainer_utils.EvalPrediction], Dict]] = None, callbacks: Optional[List[transformers.trainer_callback.TrainerCallback]] = None, optimizers: Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None), preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None)
Custom trainer with custom loss to leverage class weights.
- compute_loss(model, inputs, return_outputs=False)
Compute weighted cross-entropy loss to compensate for unbalanced datasets.
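The idea behind the weighted loss can be sketched without torch: each example's negative log-likelihood is scaled by its class weight, so rare classes contribute more to the average. A simplified standalone illustration, not the Trainer's implementation:

```python
import math

def weighted_cross_entropy(probabilities, targets, class_weights):
    """Mean of class-weighted negative log-likelihoods.
    `probabilities` holds per-example class probability vectors,
    `targets` the true class indices, `class_weights` one weight per class."""
    losses = [
        class_weights[target] * -math.log(probs[target])
        for probs, target in zip(probabilities, targets)
    ]
    return sum(losses) / len(losses)
```

Giving the minority class a weight above 1 penalizes its misclassifications more, counteracting the majority class's dominance of the gradient.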
- log(logs: Dict[str, float]) None
Log logs on the various objects watching training.
Subclass and override this method to inject custom behavior.
- Parameters
logs (Dict[str, float]) – The values to log.
AI containerization¶
Pydantic schemas¶
Containerization utils¶
- konfuzio_sdk.bento.extraction.utils.prepare_request(request: pydantic.main.BaseModel, project: konfuzio_sdk.data.Project, konfuzio_sdk_version: Optional[str] = None) konfuzio_sdk.data.Document ¶
Receive a request and prepare it for the extraction runner.
- Parameters
request – Unprocessed request.
project – A Project instance.
konfuzio_sdk_version – The version of the Konfuzio SDK used by the embedded AI model. Used to apply backwards compatibility changes for older SDK versions.
- Returns
An instance of a Document class.
- konfuzio_sdk.bento.extraction.utils.process_response(result, schema: pydantic.main.BaseModel = <class 'konfuzio_sdk.bento.extraction.schemas.ExtractResponse20240117'>) pydantic.main.BaseModel ¶
Process a raw response from the runner to contain only selected fields.
- Parameters
result – A raw response to be processed.
schema – A schema of the response.
- Returns
A list of dictionaries with Label Set IDs and Annotation data.
- konfuzio_sdk.bento.extraction.utils.convert_document_to_request(document: konfuzio_sdk.data.Document, schema: pydantic.main.BaseModel = <class 'konfuzio_sdk.bento.extraction.schemas.ExtractRequest20240117'>) pydantic.main.BaseModel ¶
Receive a Document and convert it into a request in accordance with the passed schema.
- Parameters
document – A Document to be converted.
schema – A schema to which the request should adhere.
- Returns
A Document converted in accordance with the schema.
- konfuzio_sdk.bento.extraction.utils.convert_response_to_annotations(response: pydantic.main.BaseModel, document: konfuzio_sdk.data.Document, mappings: Optional[dict] = None) konfuzio_sdk.data.Document ¶
Receive an ExtractResponse and convert it into a list of Annotations to be added to the Document.
- Parameters
response – An ExtractResponse to be converted.
document – A Document to which the annotations should be added.
mappings – A dict with “label_sets” and “labels” keys, both containing mappings from old to new IDs. Original IDs are used if no mapping is provided or if the mapping is not found.
- Returns
The original Document with added Annotations.
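The fallback behavior of the mappings parameter (use the original ID when no mapping is provided or the ID is absent from it) can be sketched as:

```python
def map_id(old_id, mapping=None):
    """Map an old Label or Label Set ID to a new one; fall back to the
    original ID when no mapping is provided or the ID is not in it."""
    if not mapping:
        return old_id
    return mapping.get(old_id, old_id)
```

This keeps conversion robust when only part of a Project's Labels and Label Sets were remapped between environments.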