Data
Handle data from the API.
Span
- class konfuzio_sdk.data.Span(start_offset: int, end_offset: int, annotation=None)
A Span is a sequence of characters or whitespace without a line break.
- bbox() konfuzio_sdk.data.Bbox
Calculate the bounding box of a text sequence.
- eval_dict()
Return any information needed to evaluate the Span.
- property line_index: int
Return index of the line of the Span.
- property normalized
Normalize the offset string.
- property offset_string: Optional[str]
Calculate the offset string of a Span.
- property page: konfuzio_sdk.data.Page
Return Page of Span.
- regex()
Suggest a Regex for the offset string.
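The relationship between the offsets, the offset string and the line index of a Span can be sketched without the SDK. `SimpleSpan` below is a hypothetical stand-in, not the SDK class, and it assumes that `end_offset` is exclusive, as in Python slicing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleSpan:
    """Hypothetical stand-in for a Span: character offsets into a text."""

    start_offset: int
    end_offset: int  # assumed to be exclusive, like a Python slice

    def offset_string(self, text: str) -> str:
        # The offset string is the slice of the text between both offsets.
        return text[self.start_offset:self.end_offset]

    def line_index(self, text: str) -> int:
        # Zero-based line index: count the line breaks before the Span starts.
        return text.count("\n", 0, self.start_offset)


text = "Invoice 2022\nTotal: 12.50 EUR"
span = SimpleSpan(start_offset=20, end_offset=25)
print(span.offset_string(text), span.line_index(text))
```

Because a Span never contains a line break, a value that wraps over two lines needs two Spans, which is why an Annotation can combine several of them.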
Annotation
- class konfuzio_sdk.data.Annotation(document: konfuzio_sdk.data.Document, annotation_set_id: Optional[int] = None, annotation_set: Optional[konfuzio_sdk.data.AnnotationSet] = None, label: Optional[Union[int, konfuzio_sdk.data.Label]] = None, label_set_id: Optional[int] = None, label_set: Union[None, konfuzio_sdk.data.LabelSet] = None, is_correct: bool = False, revised: bool = False, normalized=None, id_: Optional[int] = None, spans=None, accuracy: Optional[float] = None, confidence: Optional[float] = None, created_by: Optional[int] = None, revised_by: Optional[int] = None, translated_string: Optional[str] = None, custom_offset_string: bool = False, offset_string: str = False, *args, **kwargs)
Hold the information about which Label, Label Set and Annotation Set have been assigned, and combine the Spans.
- add_span(span: konfuzio_sdk.data.Span)
Add a Span to an Annotation incl. a duplicate check per Annotation.
- delete() None
Delete Annotation online.
- property end_offset: int
Legacy: One Annotation can have multiple end offsets.
- property eval_dict: List[dict]
Calculate the Span information to evaluate the Annotation.
- get_link()
Get link to the Annotation in the SmartView.
- property is_multiline: int
Calculate if Annotation spans multiple lines of text.
- property normalize: str
Provide one normalized offset string due to legacy.
- property offset_string: List[str]
View the string representation of the Annotation.
- regex()
Return regex of this annotation.
- regex_annotation_generator(regex_list) List[konfuzio_sdk.data.Span]
Build Spans without Labels by regexes.
- Returns
List of Spans sorted by start_offset.
- save(document_annotations: Optional[list] = None) bool
Save Annotation online.
If there is already an Annotation in the same place as the current one, we will not be able to save the current annotation.
In that case, we get the id_ of the original one to be able to track it. The verification of the duplicates is done by checking if the offsets and Label match with any Annotations online. To be sure that we are comparing with the information online, we need to have the Document updated. The update can be done after the request (per annotation) or the updated Annotations can be passed as input of the function (advisable when dealing with big Documents or Documents with many Annotations).
- Parameters
document_annotations – Annotations in the Document (list)
- Returns
True if new Annotation was created
- property spans: List[konfuzio_sdk.data.Span]
Return default entry to get all Spans of the Annotation.
- property start_offset: int
Legacy: One Annotation can have multiple start offsets.
- token_append(new_regex, regex_quality: int)
Append token if it is not a duplicate.
- tokens() List[str]
Create a list of potential tokens based on Spans of this Annotation.
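The duplicate check described for save() compares offsets and Label against the Annotations that already exist online. A minimal sketch of that comparison follows; the `Record` tuple and `find_duplicate` are hypothetical names for illustration and do not reflect the SDK's internal implementation.

```python
from typing import List, Optional, Tuple

# Hypothetical record: (id_, label name, list of (start_offset, end_offset))
Record = Tuple[Optional[int], str, List[Tuple[int, int]]]


def find_duplicate(new: Record, online: List[Record]) -> Optional[int]:
    """Return the id_ of an online record with the same Label and offsets."""
    _, new_label, new_offsets = new
    for id_, label, offsets in online:
        # A duplicate has the same Label and exactly the same offsets.
        if label == new_label and sorted(offsets) == sorted(new_offsets):
            return id_
    return None


online = [(101, "Date", [(10, 20)]), (102, "Amount", [(30, 35)])]
duplicate_id = find_duplicate((None, "Date", [(10, 20)]), online)
no_duplicate = find_duplicate((None, "Amount", [(10, 20)]), online)
print(duplicate_id, no_duplicate)
```

Passing the already updated list of Annotations, as document_annotations allows, avoids one update request per saved Annotation.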
Annotation Set
- class konfuzio_sdk.data.AnnotationSet(document, label_set: konfuzio_sdk.data.LabelSet, id_: Optional[int] = None, **kwargs)
An Annotation Set is a group of Annotations. The Labels of those Annotations refer to the same Label Set.
- property annotations
All Annotations currently in this Annotation Set.
- property end_offset
Calculate the end based on all Annotations currently in this Annotation Set.
- lose_weight()
Delete data of the instance.
- property start_line_index
Calculate starting line of this Annotation Set.
- property start_offset
Calculate the earliest start based on all Annotations currently in this Annotation Set.
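As a rough illustration of start_offset and end_offset, the sketch below takes the earliest start and, as an assumption, the latest end over all Spans of all Annotations. The function name and the plain tuples are hypothetical.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]


def annotation_set_offsets(annotations: List[List[Offsets]]) -> Offsets:
    """Return (earliest start, latest end) over all Spans of all Annotations."""
    spans = [span for annotation in annotations for span in annotation]
    start = min(start for start, _ in spans)
    end = max(end for _, end in spans)  # assumption: the end is the latest end
    return start, end


# Two Annotations: one with a single Span, one with two Spans.
annotations = [[(40, 52)], [(10, 18), (20, 25)]]
set_offsets = annotation_set_offsets(annotations)
print(set_offsets)
```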
Label
- class konfuzio_sdk.data.Label(project, id_: Optional[int] = None, text: Optional[str] = None, get_data_type_display: str = 'Text', text_clean: Optional[str] = None, description: Optional[str] = None, label_sets=None, has_multiple_top_candidates: bool = False, threshold: float = 0.0, *initial_data, **kwargs)
Group Annotations across Label Sets.
- add_label_set(label_set: konfuzio_sdk.data.LabelSet)
Add Label Set to the Label, if it does not exist.
- Parameters
label_set – Label Set to add
- annotations(categories: List[konfuzio_sdk.data.Category], use_correct=True)
Return related Annotations. Consider that one Label can be used across Label Sets in multiple Categories.
- combined_tokens(categories: List[konfuzio_sdk.data.Category])
Create one OR Regex for all relevant Annotations tokens.
- evaluate_regex(regex, category: konfuzio_sdk.data.Category, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None, filtered_group=None, regex_quality=0)
Evaluate a regex on Categories.
The type of regex allows you to group regexes by generality.
- Example:
Three Annotations about the birthdate in two Documents and one regex to be evaluated:
1.doc: “My was born at the 12th of December 1980, you could also say 12.12.1980.” (2 Annotations)
2.doc: “My was born at 12.06.1997.” (1 Annotation)
regex: dd.dd.dddd (without escaped characters for easier reading)
stats:
total_correct_findings: 2
correct_label_annotations: 3
total_findings: 2 –> precision 100 %
num_docs_matched: 2
Project.documents: 2 –> Document recall 100%
- find_regex(category: konfuzio_sdk.data.Category) List[str]
Find the best combination of regex in the list of all regex proposed by Annotations.
- find_tokens(category: konfuzio_sdk.data.Category) List
Calculate the regex token of a label, which matches all offset_strings of all correct Annotations.
- regex(categories: List[konfuzio_sdk.data.Category], update=False) List
Calculate regex to be used in the LabelExtractionModel.
- tokens(categories: List[konfuzio_sdk.data.Category], update=False) dict
Calculate tokens to be used in the regex of the Label.
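The statistics in the evaluate_regex example can be reproduced with the standard `re` module. This is a sketch of the arithmetic only: the dictionary layout is hypothetical, the choice of annotated strings is one reading of the example, and findings are compared by string rather than by offsets to keep the code short.

```python
import re

# Hypothetical layout: text plus the strings assumed to be annotated as birthdate.
documents = {
    "1.doc": {
        "text": "My was born at the 12th of December 1980, "
                "you could also say 12.12.1980.",
        "annotated": ["12th of December 1980", "12.12.1980"],
    },
    "2.doc": {
        "text": "My was born at 12.06.1997.",
        "annotated": ["12.06.1997"],
    },
}
regex = re.compile(r"\d\d\.\d\d\.\d\d\d\d")  # dd.dd.dddd with escaped dots

total_findings = 0
total_correct_findings = 0
num_docs_matched = 0
for doc in documents.values():
    findings = regex.findall(doc["text"])
    total_findings += len(findings)
    # A finding counts as correct if it equals an annotated string.
    total_correct_findings += sum(f in doc["annotated"] for f in findings)
    num_docs_matched += bool(findings)

correct_label_annotations = sum(len(d["annotated"]) for d in documents.values())
precision = total_correct_findings / total_findings
document_recall = num_docs_matched / len(documents)
print(precision, document_recall)
```

The regex reaches 100 % precision and 100 % Document recall although it finds only two of the three Annotations, because Document recall counts matched Documents, not matched Annotations.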
Label Set
- class konfuzio_sdk.data.LabelSet(project, labels=None, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, is_default=False, categories=None, has_multiple_annotation_sets=False, **kwargs)
A Label Set is a group of Labels.
- add_category(category: konfuzio_sdk.data.Category)
Add Category to the Label Set, if it does not exist.
- Parameters
category – Category to add to the Label Set
- add_label(label)
Add Label to Label Set, if it does not exist.
- Parameters
label – Label ID to be added
Category
- class konfuzio_sdk.data.Category(project, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, *args, **kwargs)
Group Documents in a Project.
- add_label_set(label_set)
Add Label Set to Category.
- documents()
Filter for Documents of this Category.
- property labels
Return the Labels that belong to the Category and its Label Sets.
- test_documents()
Filter for test Documents of this Category.
Document
- class konfuzio_sdk.data.Document(project: konfuzio_sdk.data.Project, id_: Optional[int] = None, file_url: Optional[str] = None, status=None, data_file_name: Optional[str] = None, is_dataset: Optional[bool] = None, dataset_status: Optional[int] = None, updated_at: Optional[str] = None, assignee: Optional[int] = None, category_template: Optional[int] = None, category: Optional[konfuzio_sdk.data.Category] = None, text: Optional[str] = None, bbox: Optional[dict] = None, pages: Optional[list] = None, update: Optional[bool] = None, copy_of_id: Optional[int] = None, *args, **kwargs)
Access the information about one document, which is available online.
- add_annotation(annotation: konfuzio_sdk.data.Annotation)
Add an annotation to a document.
- Parameters
annotation – Annotation to add in the document
- Returns
Input annotation.
- add_annotation_set(annotation_set: konfuzio_sdk.data.AnnotationSet)
Add an Annotation Set to the Document.
- add_page(page: konfuzio_sdk.data.Page)
Add a Page to a Document.
- annotation_sets()
Return the Annotation Sets of the Document.
- annotations(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = True, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Annotation]
Filter available annotations.
- Parameters
label – Label for which to filter the Annotations.
use_correct – Whether to filter by correct Annotations.
- Returns
Annotations in the document.
- property bboxes: Dict[int, konfuzio_sdk.data.Bbox]
Use the cached bbox version.
- check_annotations(update_document: bool = False) bool
Check if Annotations are valid - no duplicates and correct Category.
- check_bbox() bool
Please see get_bbox of the Document.
- delete()
Delete all local information for the document.
- property document_folder
Get the path to the folder where all the Document information is cached locally.
- download_document_details()
Retrieve data from a Document online in case the Document has finished processing.
- eval_dict(use_correct=False) List[dict]
Use this dict to evaluate Documents. Note that one entry is created for every Span of an Annotation.
- evaluate_regex(regex, label: konfuzio_sdk.data.Label, annotations: Optional[List[konfuzio_sdk.data.Annotation]] = None, filtered_group=None)
Evaluate a regex based on the Document.
- property file_path
Return path to file.
- get_annotation_set_by_id(id_: int) konfuzio_sdk.data.AnnotationSet
Return an Annotation Set by ID.
- Parameters
id_ – ID of the Annotation Set to get.
- get_annotations() List[konfuzio_sdk.data.Annotation]
Get Annotations of the Document.
- get_bbox() Dict
Get bbox information per character of file. We don’t store bbox as an attribute to save memory.
- Returns
Bounding box information per character in the document.
- get_file(ocr_version: bool = True, update: bool = False)
Get OCR version of the original file.
- Parameters
ocr_version – Bool to get the ocr version of the original file
update – Update the downloaded file even if it is already available
- Returns
Path to the selected file.
- get_images(update: bool = False)
Get Document Pages as PNG images.
- Parameters
update – Update the downloaded images even if they are already available
- Returns
Path to PNG files.
- get_page_by_index(page_index: int)
Return the Page by index.
- get_text_in_bio_scheme(update=False) List[Tuple[str, str]]
Get the text of the Document in the BIO scheme.
- Parameters
update – Update the BIO annotations even if they are already available
- Returns
list of tuples with each word in the text and the respective label
- property hocr
Get HOCR of Document. Once loaded, it is stored in memory.
- property no_label_annotation_set: konfuzio_sdk.data.AnnotationSet
Return the Annotation Set for project.no_label Annotations.
We need to load the Annotation Sets from the Server first (call self.annotation_sets()). If we create the no_label_annotation_set in the first place, the data from the Server is not loaded anymore because _annotation_sets will no longer be None.
- property number_of_lines: int
Calculate the number of lines.
- property number_of_pages: int
Calculate the number of pages.
- property ocr_file_path
Return path to OCR PDF file.
- pages() List[konfuzio_sdk.data.Page]
Get Pages of Document.
- regex(start_offset: int, end_offset: int, search=None, max_findings_per_page=100) List[str]
Suggest a list of regex which can be used to get the Span of a document.
- spans(label: Optional[konfuzio_sdk.data.Label] = None, use_correct: bool = False, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) List[konfuzio_sdk.data.Span]
Return all Spans of the Document.
- property text
Get Document text. Once loaded, it is stored in memory.
- update()
Update document information.
- view_annotations() List[konfuzio_sdk.data.Annotation]
Get the best Annotations, where the Spans are not overlapping.
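For get_text_in_bio_scheme above, the returned list of (word, label) tuples follows the BIO scheme: the first word of a labeled Span is tagged B-, the following words I-, and everything else O. The sketch below shows this on plain offsets. The function name, the whitespace-based word split and the (start, end, label) tuples are assumptions for illustration, not the SDK's implementation.

```python
import re
from typing import List, Tuple


def text_in_bio_scheme(
    text: str, spans: List[Tuple[int, int, str]]
) -> List[Tuple[str, str]]:
    """Tag every whitespace-separated word with B-<label>, I-<label> or O."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"  # default: the word is outside of any labeled Span
        for start, end, label in spans:
            if start <= match.start() and match.end() <= end:
                # The first word of a Span begins it, later words are inside.
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged


text = "Born on 12 December 1980 in Bonn"
spans = [(8, 24, "Birthdate"), (28, 32, "City")]
bio = text_in_bio_scheme(text, spans)
for word, tag in bio:
    print(word, tag)
```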
Project
- class konfuzio_sdk.data.Project(id_: Optional[int], project_folder=None, update=False, **kwargs)
Access the information of a Project.
- add_category(category: konfuzio_sdk.data.Category)
Add Category to Project, if it does not exist.
- Parameters
category – Category to add in the Project
- add_document(document: konfuzio_sdk.data.Document)
Add Document to Project, if it does not exist.
- add_label(label: konfuzio_sdk.data.Label)
Add Label to Project, if it does not exist.
- Parameters
label – Label to add in the Project
- add_label_set(label_set: konfuzio_sdk.data.LabelSet)
Add Label Set to Project, if it does not exist.
- Parameters
label_set – Label Set to add in the Project
- delete()
Delete the Project folder.
- property documents
Return Documents with status training.
- property documents_folder: str
Calculate the documents folder of the Project.
- property excluded_documents
Return Documents which have been excluded.
- get(update=False)
Access meta information of the Project.
- Parameters
update – Update the downloaded information even if it is already available
- get_categories()
Load Categories for all Label Sets in the Project.
- get_category_by_id(id_: int) konfuzio_sdk.data.Category
Return a Category by ID.
- Parameters
id_ – ID of the Category to get.
- get_document_by_id(document_id: int) konfuzio_sdk.data.Document
Return Document by its ID.
- get_label_by_id(id_: int) konfuzio_sdk.data.Label
Return a Label by ID.
- Parameters
id_ – ID of the Label to get.
- get_label_by_name(name: str) konfuzio_sdk.data.Label
Return Label by its name.
- get_label_set_by_id(id_: int) konfuzio_sdk.data.LabelSet
Return a Label Set by ID.
- Parameters
id_ – ID of the Label Set to get.
- get_label_set_by_name(name: str) konfuzio_sdk.data.LabelSet
Return a Label Set by its name.
- get_label_sets()
Get Label Sets in the Project.
- get_labels() konfuzio_sdk.data.Label
Get ID and name of any Label in the Project.
- get_meta()
Get the list of all Documents in the Project and their information.
- Returns
Information of the Documents in the Project.
- init_or_update_document()
Initialize Document to then decide about full, incremental or no update.
- lose_weight()
Delete data of the instance.
- property model_folder: str
Calculate the model folder of the Project.
- property no_status_documents
Return Documents with no status.
- property preparation_documents
Return Documents with status preparation.
- property project_folder: str
Calculate the data folder of the Project.
- property regex_folder: str
Calculate the regex folder of the Project.
- property test_documents
Return Documents with status test.
- property virtual_documents
Return Documents created virtually.
- write_project_files()
Overwrite files with Project, Label, Label Set information.
Tokenizer
Abstract Tokenizer
Generic tokenizer.
- class konfuzio_sdk.tokenizer.base.AbstractTokenizer
Abstract definition of a Tokenizer.
- evaluate(document: konfuzio_sdk.data.Document) pandas.core.frame.DataFrame
Compare a Document with its tokenized version.
- Parameters
document – Document to evaluate
- Returns
Evaluation DataFrame and Processing time DataFrame.
- abstract fit(category: konfuzio_sdk.data.Category)
Fit the tokenizer to the Documents of the Category.
- get_runtime_info() pandas.core.frame.DataFrame
Get the processing runtime information as DataFrame.
- Returns
DataFrame containing the processing duration of every step of the tokenization.
- lose_weight()
Delete processing steps.
- missing_spans(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Apply the Tokenizer to a Document and remove all Spans that it can find.
Use this approach to work sequentially on the remaining Spans after a Tokenizer has run on a Document.
- Parameters
document – Document to which the Tokenizer is applied
- Returns
A new Document containing all missing Spans in a copied version of the Document.
- abstract tokenize(document: konfuzio_sdk.data.Document)
Create Annotations with 1 Span based on the result of the Tokenizer.
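The idea behind missing_spans can be sketched as a set difference between the annotated Spans and the Spans a tokenizer finds. The helper below is hypothetical and works on plain offsets and a regex instead of Document and Tokenizer objects.

```python
import re
from typing import List, Set, Tuple

Offsets = Tuple[int, int]


def missing_spans(text: str, annotated: Set[Offsets], regex: str) -> List[Offsets]:
    """Return the annotated offsets that the regex does not reproduce exactly."""
    found = {match.span() for match in re.finditer(regex, text)}
    return sorted(annotated - found)


text = "street Name 1-2b,"
annotated = {(0, 6), (7, 11), (12, 16)}  # "street", "Name", "1-2b"
remaining = missing_spans(text, annotated, r"\S+")
print(remaining)
```

Splitting on whitespace alone finds "1-2b," including the trailing comma, so the annotated Span "1-2b" remains and can be handed to the next tokenizer.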
- class konfuzio_sdk.tokenizer.base.ListTokenizer(tokenizers: List[konfuzio_sdk.tokenizer.base.AbstractTokenizer])
Use multiple tokenizers.
- fit(category: konfuzio_sdk.data.Category)
Call fit on all tokenizers.
- lose_weight()
Delete processing steps.
- tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Run tokenize in the given order on a Document.
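Running several tokenizers in a fixed order is a composite pattern, sketched below under simplifying assumptions. All class names are hypothetical, the regexes are illustrative rather than the ones used by the SDK, and the tokenizers operate on a text plus a set of offsets instead of a Document.

```python
import re
from abc import ABC, abstractmethod
from typing import List, Set, Tuple

Offsets = Tuple[int, int]


class SketchTokenizer(ABC):
    """Hypothetical base class: a tokenizer adds offsets to a set of Spans."""

    @abstractmethod
    def tokenize(self, text: str, spans: Set[Offsets]) -> Set[Offsets]:
        ...


class SketchRegexTokenizer(SketchTokenizer):
    def __init__(self, regex: str):
        self.regex = regex

    def tokenize(self, text, spans):
        # Every match becomes a Span; the set drops exact duplicates.
        return spans | {match.span() for match in re.finditer(self.regex, text)}


class SketchListTokenizer(SketchTokenizer):
    def __init__(self, tokenizers: List[SketchTokenizer]):
        self.tokenizers = tokenizers

    def tokenize(self, text, spans):
        # Run the tokenizers in the given order on the same text.
        for tokenizer in self.tokenizers:
            spans = tokenizer.tokenize(text, spans)
        return spans


text = "date 01. 01. 2022"
combined = SketchListTokenizer(
    [SketchRegexTokenizer(r"\S+"), SketchRegexTokenizer(r"\d[\d. ]*\d")]
)
result = sorted(combined.tokenize(text, set()))
print([text[start:end] for start, end in result])
```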
Rule Based Tokenizer
Regex tokenizers.
- class konfuzio_sdk.tokenizer.regex.CapitalizedTextTokenizer
Tokenizer based on capitalized text.
- Example:
“Company is Company A&B GmbH now” -> “Company A&B GmbH”
- class konfuzio_sdk.tokenizer.regex.ColonOrWhitespacePrecededTokenizer
Tokenizer based on text preceded by colon or whitespace.
- Example:
“write to: name” -> “name”
- class konfuzio_sdk.tokenizer.regex.ColonPrecededTokenizer
Tokenizer based on text preceded by colon.
- Example:
“write to: name” -> “name”
- class konfuzio_sdk.tokenizer.regex.ConnectedTextTokenizer
Tokenizer based on text connected by 1 whitespace.
- Example:
r“This is \na description. Occupies a paragraph.” -> “This is”, “a description. Occupies a paragraph.”
- class konfuzio_sdk.tokenizer.regex.LineUntilCommaTokenizer
Tokenizer based on the text of a line up to a comma.
- Example:
“\n Company und A&B GmbH,\n” -> “Company und A&B GmbH”
- class konfuzio_sdk.tokenizer.regex.NonTextTokenizer
Tokenizer based on non text - numbers and separators.
- Example:
“date 01. 01. 2022” -> “01. 01. 2022”
- class konfuzio_sdk.tokenizer.regex.NumbersTokenizer
Tokenizer based on numbers.
- Example:
“N. 1242022 123 ” -> “1242022 123”
- class konfuzio_sdk.tokenizer.regex.RegexTokenizer(regex: str)
Tokenizer based on a single regex.
- fit(category: konfuzio_sdk.data.Category)
Fit the tokenizer to the Documents of the Category.
- tokenize(document: konfuzio_sdk.data.Document) konfuzio_sdk.data.Document
Create Annotations with 1 Span based on the result of the Tokenizer.
- Parameters
document – Document to tokenize, can have been tokenized before
- Returns
Document with Spans created by the Tokenizer.
- class konfuzio_sdk.tokenizer.regex.WhitespaceNoPunctuationTokenizer
Tokenizer based on whitespaces without punctuation.
- Example:
“street Name 1-2b,” -> “street”, “Name”, “1-2b”
- class konfuzio_sdk.tokenizer.regex.WhitespaceTokenizer
Tokenizer based on whitespaces.
- Example:
“street Name 1-2b,” -> “street”, “Name”, “1-2b,”
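The two whitespace examples can be reproduced with the standard `re` module. The functions below are a sketch with hypothetical names. They do not use the regexes of the SDK classes, and stripping `string.punctuation` from both ends of a token is an assumption about what "without punctuation" means.

```python
import re
import string
from typing import List


def whitespace_tokens(text: str) -> List[str]:
    """Split on whitespace and keep punctuation attached to the tokens."""
    return re.findall(r"\S+", text)


def whitespace_no_punctuation_tokens(text: str) -> List[str]:
    """Split on whitespace and strip punctuation from both ends of a token."""
    tokens = (token.strip(string.punctuation) for token in re.findall(r"\S+", text))
    # Drop tokens that consisted of punctuation only.
    return [token for token in tokens if token]


text = "street Name 1-2b,"
print(whitespace_tokens(text))
print(whitespace_no_punctuation_tokens(text))
```

The hyphen inside "1-2b" survives because only leading and trailing punctuation is removed.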