Source Code Reference

API

Connect to the Konfuzio Server to receive or send data.

class konfuzio_sdk.api.TimeoutHTTPAdapter(timeout, *args, **kwargs)

Combine a retry strategy with a timeout strategy.

send(request, *args, **kwargs)

Use timeout policy if not otherwise declared.
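
As an illustration, a common way to combine a retry strategy with a default timeout in requests is to subclass HTTPAdapter. The following is a minimal sketch of that pattern, not the SDK's actual implementation; the class name SketchTimeoutHTTPAdapter, the retry settings and the 5-second timeout are assumptions chosen for the example.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


class SketchTimeoutHTTPAdapter(HTTPAdapter):
    """HTTPAdapter that applies a default timeout to every request (hypothetical name)."""

    def __init__(self, timeout, *args, **kwargs):
        self.timeout = timeout
        super().__init__(*args, **kwargs)

    def send(self, request, *args, **kwargs):
        # Only fall back to the default if the caller did not declare a timeout.
        if kwargs.get("timeout") is None:
            kwargs["timeout"] = self.timeout
        return super().send(request, *args, **kwargs)


# Assumed retry policy: up to three retries with backoff on transient server errors.
retry = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session = requests.Session()
session.mount("https://", SketchTimeoutHTTPAdapter(timeout=5, max_retries=retry))
```

A session built this way could be passed as the session argument of the functions below, assuming they accept any requests.Session.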

konfuzio_sdk.api.create_label(project_id: int, label_name: str, label_sets: list, session=<requests.sessions.Session object>, description=None, has_multiple_top_candidates=None, data_type=None) -> List[dict]

Create a Label and associate it with Label Sets.

Parameters:
  • project_id – Project ID where to create the label

  • label_name – Name for the label

  • label_sets – Label sets that use the label

  • session – Konfuzio session with Retry and Timeout policy

  • description – Text to describe the label

  • has_multiple_top_candidates – If multiple Annotations can be correct in a single Annotation Set

  • data_type – Expected data type of any Span of Annotations related to this Label.

Returns:

Label ID in the Konfuzio Server.

konfuzio_sdk.api.create_new_project(project_name, session=<requests.sessions.Session object>)

Create a new Project for the user.

Parameters:
  • project_name – name of the project you want to create

  • session – Konfuzio session with Retry and Timeout policy

Returns:

Response object

konfuzio_sdk.api.delete_document_annotation(document_id: int, annotation_id: int, project_id: int, session=<requests.sessions.Session object>)

Delete a given Annotation of the given document.

Parameters:
  • document_id – ID of the document

  • annotation_id – ID of the annotation

  • project_id – ID of the project

  • session – Konfuzio session with Retry and Timeout policy

Returns:

Response status.

konfuzio_sdk.api.delete_file_konfuzio_api(document_id: int, session=<requests.sessions.Session object>)

Delete Document by ID via Konfuzio API.

Parameters:
  • document_id – ID of the document

  • session – Konfuzio session with Retry and Timeout policy

Returns:

File id_ in Konfuzio Server.

konfuzio_sdk.api.download_file_konfuzio_api(document_id: int, ocr: bool = True, session=<requests.sessions.Session object>)

Download file from the Konfuzio server using the Document id_.

Django authentication is form-based, whereas DRF uses BasicAuth.

Parameters:
  • document_id – ID of the document

  • ocr – Bool to get the ocr version of the document

  • session – Konfuzio session with Retry and Timeout policy

Returns:

The downloaded file.

konfuzio_sdk.api.get_document_details(document_id: int, project_id: int, session=<requests.sessions.Session object>, extra_fields: str = '')

Use the text-extraction server to retrieve the data from a document.

Parameters:
  • document_id – ID of the document

  • project_id – ID of the Project

  • session – Konfuzio session with Retry and Timeout policy

  • extra_fields – Comma-separated extra fields to retrieve with the document, such as bounding boxes and HOCR, e.g. “bbox,hocr”. This parameter is a hotfix.

Returns:

Data of the document.

konfuzio_sdk.api.get_meta_of_files(project_id: int, limit: int = 1000, session=<requests.sessions.Session object>) -> List[dict]

Get meta information of Documents in a Project.

Parameters:
  • project_id – ID of the Project

  • limit – Number of Documents per Page

  • session – Konfuzio session with Retry and Timeout policy

Returns:

Sorted Document names in the format {id_: ‘pdf_name’}.

konfuzio_sdk.api.get_project_details(project_id: int, session=<requests.sessions.Session object>) -> dict

Get Label Sets available in Project.

Parameters:
  • project_id – ID of the Project

  • session – Konfuzio session with Retry and Timeout policy

Returns:

Sorted Label Sets.

konfuzio_sdk.api.get_project_list(session=<requests.sessions.Session object>)

Get the list of all Projects for the user.

Parameters:

session – Konfuzio session with Retry and Timeout policy

Returns:

Response object

konfuzio_sdk.api.get_results_from_segmentation(doc_id: int, project_id: int, session=<requests.sessions.Session object>) -> List[List[dict]]

Get bbox results from segmentation endpoint.

Parameters:
  • doc_id – ID of the document

  • project_id – ID of the Project.

  • session – Konfuzio session with Retry and Timeout policy

konfuzio_sdk.api.init_env(user: str, password: str, host: str = 'https://app.konfuzio.com', working_directory='/home/runner/work/konfuzio-sdk/konfuzio-sdk', file_ending: str = '.env')

Add the .env file to the working directory.

Parameters:
  • user – Username to log in to the host

  • password – Password to log in to the host

  • host – URL of host.

  • working_directory – Directory where file should be added

  • file_ending – Ending of file.

konfuzio_sdk.api.post_document_annotation(document_id: int, project_id: int, label_id: int, label_set_id: int, confidence: Optional[float] = None, revised: bool = False, is_correct: bool = False, annotation_set=None, session=<requests.sessions.Session object>, **kwargs)

Add an Annotation to an existing document.

For the Annotation Set definition, we can:
  • define the id_ of the Annotation Set to which the Annotation should belong (annotation_set=x (int), define_annotation_set=True)

  • pass it as None, in which case a new Annotation Set is created (annotation_set=None, define_annotation_set=True)

  • not pass the Annotation Set field, in which case a new Annotation Set is created if none exists, or the Annotation is added to the previously created Annotation Set (define_annotation_set=False)

Parameters:
  • document_id – ID of the file

  • project_id – ID of the project

  • label_id – ID of the Label

  • label_set_id – ID of the Label Set where the Annotation belongs

  • confidence – Confidence of the Annotation, still called Accuracy by text-annotation

  • revised – If the Annotation is revised or not (bool)

  • is_correct – If the Annotation is correct or not (bool)

  • annotation_set – Annotation Set to connect to the server

  • session – Konfuzio session with Retry and Timeout policy

Returns:

Response status.

konfuzio_sdk.api.update_document_konfuzio_api(document_id: int, session=<requests.sessions.Session object>, **kwargs)

Update an existing Document via Konfuzio API.

Parameters:
  • document_id – ID of the document

  • session – Konfuzio session with Retry and Timeout policy

Returns:

Response status.

konfuzio_sdk.api.upload_ai_model(ai_model_path: str, category_ids: Optional[List[int]] = None, session=<requests.sessions.Session object>)

Upload an ai_model to the text-annotation server.

Parameters:
  • ai_model_path – Path to the ai_model

  • category_ids – IDs of the Categories for which the model should become available after upload.

  • session – session to connect to server

Returns:

konfuzio_sdk.api.upload_file_konfuzio_api(filepath: str, project_id: int, dataset_status: int = 0, session=<requests.sessions.Session object>, category_id: Union[None, int] = None)

Upload Document to Konfuzio API.

Parameters:
  • filepath – Path to file to be uploaded

  • project_id – ID of the project

  • session – Konfuzio session with Retry and Timeout policy

  • dataset_status – Set data set status of the document.

  • category_id – Define a Category the Document belongs to

Returns:

Response status.

CLI

Command Line interface to the konfuzio_sdk package.

konfuzio_sdk.cli.credentials()

Retrieve user input.

konfuzio_sdk.cli.main()

CLI of Konfuzio SDK.

Data

Handle data from the API.

Span

class konfuzio_sdk.data.Span(start_offset: int, end_offset: int, annotation=None)

A Span is a sequence of characters or whitespaces without a line break.

bbox()

Calculate the bounding box of a text sequence.

eval_dict()

Return any information needed to evaluate the Span.

property line_index: int

Calculate the index of the line on which the span starts, first line has index 0.

property normalized

Normalize the offset string.

property offset_string: Optional[str]

Calculate the offset string of a Span.

property page_height: Optional[float]

Get height of the page of the Span. Used to calculate relative position on page.

property page_index: Optional[int]

Calculate the index of the page on which the span starts, first page has index 0.

property page_width: Optional[float]

Get width of the page of the Span. Used to calculate relative position on page.

Annotation

class konfuzio_sdk.data.Annotation(document: Document, annotation_set_id: Optional[int] = None, annotation_set: Optional[AnnotationSet] = None, label: Optional[Union[int, Label]] = None, label_set_id: Optional[int] = None, label_set: Union[None, LabelSet] = None, is_correct: bool = False, revised: bool = False, normalized=None, id_: Optional[int] = None, spans=None, accuracy: Optional[float] = None, confidence: Optional[float] = None, created_by: Optional[int] = None, revised_by: Optional[int] = None, translated_string: Optional[str] = None, custom_offset_string: bool = False, offset_string: str = False, *args, **kwargs)

Combine Spans and hold the Label, Label Set and Annotation Set assigned to them.

add_span(span: Span)

Add a Span to an Annotation incl. a duplicate check per Annotation.

delete() -> None

Delete Annotation online.

property end_offset: int

Legacy: One Annotation can have multiple end offsets.

property eval_dict: List[dict]

Calculate the Span information to evaluate the Annotation.

get_link()

Get link to the Annotation in the SmartView.

property is_multiline: int

Calculate if Annotation spans multiple lines of text.

property is_online: Optional[int]

Define if the Annotation is saved to the server.

property normalize: str

Provide one normalized offset string due to legacy.

property offset_string: List[str]

View the string representation of the Annotation.

regex()

Return regex of this annotation.

regex_annotation_generator(regex_list) -> List[Span]

Build Spans without Labels by regexes.

Returns:

Return sorted list of Spans by start_offset

save(document_annotations: Optional[list] = None) -> bool

Save Annotation online.

If there is already an Annotation in the same place as the current one, we will not be able to save the current annotation.

In that case, we get the id_ of the original one to be able to track it. The verification of the duplicates is done by checking if the offsets and Label match with any Annotations online. To be sure that we are comparing with the information online, we need to have the Document updated. The update can be done after the request (per annotation) or the updated Annotations can be passed as input of the function (advisable when dealing with big Documents or Documents with many Annotations).

Parameters:

document_annotations – Annotations in the Document (list)

Returns:

True if new Annotation was created

property spans: List[Span]

Return default entry to get all Spans of the Annotation.

property start_offset: int

Legacy: One Annotation can have multiple start offsets.

token_append(new_regex, regex_quality: int)

Append token if it is not a duplicate.

tokens() -> List[str]

Create a list of potential tokens based on Spans of this Annotation.

Annotation Set

class konfuzio_sdk.data.AnnotationSet(document, label_set: LabelSet, id_: Optional[int] = None, **kwargs)

An Annotation Set is a group of Annotations. The Labels of those Annotations refer to the same Label Set.

property annotations

All Annotations currently in this Annotation Set.

property end_offset

Calculate the end based on all Annotations currently in this Annotation Set.

lose_weight()

Delete data of the instance.

property start_line_index

Calculate starting line of this Annotation Set.

property start_offset

Calculate the earliest start based on all Annotations currently in this Annotation Set.

Label

class konfuzio_sdk.data.Label(project, id_: Optional[int] = None, text: Optional[str] = None, get_data_type_display: str = 'Text', text_clean: Optional[str] = None, description: Optional[str] = None, label_sets=None, has_multiple_top_candidates: bool = False, threshold: Optional[float] = None, *initial_data, **kwargs)

Group Annotations across Label Sets.

add_label_set(label_set: LabelSet)

Add Label Set to label, if it does not exist.

Parameters:

label_set – Label set to add

annotations(categories: List[Category], use_correct=True)

Return related Annotations. Consider that one Label can be used across Label Sets in multiple Categories.

check_tokens(categories: List[Category])

Check whether a list of regexes finds the Annotations. Log Annotations that cannot be found.

combined_tokens(categories: List[Category])

Create one OR Regex for all relevant Annotations tokens.

evaluate_regex(regex, category: Category, annotations: Optional[List[Annotation]] = None, filtered_group=None, regex_quality=0)

Evaluate a regex on Categories.

Type of regex allows you to group regex by generality

Example:

Three Annotations about the birthdate in two Documents and one regex to be evaluated:

  • 1.doc: “My was born at the 12th of December 1980, you could also say 12.12.1980.” (2 Annotations)

  • 2.doc: “My was born at 12.06.1997.” (1 Annotation)

  • regex: dd.dd.dddd (without escaped characters for easier reading)

stats:

  • total_correct_findings: 2

  • correct_label_annotations: 3

  • total_findings: 2 --> precision 100 %

  • num_docs_matched: 2

  • Project.documents: 2 --> Document recall 100 %
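
The statistics in this example can be reproduced with the standard re module. This is a rough sketch of the bookkeeping only, not the SDK's evaluation code; the dictionaries documents and annotated_strings are hypothetical stand-ins for Documents and their correct Annotations.

```python
import re

# Example documents and their annotated birthdate strings, taken from the example above.
documents = {
    1: "My was born at the 12th of December 1980, you could also say 12.12.1980.",
    2: "My was born at 12.06.1997.",
}
annotated_strings = {
    1: ["12th of December 1980", "12.12.1980"],
    2: ["12.06.1997"],
}

# dd.dd.dddd with the dots escaped.
pattern = re.compile(r"\d\d\.\d\d\.\d\d\d\d")

total_findings = 0
total_correct_findings = 0
num_docs_matched = 0
for doc_id, text in documents.items():
    # Character offsets (start, end) of every annotation in this document.
    annotation_offsets = set()
    for annotated in annotated_strings[doc_id]:
        start = text.index(annotated)
        annotation_offsets.add((start, start + len(annotated)))
    findings = [match.span() for match in pattern.finditer(text)]
    total_findings += len(findings)
    # A finding counts as correct only if its offsets equal an annotation's offsets.
    total_correct_findings += sum(span in annotation_offsets for span in findings)
    num_docs_matched += bool(findings)

correct_label_annotations = sum(len(strings) for strings in annotated_strings.values())
precision = total_correct_findings / total_findings
document_recall = num_docs_matched / len(documents)
```

The regex finds one date per Document, and both findings coincide with an Annotation, so precision and Document recall are both 1.0 even though one of the three Annotations is never matched.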

find_regex(category: Category, annotations: Optional[List[Annotation]] = None) -> List[str]

Find the best combination of regex in the list of all regex proposed by Annotations.

find_tokens(category: Category) -> List

Calculate the regex token of a label, which matches all offset_strings of all correct Annotations.

regex(categories: List[Category], update=False) -> List

Calculate regex to be used in the LabelExtractionModel.

tokens(categories: List[Category], update=False) -> dict

Calculate tokens to be used in the regex of the Label.

Label Set

class konfuzio_sdk.data.LabelSet(project, labels=None, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, is_default=False, categories=None, has_multiple_annotation_sets=False, **kwargs)

A Label Set is a group of labels.

add_category(category: Category)

Add Category to Label Set, if it does not exist.

Parameters:

category – Category to add to the Label Set

add_label(label)

Add Label to Label Set, if it does not exist.

Parameters:

label – Label ID to be added

Category

class konfuzio_sdk.data.Category(project, id_: Optional[int] = None, name: Optional[str] = None, name_clean: Optional[str] = None, *args, **kwargs)

Group Documents in a Project.

add_label_set(label_set)

Add Label Set to Category.

documents()

Filter for Documents of this Category.

test_documents()

Filter for test Documents of this Category.

Document

class konfuzio_sdk.data.Document(project, id_: Optional[int] = None, file_url: Optional[str] = None, status=None, data_file_name: Optional[str] = None, is_dataset: Optional[bool] = None, dataset_status: Optional[int] = None, updated_at: Optional[str] = None, assignee: Optional[int] = None, category_template: Optional[int] = None, category: Optional[Category] = None, text: Optional[str] = None, bbox: Optional[dict] = None, pages: Optional[list] = None, update: Optional[bool] = None, copy_of_id: Optional[int] = None, *args, **kwargs)

Access the information about one document, which is available online.

add_annotation(annotation: Annotation)

Add an annotation to a document.

Parameters:

annotation – Annotation to add in the document

Returns:

Input annotation.

add_annotation_set(annotation_set: AnnotationSet)

Add the Annotation Sets to the document.

annotation_sets()

Return Annotation Sets of Documents.

annotations(label: Optional[Label] = None, use_correct: bool = True, start_offset: int = 0, end_offset: Optional[int] = None, fill: bool = False) -> List[Annotation]

Filter available annotations.

Parameters:
  • label – Label for which to filter the Annotations.

  • use_correct – If to filter by correct annotations.

Returns:

Annotations in the document.

check_annotations(update_document: bool = False) -> bool

Check if Annotations are valid - no duplicates and correct Category.

check_bbox(update_document: bool = False) -> bool

Check match between the text in the Document and its bounding boxes.

Check that every character in the text is part of the bbox dictionary and that each bbox has a match in the document text. Also, validate the coordinates of the bounding boxes.

delete()

Delete all local information for the document.

property document_folder

Get the path to the folder where all the Document information is cached locally.

download_document_details()

Retrieve data from a Document online in case the Document has finished processing.

eval_dict(use_correct=False) -> List[dict]

Use this dict to evaluate Documents. The speciality: For every Span of an Annotation create one entry.

evaluate_regex(regex, label: Label, annotations: Optional[List[Annotation]] = None, filtered_group=None)

Evaluate a regex based on the Document.

property file_path

Return path to file.

get_annotation_set_by_id(id_: int) -> AnnotationSet

Return an Annotation Set by ID.

Parameters:

id_ – ID of the Annotation Set to get.

get_annotations() -> List[Annotation]

Get Annotations of the Document.

get_bbox()

Get bbox information per character of file. We don’t store bbox as an attribute to save memory.

Returns:

Bounding box information per character in the document.

get_file(ocr_version: bool = True, update: bool = False)

Get OCR version of the original file.

Parameters:
  • ocr_version – Bool to get the ocr version of the original file

  • update – Update the downloaded file even if it is already available

Returns:

Path to the selected file.

get_images(update: bool = False)

Get Document pages as png images.

Parameters:

update – Update the downloaded images even if they are already available

Returns:

Path to OCR file.

get_text_in_bio_scheme(update=False) -> List[Tuple[str, str]]

Get the text of the Document in the BIO scheme.

Parameters:

update – Update the bio annotations even if they are already available

Returns:

list of tuples with each word in the text and the respective label

property hocr

Get HOCR of Document. Once loaded stored in memory.

property is_online: Optional[int]

Define if the Document is saved to the server.

property no_label_annotation_set: AnnotationSet

Return the Annotation Set for project.no_label Annotations.

We need to load the Annotation Sets from Server first (call self.annotation_sets()). If we create the no_label_annotation_set in the first place, the data from the Server is not loaded anymore because _annotation_sets will no longer be None.

property number_of_lines: int

Calculate the number of lines.

property number_of_pages: int

Calculate the number of pages.

property ocr_file_path

Return path to OCR PDF file.

property pages

Get Pages of Document. Once loaded stored in memory.

regex(start_offset: int, end_offset: int, search=None, max_findings_per_page=15) -> List[str]

Suggest a list of regex which can be used to get the Span of a document.

property text

Get Document text. Once loaded stored in memory.

update()

Update document information.

Project

class konfuzio_sdk.data.Project(id_: Optional[int], project_folder=None, update=False, **kwargs)

Access the information of a Project.

add_category(category: Category)

Add Category to Project, if it does not exist.

Parameters:

category – Category to add in the Project

add_document(document: Document)

Add Document to Project, if it does not exist.

add_label(label: Label)

Add Label to Project, if it does not exist.

Parameters:

label – Label to add in the Project

add_label_set(label_set: LabelSet)

Add Label Set to Project, if it does not exist.

Parameters:

label_set – Label Set to add in the Project

delete()

Delete the Project folder.

property documents

Return Documents with status training.

property documents_folder: str

Calculate the documents folder of the Project.

property excluded_documents

Return Documents which have been excluded.

get(update=False)

Access meta information of the Project.

Parameters:

update – Update the downloaded information even if it is already available

get_categories()

Load Categories for all Label Sets in the Project.

get_category_by_id(id_: int) -> Category

Return a Category by ID.

Parameters:

id_ – ID of the Category to get.

get_document_by_id(document_id: int) -> Document

Return Document by its ID.

get_label_by_id(id_: int) -> Label

Return a Label by ID.

Parameters:

id_ – ID of the Label to get.

get_label_by_name(name: str) -> Label

Return Label by its name.

get_label_set_by_id(id_: int) -> LabelSet

Return a Label Set by ID.

Parameters:

id_ – ID of the Label Set to get.

get_label_set_by_name(name: str) -> LabelSet

Return a Label Set by its name.

get_label_sets()

Get Label Sets in the Project.

get_labels()

Get ID and name of any Label in the Project.

get_meta()

Get the list of all Documents in the Project and their information.

Returns:

Information of the Documents in the Project.

init_or_update_document()

Initialize Document to then decide about full, incremental or no update.

lose_weight()

Delete data of the instance.

property model_folder: str

Calculate the model folder of the Project.

property no_label: Label

Get the “No Label” which is used by a Tokenizer.

property no_label_set: Label

Get the “No Label Set” which is used by a Tokenizer.

property no_status_documents

Return Documents with no status.

property preparation_documents

Return Documents with status preparation.

property project_folder: str

Calculate the data folder of the Project.

property regex_folder: str

Calculate the regex folder of the Project.

property test_documents

Return Documents with status test.

property virtual_documents

Return Documents created virtually.

write_project_files()

Overwrite files with Project, Label, Label Set information.

Utils

Utils for the konfuzio sdk package.

konfuzio_sdk.utils.amend_file_name(file_name: str, append_text: str = '', new_extension: Optional[str] = None) -> str

Append text to a filename in front of extension.

example found here: https://stackoverflow.com/a/37487898

Parameters:
  • new_extension – Change the file extension

  • file_name – Name of a file, e.g. file.pdf

  • append_text – Text you want to append between file name and extension

Returns:

extended path to file
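
The behaviour described above can be approximated with os.path.splitext. This is a minimal sketch under the assumption that the function only splits off the extension and reassembles the name; the helper name amend_file_name_sketch is hypothetical, and the real function may additionally sanitize the file name.

```python
import os
from typing import Optional


def amend_file_name_sketch(file_name: str, append_text: str = "", new_extension: Optional[str] = None) -> str:
    """Insert append_text between the base name and the extension."""
    base, extension = os.path.splitext(file_name)
    if new_extension is not None:
        # Replace the original extension, e.g. ".pdf" -> ".txt".
        extension = new_extension
    return f"{base}{append_text}{extension}"


ocr_name = amend_file_name_sketch("file.pdf", append_text="_ocr")
text_name = amend_file_name_sketch("file.pdf", append_text="_ocr", new_extension=".txt")
```

Here ocr_name is "file_ocr.pdf" and text_name is "file_ocr.txt".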

konfuzio_sdk.utils.amend_file_path(file_path: str, append_text: str = '', new_extension: Optional[str] = None)

Similar to amend_file_name however the file_name is interpreted as a full path.

Parameters:
  • new_extension – Change the file extension

  • file_path – Name of a file, e.g. file.pdf

  • append_text – Text you want to append between file name and extension

Returns:

extended path to file

konfuzio_sdk.utils.convert_to_bio_scheme(text: str, annotations: List) -> List[Tuple[str, str]]

Mark all the entities in the text as per the BIO scheme.

The text is split into a sequence of words, treating some characters like “.” as a separate token.

Hello O
, O
it O
's O
Konfuzio B-ORG
. O

The start and end offsets are considered having the origin in the beginning of the input text. If only part of the text of the Document is passed, the start and end offsets of the Annotations must be adapted first.

Parameters:
  • text – text to be annotated in the bio scheme

  • annotations – annotations in the Document with start and end offset and Label name

Returns:

list of tuples with each word in the text and the respective Label
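
To make the BIO scheme concrete, here is a small sketch that tags tokens from character offsets. It is an illustration only: the regex tokenizer is a simplified stand-in for whatever tokenizer the SDK uses, and representing annotations as (start_offset, end_offset, label_name) tuples is an assumption made for the example.

```python
import re
from typing import List, Tuple

# Simplified tokenizer: clitics such as 's, words, and single punctuation marks.
TOKEN_PATTERN = re.compile(r"'\w+|\w+|[^\w\s]")


def to_bio_sketch(text: str, annotations: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Tag each token with B-<label>, I-<label> or O based on annotation offsets."""
    tagged = []
    previous_annotation = None
    for match in TOKEN_PATTERN.finditer(text):
        tag = "O"
        current_annotation = None
        for annotation in annotations:
            start, end, label = annotation
            if start <= match.start() and match.end() <= end:
                # First token of an annotation gets B-, following tokens get I-.
                prefix = "I" if annotation == previous_annotation else "B"
                tag = f"{prefix}-{label}"
                current_annotation = annotation
                break
        tagged.append((match.group(), tag))
        previous_annotation = current_annotation
    return tagged


text = "Hello, it's Konfuzio."
start = text.index("Konfuzio")
bio = to_bio_sketch(text, [(start, start + len("Konfuzio"), "ORG")])
```

For this input, bio reproduces the example above: every token is tagged O except Konfuzio, which is tagged B-ORG.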

konfuzio_sdk.utils.does_not_raise()

Serve as a complement to raises: a no-op context manager, does_not_raise.

docs.pytest.org/en/latest/example/parametrize.html#parametrizing-conditional-raising
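
A no-op context manager of this kind can be written with contextlib. The sketch below follows the general pattern and is not copied from the SDK; the name does_not_raise_sketch is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def does_not_raise_sketch():
    """No-op context manager: the body runs and any exception propagates unchanged."""
    yield


# The body executes normally inside the context manager.
with does_not_raise_sketch():
    result = 10 / 2

# An exception raised in the body is not swallowed.
try:
    with does_not_raise_sketch():
        1 / 0
    raised = False
except ZeroDivisionError:
    raised = True
```

In a parametrized test, such a helper lets the "no error expected" cases share the same with-statement as the cases that expect an exception.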

konfuzio_sdk.utils.get_bbox(bbox, start_offset: int, end_offset: int) -> Dict

Get single bbox for offset_string.

Given a bbox (a dictionary containing a bbox for every character in a document) and a start/end_offset into that document, create a new bbox which covers every character bbox between the given start and end offset.

Pages are zero indexed, i.e. the first page has page_number = 0.
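
The merge described above amounts to taking the minimum and maximum coordinates over the character boxes in the offset range. The following sketch assumes that the bbox dictionary is keyed by the character offset as a string and that each entry has the keys x0, x1, y0, y1 and page_number; these key names and the helper merge_char_bboxes are assumptions for illustration, not the SDK's documented format.

```python
from typing import Dict


def merge_char_bboxes(bbox: Dict[str, dict], start_offset: int, end_offset: int) -> Dict[str, float]:
    """Return the smallest box covering every character box in [start_offset, end_offset)."""
    # Offsets without an entry (e.g. whitespace) are skipped.
    boxes = [bbox[str(i)] for i in range(start_offset, end_offset) if str(i) in bbox]
    if not boxes:
        raise ValueError("no character bounding boxes in the given offset range")
    return {
        "x0": min(b["x0"] for b in boxes),
        "x1": max(b["x1"] for b in boxes),
        "y0": min(b["y0"] for b in boxes),
        "y1": max(b["y1"] for b in boxes),
        "page_number": boxes[0]["page_number"],
    }


char_bboxes = {
    "0": {"x0": 10.0, "x1": 15.0, "y0": 100.0, "y1": 110.0, "page_number": 0},
    "1": {"x0": 15.0, "x1": 21.0, "y0": 99.0, "y1": 110.0, "page_number": 0},
    "3": {"x0": 25.0, "x1": 30.0, "y0": 100.0, "y1": 112.0, "page_number": 0},
}
merged = merge_char_bboxes(char_bboxes, start_offset=0, end_offset=4)
```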

konfuzio_sdk.utils.get_file_type(input_file: Optional[Union[str, BytesIO, bytes]] = None) -> str

Get the type of a file.

Parameters:

input_file – Path to the file or file in bytes format

Returns:

Name of file type

konfuzio_sdk.utils.get_file_type_and_extension(input_file: Optional[Union[str, BytesIO, bytes]] = None) -> Tuple[str, str]

Get the type of a file via the filetype library, which checks the magic bytes to see the internal format.

Parameters:

input_file – Path to the file or file in bytes format

Returns:

Name of file type and file extension

konfuzio_sdk.utils.get_id(a_string, include_time: bool = False) -> int

Generate a unique ID.

Parameters:
  • a_string – String used to generate the unique ID

  • include_time – Bool to include the time in the unique ID

Returns:

Unique ID

konfuzio_sdk.utils.get_missing_offsets(start_offset: int, end_offset: int, annotated_offsets: List[range])

Calculate the missing characters.

Parameters:
  • start_offset (int) – Start of the overall text as index

  • end_offset – End of the overall text as index

  • annotated_offsets – A list of ranges, where each integer represents one character. A range may lie outside the start and end offset.

Todo

How do we handle tokens that are smaller / larger than the correct Annotations? See link

>>> get_missing_offsets(start_offset=0, end_offset=170, annotated_offsets=[range(66, 78), range(159, 169)])
[range(0, 66), range(78, 159), range(169, 170)]
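
One way to compute such gaps is a single sweep over the sorted ranges. This is a sketch that reproduces the doctest above, not the SDK's implementation; the function name missing_offsets_sketch is hypothetical, and clipping ranges that lie outside the start and end offset is an assumption about the intended behaviour.

```python
from typing import List


def missing_offsets_sketch(start_offset: int, end_offset: int, annotated_offsets: List[range]) -> List[range]:
    """Return the ranges in [start_offset, end_offset) not covered by annotated_offsets."""
    missing = []
    cursor = start_offset
    for annotated in sorted(annotated_offsets, key=lambda r: r.start):
        # Clip each annotated range to the window of interest.
        low = max(annotated.start, start_offset)
        high = min(annotated.stop, end_offset)
        if high <= low:
            continue  # empty or entirely outside the window
        if low > cursor:
            missing.append(range(cursor, low))
        cursor = max(cursor, high)
    if cursor < end_offset:
        missing.append(range(cursor, end_offset))
    return missing


gaps = missing_offsets_sketch(0, 170, [range(66, 78), range(159, 169)])
```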

konfuzio_sdk.utils.get_sentences(text: str, offsets_map: Optional[dict] = None, language: str = 'german') -> List[dict]

Split a text into sentences using the sentence tokenizer from the package nltk.

Parameters:
  • text – Text to split into sentences

  • offsets_map – Mapping between the position of the character in the input text and the offset in the text of the document

  • language – Language of the text

Returns:

List with a dict per sentence with its text and its start and end offsets in the text of the document.

konfuzio_sdk.utils.get_timestamp(konfuzio_format='%Y-%m-%d-%H-%M-%S') -> str

Return formatted timestamp.

Parameters:

konfuzio_format – Format of the timestamp (e.g. year-month-day-hour-min-sec)

Returns:

Timestamp

konfuzio_sdk.utils.is_file(file_path, raise_exception=True, maximum_size=100000000, allow_empty=False) -> bool

Check if file is available or raise error if it does not exist.

Parameters:
  • file_path – Path to the file to be checked

  • raise_exception – Will raise an exception if file is not available

  • maximum_size – Maximum size of the expected file in bytes, default limit 100 MB

  • allow_empty – Bool to allow empty files

Returns:

True or false depending on the existence of the file

konfuzio_sdk.utils.iter_before_and_after(iterable, before=1, after=None, fill=None)

Iterate and provide before and after element. Generalized from http://stackoverflow.com/a/1012089.
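
The idea can be sketched by padding the sequence on both sides and sliding a window over it. This is an illustrative version, not the SDK's code: it materializes the iterable into a list for simplicity, the name sliding_context is hypothetical, and treating a missing after as equal to before is an assumption.

```python
from typing import Any, Iterable, Iterator, Optional, Tuple


def sliding_context(
    iterable: Iterable[Any], before: int = 1, after: Optional[int] = None, fill: Any = None
) -> Iterator[Tuple[Any, ...]]:
    """Yield (before..., current, after...) tuples, padding the edges with fill."""
    if after is None:
        after = before  # assumption: a missing `after` mirrors `before`
    items = list(iterable)
    padded = [fill] * before + items + [fill] * after
    window = before + 1 + after
    for index in range(len(items)):
        yield tuple(padded[index:index + window])


triples = list(sliding_context([1, 2, 3]))
# triples == [(None, 1, 2), (1, 2, 3), (2, 3, None)]
```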

konfuzio_sdk.utils.load_image(input_file: Union[str, BytesIO])

Load an image by path or via io.BytesIO, e.g. via download by URL.

Parameters:

input_file – Path to image or image in bytes format

Returns:

Loaded image

konfuzio_sdk.utils.map_offsets(characters_bboxes: list) -> dict

Map the position of the character to its offset.

E.g.:

characters: x, y, z, w
characters offsets: 2, 3, 20, 22

The first character (x) has the offset 2. The fourth character (w) has the offset 22. …

offsets_map: {0: 2, 1: 3, 2: 20, 3: 22}

Parameters:

characters_bboxes – Bounding boxes information of the characters.

Returns:

Mapping of the positions of the characters to their offsets.
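
The x, y, z, w example above can be reproduced with a dictionary comprehension. In this sketch each character box is assumed to carry its document offset under a hypothetical key string_offset, and the list is assumed to be already in reading order; the actual structure of characters_bboxes may differ.

```python
from typing import Dict, List


def map_offsets_sketch(characters_bboxes: List[dict]) -> Dict[int, int]:
    """Map each character's position in the list to its offset in the document text."""
    return {position: box["string_offset"] for position, box in enumerate(characters_bboxes)}


boxes = [
    {"text": "x", "string_offset": 2},
    {"text": "y", "string_offset": 3},
    {"text": "z", "string_offset": 20},
    {"text": "w", "string_offset": 22},
]
offsets_map = map_offsets_sketch(boxes)
# offsets_map == {0: 2, 1: 3, 2: 20, 3: 22}
```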

konfuzio_sdk.utils.sdk_isinstance(instance, klass)

Implement a custom isinstance which is compatible with cloudpickle saving by value.

When using cloudpickle with “register_pickle_by_value”, the classes of “konfuzio_sdk.data” are loaded in the “types” module. In this case the builtin “isinstance” returns False because it compares “types.Document” with “konfuzio_sdk.data.Document”.
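
The problem and one possible workaround can be demonstrated without cloudpickle. The sketch below simulates a reloaded class with type() and compares classes by name; this name-based check is an assumption about how such a helper could work, not a statement of how sdk_isinstance is implemented.

```python
def name_based_isinstance(instance, klass) -> bool:
    """Compare by class name along the MRO instead of by class identity."""
    return klass.__name__ in {base.__name__ for base in type(instance).__mro__}


class Document:
    """Stand-in for konfuzio_sdk.data.Document."""


# Simulate a by-value reload: a distinct class object that has the same name.
ReloadedDocument = type("Document", (), {})
reloaded_instance = ReloadedDocument()

builtin_result = isinstance(reloaded_instance, Document)
name_result = name_based_isinstance(reloaded_instance, Document)
```

Here builtin_result is False because the two classes are different objects, while name_result is True. The trade-off is that unrelated classes sharing a name are treated as equal.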

konfuzio_sdk.utils.slugify(value)

Taken from https://github.com/django/django/blob/master/django/utils/text.py.

Convert to ASCII if ‘allow_unicode’ is False. Convert spaces or repeated dashes to single dashes. Remove characters that aren’t alphanumerics, underscores, or hyphens. Convert to lowercase. Also strip leading and trailing whitespace, dashes, and underscores.
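
The steps described above can be sketched as follows, modelled on Django's slugify. The allow_unicode parameter comes from the Django version; the SDK function shown here takes only value, so the sketch is an approximation rather than the SDK's exact code.

```python
import re
import unicodedata


def slugify_sketch(value, allow_unicode: bool = False) -> str:
    """Normalize a string into a lowercase, dash-separated slug."""
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize("NFKC", value)
    else:
        # Decompose accented characters and drop everything that is not ASCII.
        value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Remove characters that are not alphanumerics, underscores, whitespace or hyphens.
    value = re.sub(r"[^\w\s-]", "", value.lower())
    # Collapse whitespace and repeated dashes, then strip leading/trailing dashes and underscores.
    return re.sub(r"[-\s]+", "-", value).strip("-_")


slug = slugify_sketch("  Hello, World!  ")
# slug == "hello-world"
```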