{ "cells": [ { "cell_type": "markdown", "id": "56606969", "metadata": {}, "source": [ "## Find possible outliers among the ground-truth Annotations\n", "\n", "---\n", "\n", "**Prerequisites:**\n", "\n", "- Data Layer concepts of Konfuzio: Annotation, Label, Document, Project\n", "- Regular expressions\n", "\n", "**Difficulty:** Medium\n", "\n", "**Goal:** Learn how to spot potentially wrong Annotations of a particular Label after your Documents have been processed via Information Extraction.\n", "\n", "---\n", "\n", "### Environment\n", "You need to install the Konfuzio SDK before diving into the tutorial. \\\n", "To get up and running quickly, you can use our Colab Quick Start notebook.\n", "\n", "As an alternative, you can follow the [installation section](../get_started.html#install-sdk) to install and initialize the Konfuzio SDK locally or on an environment of your choice.\n", "\n", "### Introduction\n", "\n", "If you want to ensure that Annotations of a Label are consistent and check for possible outliers, you can use one of the `Label` class's methods. There are three of them available.\n", "\n", "#### Outliers by regex\n", "\n", "`Label.get_probable_outliers_by_regex` looks for the \"worst\" regexes used to find the Annotations under a given Label. \"Worst\" is determined by\n", "the number of True Positives (correctly extracted Annotations) calculated when evaluating the regexes' performance. The method returns Annotations predicted by the regexes with the fewest True Positives. By default, the method returns Annotations retrieved by regexes that yield at most 10% as many True Positives as the best-performing regex." ] }, { "cell_type": "code", "execution_count": null, "id": "7d49c285", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import logging\n", "\n", "logging.getLogger(\"konfuzio_sdk\").setLevel(logging.ERROR)\n", "YOUR_PROJECT_ID = 46\n", "YOUR_LABEL_NAME = 'Bank inkl. IBAN'\n", "TOP_WORST = 1.0" ] }, { "cell_type": "markdown", "id": "236ee62e", "metadata": {}, "source": [ "Initialize the Project, select the Label you want to assess and run the method, passing all Categories that refer to the Label Set of the given Label as input. TOP_WORST is the threshold determining what percentage of the worst-performing regexes' output to return; it can also be modified manually and defaults to 0.1." ] }, { "cell_type": "code", "execution_count": null, "id": "336623db", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project\n", "\n", "project = Project(id_=YOUR_PROJECT_ID, update=True)\n", "label = project.get_label_by_name(YOUR_LABEL_NAME)\n", "outliers = label.get_probable_outliers_by_regex(project.categories, top_worst_percentage=TOP_WORST)\n", "print([annotation.offset_string for annotation in outliers])" ] }, { "cell_type": "markdown", "id": "952a0f2b", "metadata": {}, "source": [ "#### Outliers by confidence\n", "\n", "`Label.get_probable_outliers_by_confidence` looks for the Annotations with the lowest confidence scores, provided they are lower\n", "than the specified threshold (the default threshold is 0.5). The method accepts an instance of the `ExtractionEvaluation` class as input and uses the confidence predictions from there." ] }, { "cell_type": "code", "execution_count": null, "id": "ff6956de", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "YOUR_LABEL_NAME = 'Austellungsdatum'" ] }, { "cell_type": "markdown", "id": "12a82bbe", "metadata": {}, "source": [ "Initialize the Project and select the Label you want to assess."
] }, { "cell_type": "code", "execution_count": null, "id": "a55567e7", "metadata": { "editable": true, "slideshow": { "slide_type": "" } }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project\n", "\n", "project = Project(id_=YOUR_PROJECT_ID)\n", "label = project.get_label_by_name(YOUR_LABEL_NAME)" ] }, { "cell_type": "code", "execution_count": null, "id": "e6aca4a7", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "from konfuzio_sdk.trainer.information_extraction import RFExtractionAI\n", "from konfuzio_sdk.tokenizer.base import ListTokenizer\n", "from konfuzio_sdk.tokenizer.regex import RegexTokenizer\n", "\n", "# Assemble a minimal extraction pipeline for demonstration purposes\n", "pipeline = RFExtractionAI()\n", "pipeline.tokenizer = ListTokenizer(tokenizers=[])\n", "pipeline.category = label.project.get_category_by_id(id_=63)\n", "train_doc_ids = {44823}\n", "project.get_document_by_id(44823).get_bbox()\n", "pipeline.documents = [doc for doc in pipeline.category.documents() if doc.id_ in train_doc_ids]\n", "# Build the tokenizer from the regexes of every Label in the Category\n", "for cur_label in pipeline.category.labels:\n", "    for regex in cur_label.find_regex(category=pipeline.category):\n", "        pipeline.tokenizer.tokenizers.append(RegexTokenizer(regex=regex))\n", "pipeline.test_documents = pipeline.category.test_documents()\n", "pipeline.df_train, pipeline.label_feature_list = pipeline.feature_function(\n", "    documents=pipeline.documents, require_revised_annotations=False\n", ")\n", "pipeline.fit()\n", "# Run extraction on the training Documents to obtain their predicted counterparts\n", "predictions = []\n", "for doc in pipeline.documents:\n", "    predicted_doc = pipeline.extract(document=doc)\n", "    predictions.append(predicted_doc)\n", "GROUND_TRUTHS = pipeline.documents\n", "PREDICTIONS = predictions" ] }, { "cell_type": "markdown", "id": "2b63daef", "metadata": {}, "source": [ "Pass a list of ground-truth Documents and a list of their processed counterparts into the `ExtractionEvaluation` class, then use `get_probable_outliers_by_confidence` with the evaluation results as input."
] }, { "cell_type": "code", "execution_count": null, "id": "9c9eca59", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "CONFIDENCE = 0.9\n", "GROUND_TRUTH_DOCUMENTS = pipeline.documents\n", "PREDICTED_DOCUMENTS = predictions" ] }, { "cell_type": "code", "execution_count": null, "id": "43496355", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-output" ] }, "outputs": [], "source": [ "from konfuzio_sdk.evaluate import ExtractionEvaluation\n", "\n", "evaluation = ExtractionEvaluation(documents=list(zip(GROUND_TRUTH_DOCUMENTS, PREDICTED_DOCUMENTS)), strict=False)\n", "outliers = label.get_probable_outliers_by_confidence(evaluation, confidence=CONFIDENCE)\n", "print([annotation.offset_string for annotation in outliers])" ] }, { "cell_type": "markdown", "id": "e42466b8", "metadata": {}, "source": [ "#### Outliers by normalization\n", "\n", "`Label.get_probable_outliers_by_normalization` looks for the Annotations that fail normalization by the data\n", "type of the given Label, meaning that their values do not match that data type and are therefore likely outliers. For instance, if a Label with the data type \"Date\" is assigned to the line \"Example st. 1\", that Annotation will be returned by this method, because the line does not qualify as a date.\n", "\n", "Initialize the Project and the Label you want to assess, then run `get_probable_outliers_by_normalization`, passing all Categories that refer to the Label Set of the given Label as input.\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "c43dcdfe", "metadata": { "editable": true, "slideshow": { "slide_type": "" } }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project\n", "\n", "project = Project(id_=YOUR_PROJECT_ID)\n", "label = project.get_label_by_name(YOUR_LABEL_NAME)\n", "outliers = label.get_probable_outliers_by_normalization(project.categories)\n", "print([annotation.offset_string for annotation in outliers])" ] }, { "cell_type": "markdown", "id": "07396e1f", "metadata": {}, "source": [ "### Conclusion\n", "In this tutorial, we have walked through the essential steps for finding potential outliers among the Annotations. Below is the full code to accomplish this task.\n", "Note that you need to replace the placeholders with the respective values for the tutorial to run."
] }, { "cell_type": "code", "execution_count": null, "id": "bac8f825", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "skip-execution", "nbval-skip" ], "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project\n", "from konfuzio_sdk.evaluate import ExtractionEvaluation\n", "\n", "project = Project(id_=YOUR_PROJECT_ID, strict_data_validation=False)\n", "\n", "label = project.get_label_by_name(YOUR_LABEL_NAME)\n", "\n", "# get outliers by regex\n", "outliers = label.get_probable_outliers_by_regex(project.categories, top_worst_percentage=TOP_WORST)\n", "\n", "# get outliers by confidence\n", "evaluation = ExtractionEvaluation(documents=list(zip(GROUND_TRUTH_DOCUMENTS, PREDICTED_DOCUMENTS)), strict=False)\n", "outliers = label.get_probable_outliers_by_confidence(evaluation, confidence=CONFIDENCE)\n", "\n", "# get outliers by normalization\n", "outliers = label.get_probable_outliers_by_normalization(project.categories)" ] }, { "cell_type": "markdown", "id": "c058953d", "metadata": {}, "source": [ "### What's next?\n", "\n", "- [Learn how to create regex-based Annotations](https://dev.konfuzio.com/sdk/tutorials/regex_based_annotations/index.html)\n", "- [Get to know how to create a custom Extraction AI](https://dev.konfuzio.com/sdk/tutorials/information_extraction/index.html#train-a-custom-date-extraction-ai)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }