Find possible outliers among the ground-truth Annotations


Prerequisites:

  • Data Layer concepts of Konfuzio: Annotation, Label, Document, Project

  • Regular expressions

Difficulty: Medium

Goal: Learn how to spot potentially wrong Annotations of a particular Label after your Documents have been processed via Information Extraction.


Environment

You need to install the Konfuzio SDK before diving into the tutorial.
To get up and running quickly, you can use our Colab Quick Start notebook.

As an alternative, you can follow the installation section to install and initialize the Konfuzio SDK locally or in an environment of your choice.

Introduction

If you want to ensure that the Annotations of a Label are consistent and check for possible outliers, you can use one of the three methods of the Label class described below.

Outliers by regex

Label.get_probable_outliers_by_regex looks for the “worst” regexes used to find the Annotations under a given Label. “Worst” is determined by the number of True Positives (correctly extracted Annotations) calculated when evaluating the regexes’ performance. The method returns the Annotations predicted by the regexes with the fewest True Positives. By default, it returns Annotations retrieved by regexes that achieve at most 10% of the True Positives of the best-performing regex.

Initialize the Project, select the Label you want to assess and run the method, passing all Categories that refer to the Label Set of the given Label as an input. TOP_WORST is the threshold that determines what share of the worst regexes’ output to return; it can also be set manually and defaults to 0.1.

from konfuzio_sdk.data import Project

# update=True fetches the latest state of the Project from the server
project = Project(id_=YOUR_PROJECT_ID, update=True)
label = project.get_label_by_name(YOUR_LABEL_NAME)
outliers = label.get_probable_outliers_by_regex(project.categories, top_worst_percentage=TOP_WORST)
print([annotation.offset_string for annotation in outliers])
[['Deutsche Pfandbriefbank', 'DE47 7001 0500 0000 2XxXX XX'], ['Deutsche Bank PGK  Nürnbe', 'DE73 7607 0024 0568 9745 11'], ['Deutsche Bank PGK  Nürnbe', 'DE34 7607 0024 0569 8745 00'], ['Deutsche Bank PGK  Nürnbe', 'DE73 7607 0024 0568 9745 11'], ['Deutsche Bank PGK  Nürnbe', 'DE73 7607 0024 0568 9745 11'], ['Deutsche Bank PGK  Nürnbe', 'DE33 7607 0024 0012 3456 78'], ['Deutsche Bank PGK  Nürnbe', 'DE33 7607 0024 0012 3456 78'], ['UniCredit Bank-HypoVerei', 'DE10 7602 0070 0087 6543 21'], ['Sparkasse Nürnberg', 'DEO2 7605 0101 0010 2030 40'], ['LBS West Münster', 'DE12 4005 5555 1234 5XXX XX'], ['PSD Bank Nürnberg', 'DE38 7609 0900 0001 2XXX XX'], ['Commerzbank Nürnberg', 'DE94 7604 0061 0524 3712 00'], ['Deutsche Bank PGK  Nürnbe', 'DE34 7607 0024 0569 8745 00'], ['Deutsche Bank PGK  Nürnbe', 'DE33 7607 0024 0012 3456 78'], ['UniCredit Bank-HypoVerei', 'DE10 7602 0070 0087 6543 21'], ['Deutsche Bank PGK  Nürnbe', 'DE33 7607 0024 0012 3456 78']]
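
To review the flagged Annotations in context, you can group them by the Document they come from. The following is a minimal sketch, assuming the standard Annotation.document and Document.id_ attributes; adapt it to whatever metadata you want to inspect.

from collections import defaultdict

# Group the flagged offset strings by the Document they belong to.
outliers_per_document = defaultdict(list)
for annotation in outliers:
    outliers_per_document[annotation.document.id_].append(annotation.offset_string)

for document_id, offset_strings in outliers_per_document.items():
    print(document_id, offset_strings)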

Outliers by confidence

Label.get_probable_outliers_by_confidence looks for the Annotations with the lowest confidence level, provided it is lower than the specified threshold (the default threshold is 0.5). The method accepts an instance of the ExtractionEvaluation class as an input and uses the confidence predictions from there.

Initialize the Project and select the Label you want to assess.

from konfuzio_sdk.data import Project

project = Project(id_=YOUR_PROJECT_ID)
label = project.get_label_by_name(YOUR_LABEL_NAME)

Pass a list of ground-truth Documents and a list of their processed counterparts into the ExtractionEvaluation class, then run get_probable_outliers_by_confidence with the evaluation results as the input.

from konfuzio_sdk.evaluate import ExtractionEvaluation

evaluation = ExtractionEvaluation(documents=list(zip(GROUND_TRUTH_DOCUMENTS, PREDICTED_DOCUMENTS)), strict=False)
outliers = label.get_probable_outliers_by_confidence(evaluation, confidence=CONFIDENCE)
print([annotation.offset_string for annotation in outliers])
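
To see how far below the threshold each flagged Annotation falls, print the confidence next to the offset string. This is a small usage sketch that assumes the returned Annotations expose a confidence attribute:

# Sort ascending so the least confident predictions appear first.
for annotation in sorted(outliers, key=lambda a: a.confidence):
    print(annotation.confidence, annotation.offset_string)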

Outliers by normalization

Label.get_probable_outliers_by_normalization looks for the Annotations that fail normalization to the data type of the given Label, i.e. their offset strings cannot be interpreted as that data type, which makes them likely outliers. For instance, if a Label with the data type “Date” is assigned to the line “Example st. 1”, this method returns that Annotation, because the line does not qualify as a date.
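
Conceptually, the check resembles the sketch below. This is not the SDK’s implementation, only an illustration of how a failed parse flags a likely outlier for a “Date” Label; the helper name and the list of formats are made up for the example.

from datetime import datetime

def is_probable_date_outlier(offset_string: str) -> bool:
    """Return True if the string parses as none of the common date formats."""
    for date_format in ("%d.%m.%Y", "%Y-%m-%d", "%d/%m/%Y"):
        try:
            datetime.strptime(offset_string.strip(), date_format)
            return False  # the string qualifies as a date
        except ValueError:
            continue
    return True  # no format matched: likely an outlier for a "Date" Label

print(is_probable_date_outlier("Example st. 1"))  # True
print(is_probable_date_outlier("01.02.2023"))     # False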

Initialize the Project, select the Label you want to assess, then run get_probable_outliers_by_normalization, passing all Categories that refer to the Label Set of the given Label as an input.

from konfuzio_sdk.data import Project

project = Project(id_=YOUR_PROJECT_ID)
label = project.get_label_by_name(YOUR_LABEL_NAME)
outliers = label.get_probable_outliers_by_normalization(project.categories)
print([annotation.offset_string for annotation in outliers])
[]

Conclusion

In this tutorial, we have walked through the essential steps for finding potential outliers amongst the Annotations. Below is the full code to accomplish this task. Note that you need to replace the placeholders with your own values for the code to run.

from konfuzio_sdk.data import Project
from konfuzio_sdk.evaluate import ExtractionEvaluation

# strict_data_validation=False relaxes the SDK's data validation checks when loading the Project
project = Project(id_=YOUR_PROJECT_ID, strict_data_validation=False)

label = project.get_label_by_name(YOUR_LABEL_NAME)

# get outliers by regex
outliers = label.get_probable_outliers_by_regex(project.categories, top_worst_percentage=TOP_WORST)

# get outliers by confidence
evaluation = ExtractionEvaluation(documents=list(zip(GROUND_TRUTH_DOCUMENTS, PREDICTED_DOCUMENTS)), strict=False)
outliers = label.get_probable_outliers_by_confidence(evaluation, confidence=CONFIDENCE)

# get outliers by normalization
outliers = label.get_probable_outliers_by_normalization(project.categories)
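
Each method can flag false positives on its own, so a practical next step is to look at the overlap of the three result sets: Annotations flagged by more than one method are the strongest outlier candidates. A minimal sketch, assuming each Annotation carries a unique id_:

outliers_by_regex = label.get_probable_outliers_by_regex(project.categories, top_worst_percentage=TOP_WORST)
outliers_by_confidence = label.get_probable_outliers_by_confidence(evaluation, confidence=CONFIDENCE)
outliers_by_normalization = label.get_probable_outliers_by_normalization(project.categories)

regex_ids = {annotation.id_ for annotation in outliers_by_regex}
confidence_ids = {annotation.id_ for annotation in outliers_by_confidence}
normalization_ids = {annotation.id_ for annotation in outliers_by_normalization}

# Annotations flagged by all three methods are the most likely true outliers.
print(regex_ids & confidence_ids & normalization_ids)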

What’s next?