Find possible outliers among the ground-truth Annotations
Prerequisites:
Data Layer concepts of Konfuzio: Annotation, Label, Document, Project
Regular expressions
Difficulty: Medium
Goal: Learn how to spot potentially wrong Annotations of a particular Label after your Documents have been processed via Information Extraction.
Environment
You need to install the Konfuzio SDK before diving into the tutorial.
To get up and running quickly, you can use our Colab Quick Start notebook.
Alternatively, you can follow the installation section to install and initialize the Konfuzio SDK locally or in an environment of your choice.
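For a local setup, the usual route (assuming pip and the konfuzio_sdk CLI described in the installation section) is:

pip install konfuzio_sdk
konfuzio_sdk init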
Introduction
If you want to ensure that the Annotations of a Label are consistent and check for possible outliers, you can use one of the Label class’s methods. Three such methods are available.
Outliers by regex
Label.get_probable_outliers_by_regex
looks for the “worst” regexes used to find the Annotations under a given Label. “Worst” is determined by the number of True Positives (correctly extracted Annotations) calculated when evaluating the regexes’ performance. The method returns the Annotations predicted by the regexes with the fewest True Positives. By default, it returns Annotations retrieved by regexes that perform at 10% or less of the best regex’s True Positive count.
Initialize the Project, select the Label you want to assess and run the method, passing all Categories that refer to the Label Set of the given Label as an input. TOP_WORST is the threshold that determines how poorly a regex must perform, relative to the best one, for its Annotations to be returned; it defaults to 0.1 and can be adjusted manually.
from konfuzio_sdk.data import Project
project = Project(id_=YOUR_PROJECT_ID, update=True)
label = project.get_label_by_name(YOUR_LABEL_NAME)
outliers = label.get_probable_outliers_by_regex(project.categories, top_worst_percentage=TOP_WORST)
print([annotation.offset_string for annotation in outliers])
[['Deutsche Pfandbriefbank', 'DE47 7001 0500 0000 2XxXX XX'], ['Deutsche Bank PGK Nürnbe', 'DE73 7607 0024 0568 9745 11'], ['Deutsche Bank PGK Nürnbe', 'DE34 7607 0024 0569 8745 00'], ['Deutsche Bank PGK Nürnbe', 'DE73 7607 0024 0568 9745 11'], ['Deutsche Bank PGK Nürnbe', 'DE73 7607 0024 0568 9745 11'], ['Deutsche Bank PGK Nürnbe', 'DE33 7607 0024 0012 3456 78'], ['Deutsche Bank PGK Nürnbe', 'DE33 7607 0024 0012 3456 78'], ['UniCredit Bank-HypoVerei', 'DE10 7602 0070 0087 6543 21'], ['Sparkasse Nürnberg', 'DEO2 7605 0101 0010 2030 40'], ['LBS West Münster', 'DE12 4005 5555 1234 5XXX XX'], ['PSD Bank Nürnberg', 'DE38 7609 0900 0001 2XXX XX'], ['Commerzbank Nürnberg', 'DE94 7604 0061 0524 3712 00'], ['Deutsche Bank PGK Nürnbe', 'DE34 7607 0024 0569 8745 00'], ['Deutsche Bank PGK Nürnbe', 'DE33 7607 0024 0012 3456 78'], ['UniCredit Bank-HypoVerei', 'DE10 7602 0070 0087 6543 21'], ['Deutsche Bank PGK Nürnbe', 'DE33 7607 0024 0012 3456 78']]
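The flagged Annotations are candidates rather than guaranteed errors, so review each one in its Document before correcting or removing it. A minimal sketch of such a review loop, reusing the outliers list from above:

for annotation in outliers:
    # Every Annotation keeps a reference to the Document it belongs to,
    # which helps to locate and review it in context.
    print(annotation.document, annotation.offset_string)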
Outliers by confidence
Label.get_probable_outliers_by_confidence
looks for Annotations whose confidence is below a specified threshold (0.5 by default). The method accepts an instance of the ExtractionEvaluation class as input and uses the confidence predictions from it.
Initialize the Project and select the Label you want to assess.
from konfuzio_sdk.data import Project
project = Project(id_=YOUR_PROJECT_ID)
label = project.get_label_by_name(YOUR_LABEL_NAME)
Pass a list of ground-truth Documents zipped with their processed counterparts into the ExtractionEvaluation
class, then call get_probable_outliers_by_confidence
with the evaluation results as the input.
from konfuzio_sdk.evaluate import ExtractionEvaluation
evaluation = ExtractionEvaluation(documents=list(zip(GROUND_TRUTH_DOCUMENTS, PREDICTED_DOCUMENTS)), strict=False)
outliers = label.get_probable_outliers_by_confidence(evaluation, confidence=CONFIDENCE)
print([annotation.offset_string for annotation in outliers])
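GROUND_TRUTH_DOCUMENTS and PREDICTED_DOCUMENTS are placeholders: the former are Documents with human-revised Annotations, the latter are the same Documents after Information Extraction. One possible way to obtain the predicted counterparts, assuming a previously trained and saved extraction AI (MODEL_PATH is a hypothetical placeholder) and that your SDK version exposes load_model:

from konfuzio_sdk.trainer.information_extraction import load_model
# Load a previously trained extraction AI; MODEL_PATH is a placeholder for your saved model file.
pipeline = load_model(MODEL_PATH)
# Run extraction on each ground-truth Document to obtain its predicted counterpart.
PREDICTED_DOCUMENTS = [pipeline.extract(document) for document in GROUND_TRUTH_DOCUMENTS]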
Outliers by normalization
Label.get_probable_outliers_by_normalization
looks for Annotations that fail normalization to the data type of the given Label, meaning their values do not conform to that data type and are therefore likely outliers. For instance, if a Label with the data type “Date” is assigned to the line “Example st. 1”, that Annotation will be returned by this method, because the line does not qualify as a date.
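To illustrate what failing normalization means, here is a plain-Python sketch of the idea (not the SDK’s internal implementation, which supports many formats per data type):

from datetime import datetime

def fails_date_normalization(offset_string: str, fmt: str = "%d.%m.%Y") -> bool:
    """Return True if the string cannot be parsed as a date and is thus a probable outlier."""
    try:
        datetime.strptime(offset_string, fmt)
        return False
    except ValueError:
        return True

print(fails_date_normalization("01.01.2023"))     # False: a valid date
print(fails_date_normalization("Example st. 1"))  # True: not a date, probable outlier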
Initialize the Project and select the Label you want to assess, then run get_probable_outliers_by_normalization,
passing all Categories that refer to the Label Set of the given Label as an input.
from konfuzio_sdk.data import Project
project = Project(id_=YOUR_PROJECT_ID)
label = project.get_label_by_name(YOUR_LABEL_NAME)
outliers = label.get_probable_outliers_by_normalization(project.categories)
print([annotation.offset_string for annotation in outliers])
[]
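An empty list, as in this example, means that every Annotation under the Label conforms to its data type.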
Conclusion
In this tutorial, we have walked through the essential steps for finding potential outliers among the Annotations. Below is the full code to accomplish this task. Note that you need to replace the placeholders with your own values for the code to run.
from konfuzio_sdk.data import Project
from konfuzio_sdk.evaluate import ExtractionEvaluation

project = Project(id_=YOUR_PROJECT_ID, strict_data_validation=False)
label = project.get_label_by_name(YOUR_LABEL_NAME)

# get outliers by regex
outliers_by_regex = label.get_probable_outliers_by_regex(project.categories, top_worst_percentage=TOP_WORST)

# get outliers by confidence
evaluation = ExtractionEvaluation(documents=list(zip(GROUND_TRUTH_DOCUMENTS, PREDICTED_DOCUMENTS)), strict=False)
outliers_by_confidence = label.get_probable_outliers_by_confidence(evaluation, confidence=CONFIDENCE)

# get outliers by normalization
outliers_by_normalization = label.get_probable_outliers_by_normalization(project.categories)
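Each of the three heuristics can flag false positives on its own. A simple way to prioritize the results, sketched here as plain Python rather than an SDK feature, is to focus on Annotations flagged by more than one method:

# Annotations flagged by several independent heuristics are the strongest outlier candidates.
strong_candidates = [
    annotation
    for annotation in outliers_by_normalization
    if annotation in outliers_by_confidence
]
print([annotation.offset_string for annotation in strong_candidates])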