Merging Logic
Our Extraction AI applies merging logic at two steps of the extraction process. The first is a horizontal merge of Spans right after the Label classifier. This is particularly useful with the Whitespace tokenizer, because the merge allows the AI to find Spans that contain spaces. The second is a vertical merge of Spans into a single multiline Annotation. Check out the architecture diagram for more detail.
Horizontal Merge
When using an Extraction AI, we merge horizontally adjacent Spans right after the Label classifier.
A horizontal merge is valid only if:
All Spans have the same predicted Label
The confidence of the predicted Label is above the Label threshold
All Spans are on the same line
There are no extraneous characters between the Spans
There are at most 5 spaces between the Spans
The Label type is not one of ‘Number’, ‘Positive Number’, ‘Percentage’, or ‘Date’, OR the merge produces a Span that can be normalized to that same type
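The conditions above can be sketched as a small greedy pass over the Spans of a line. This is a minimal illustration and not the SDK implementation: the `Span` dataclass, `can_merge`, `merge_horizontal`, and the `is_normalizable` callback are hypothetical names introduced here, and the choice to keep the lowest confidence for a merged Span is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_GAP_SPACES = 5  # "a maximum of 5 spaces in between Spans"
NORMALIZED_TYPES = {"Number", "Positive Number", "Percentage", "Date"}


@dataclass
class Span:
    """Hypothetical stand-in for a predicted Span (not the SDK class)."""
    start: int              # character offset where the Span starts
    end: int                # character offset where the Span ends (exclusive)
    line: int               # index of the line the Span is on
    label: Optional[str]    # predicted Label, None for NO LABEL
    label_type: str         # e.g. "Text", "Number", "Date"
    confidence: float


def can_merge(text: str, left: Span, right: Span, threshold: float,
              is_normalizable: Callable[[str, str], bool]) -> bool:
    # Same predicted Label (and there must be a Label at all).
    if left.label is None or left.label != right.label:
        return False
    # Confidence of both Spans must be above the Label threshold.
    if min(left.confidence, right.confidence) <= threshold:
        return False
    # Both Spans must be on the same line.
    if left.line != right.line:
        return False
    # Only spaces between the Spans, and at most MAX_GAP_SPACES of them.
    gap = text[left.end:right.start]
    if gap.strip(" ") != "" or len(gap) > MAX_GAP_SPACES:
        return False
    # Normalized Label types merge only if the merged text still normalizes.
    if left.label_type in NORMALIZED_TYPES:
        return is_normalizable(text[left.start:right.end], left.label_type)
    return True


def merge_horizontal(text: str, spans: List[Span], threshold: float,
                     is_normalizable: Callable[[str, str], bool] = lambda merged_text, label_type: False
                     ) -> List[Span]:
    """Merge neighbouring Spans from left to right wherever can_merge allows it."""
    merged: List[Span] = []
    for span in sorted(spans, key=lambda s: s.start):
        if merged and can_merge(text, merged[-1], span, threshold, is_normalizable):
            prev = merged[-1]
            # Assumption: the merged Span keeps the lower of the two confidences.
            merged[-1] = Span(prev.start, span.end, prev.line, prev.label,
                              prev.label_type, min(prev.confidence, span.confidence))
        else:
            merged.append(span)
    return merged
```

With this sketch, two ‘Text’ Spans covering `Text` and `Annotation` in `"Text Annotation"` become one Span, while the same Spans in `"Text . Annotation"` stay separate because of the extraneous character between them.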
Input | Able to merge? | Reason | Result |
---|---|---|---|
Text Annotation | yes | / | Text Annotation |
Text Annotation | no | | Text Annotation |
Text . Annotation | no | | Text . Annotation |
Annotation 7 | no | | Annotation 7 |
34 98 | no | | 34 98 |
34 98 | yes | / | 34 98 |
November 2022 | yes | / | November 2022 |
Novamber 2022 | no | | Novamber 2022 |
34 98% | yes | / | 34 98% |
34 98% | no | | 34 98% |
Label Type: Text
Label Type: Number
Label Type: Date
Label Type: Percentage
Label Type: NO LABEL/Below Label threshold
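The ‘Date’ rows can be read as a normalization check on the merged text: `November 2022` still parses as a date, while the misspelled `Novamber 2022` does not, so only the former is merged. The snippet below is a rough sketch of such a check using only the Python standard library. The function name `is_normalizable_date` and the list of accepted formats are assumptions made for illustration; the normalization actually used for each Label type may differ.

```python
from datetime import datetime

# Assumed set of accepted date formats, for illustration only.
DATE_FORMATS = ("%B %Y", "%d.%m.%Y", "%Y-%m-%d")


def is_normalizable_date(merged_text: str) -> bool:
    """Return True if the merged text parses with one of the assumed formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(merged_text.strip(), fmt)
            return True
        except ValueError:
            continue  # try the next format
    return False


print(is_normalizable_date("November 2022"))
print(is_normalizable_date("Novamber 2022"))
```

Note that `%B` matches month names of the active locale, which is English unless the program changes it.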
Vertical Merge
When using an Extraction AI, we join vertically adjacent Spans into a single Annotation after the LabelSet classifier.
A vertical merge is valid only if:
The Spans are on the same Page
The Spans are predicted to have the same Label
Multiline Annotations with this Label exist in the training set
Consecutive vertical Spans either overlap on the x-axis, OR the preceding Span is at the end of its line and the following Span is at the beginning of the next line
The confidence of the predicted Label is above the Label threshold
The Spans are on consecutive lines
The lower Span being merged belongs to an Annotation in the default LabelSet, OR to an AnnotationSet with only a single Annotation
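The checks above can be expressed as a single predicate over an upper and a lower Span. The following is a minimal sketch under stated assumptions, not the SDK code: `LineSpan`, `x_overlap`, and `can_merge_vertically` are hypothetical names, and information such as "is at the end of the line" or "size of the AnnotationSet" is assumed to be precomputed and stored on each Span.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class LineSpan:
    """Hypothetical stand-in for a Span with layout information."""
    page: int
    line: int                          # line index on the Page
    x0: float                          # left edge of the bounding box
    x1: float                          # right edge of the bounding box
    label: Optional[str]               # predicted Label, None for NO LABEL
    confidence: float
    at_line_start: bool = False        # Span begins its line
    at_line_end: bool = False          # Span ends its line
    in_default_label_set: bool = True  # its Annotation is in the default LabelSet
    annotation_set_size: int = 1       # Annotations in its AnnotationSet


def x_overlap(a: LineSpan, b: LineSpan) -> bool:
    """True if the horizontal extents of the two Spans intersect."""
    return min(a.x1, b.x1) > max(a.x0, b.x0)


def can_merge_vertically(upper: LineSpan, lower: LineSpan, threshold: float,
                         multiline_labels: Set[str]) -> bool:
    # Same predicted Label, and multiline Annotations of it exist in training.
    if upper.label is None or upper.label != lower.label:
        return False
    if upper.label not in multiline_labels:
        return False
    # Same Page and consecutive lines.
    if upper.page != lower.page or lower.line != upper.line + 1:
        return False
    # Confidence of both Spans must be above the Label threshold.
    if min(upper.confidence, lower.confidence) <= threshold:
        return False
    # Either the Spans overlap on the x-axis, or the text wraps from the
    # end of one line to the start of the next.
    wraps = upper.at_line_end and lower.at_line_start
    if not (x_overlap(upper, lower) or wraps):
        return False
    # The lower Span must be in the default LabelSet or alone in its AnnotationSet.
    return lower.in_default_label_set or lower.annotation_set_size == 1
```

A caller would pass the set of Labels for which multiline Annotations were seen in training as `multiline_labels`, and apply the predicate to each pair of Spans on consecutive lines.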
Input | Able to merge? | Reason |
---|---|---|
Text | yes | / |
Annotation | no | |
Text more text | no | |
Some random text Text | yes | / |
Some random text Text . | no | |
Text more text | yes | / |
Text | no | |
Annotation Nb. | yes | * |
Annotation 41 | no | |
* The bottom Annotation is alone in its AnnotationSet and therefore can be merged.
** The Annotations on each line have been grouped into their own AnnotationSets and are not merged.
Label 1
Label 2
NO LABEL/Below Label threshold