Merging Logic

Our Extraction AI applies merging logic at two points in the extraction process. The first is a horizontal merge of Spans, which runs right after the Label classifier. This is particularly useful with the Whitespace tokenizer, because merging makes it possible to find Spans that contain spaces. The second is a vertical merge of Spans into a single multiline Annotation. Check out the architecture diagram for more detail.

Horizontal Merge

When using an Extraction AI, we merge adjacent horizontal Spans right after the Label classifier.

A horizontal merge is valid only if:

  1. All Spans have the same predicted Label

  2. Confidence of predicted Label is above the Label threshold

  3. All Spans are on the same line

  4. No extraneous characters in between Spans

  5. A maximum of 5 spaces in between Spans

  6. The Label type is not one of the following: ‘Number’, ‘Positive Number’, ‘Percentage’, ‘Date’, OR the resulting merge creates a Span that is normalizable to the same type
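
The six conditions above can be sketched as a single predicate. This is a minimal illustration and not the SDK's implementation: the `Span` dataclass, the `can_merge_horizontally` function and the `normalizes` callback are hypothetical names introduced here. Reading "above the Label threshold" as a strict comparison, and refusing to merge strict types when no normalizer is supplied, are both assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Label types whose merged text must still normalize (condition 6).
STRICT_TYPES = {"Number", "Positive Number", "Percentage", "Date"}
MAX_SPACES = 5  # condition 5


@dataclass
class Span:
    """Hypothetical stand-in for a predicted Span, not the SDK class."""
    text: str
    start: int        # character offset of the first character in the line
    end: int          # character offset one past the last character
    line: int         # line index
    label: str        # predicted Label name
    label_type: str   # e.g. "Text", "Number", "Date"
    confidence: float


def can_merge_horizontally(
    left: Span,
    right: Span,
    line_text: str,
    threshold: float,
    normalizes: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Return True if two neighbouring Spans satisfy all six conditions."""
    if left.label != right.label:                           # condition 1
        return False
    if min(left.confidence, right.confidence) <= threshold:  # condition 2
        return False
    if left.line != right.line:                             # condition 3
        return False
    gap = line_text[left.end:right.start]
    if gap.strip(" "):                                      # condition 4
        return False
    if len(gap) > MAX_SPACES:                               # condition 5
        return False
    if left.label_type in STRICT_TYPES:                     # condition 6
        merged = line_text[left.start:right.end]
        # Assumption: without a normalizer we conservatively refuse to merge.
        if normalizes is None or not normalizes(merged, left.label_type):
            return False
    return True
```

The normalization check is passed in as a callback so that the sketch does not depend on any particular normalization routine.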

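+----------------------+----------------+-------------+----------------------+
| Input                | Able to merge? | Reason      | Result               |
+======================+================+=============+======================+
| Text  Annotation     | yes            | /           | Text  Annotation     |
+----------------------+----------------+-------------+----------------------+
| Text      Annotation | no             | Condition 5 | Text      Annotation |
+----------------------+----------------+-------------+----------------------+
| Text . Annotation    | no             | Condition 4 | Text . Annotation    |
+----------------------+----------------+-------------+----------------------+
| Annotation 7         | no             |             | Annotation 7         |
+----------------------+----------------+-------------+----------------------+
| 34    98             | no             | Condition 6 | 34    98             |
+----------------------+----------------+-------------+----------------------+
| 34  98               | yes            | /           | 34  98               |
+----------------------+----------------+-------------+----------------------+
| November 2022        | yes            | /           | November 2022        |
+----------------------+----------------+-------------+----------------------+
| Novamber 2022        | no             | Condition 6 | Novamber 2022        |
+----------------------+----------------+-------------+----------------------+
| 34  98%              | yes            | /           | 34  98%              |
+----------------------+----------------+-------------+----------------------+
| 34  98%              | no             |             | 34  98%              |
+----------------------+----------------+-------------+----------------------+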

Legend (Label types): Text, Number, Date, Percentage, NO LABEL/Below Label threshold

Vertical Merge

When using an Extraction AI, we join adjacent vertical Spans into a single Annotation after the LabelSet classifier.

A vertical merge is valid only if:

  1. The Spans are on the same Page

  2. They are predicted to have the same Label

  3. Multiline Annotations with this Label exist in the training set

  4. Consecutive vertical Spans either overlap in the x-axis, OR the preceding Span is at the end of its line and the following Span is at the beginning of the next line

  5. Confidence of predicted Label is above the Label threshold

  6. Spans are on consecutive lines

  7. Merged lower Span belongs to an Annotation in the default LabelSet, OR to an AnnotationSet with only a single Annotation
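
The seven conditions above can likewise be sketched as a predicate over an upper and a lower Span. This is an illustrative sketch under stated assumptions, not the SDK's implementation: `LineSpan`, `x_overlap`, `can_merge_vertically` and every field name are hypothetical. The set `multiline_labels` stands in for "Labels that have multiline Annotations in the training set", and reading "above the Label threshold" as a strict comparison is an assumption.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class LineSpan:
    """Hypothetical stand-in for a predicted Span with layout information."""
    page: int
    line: int                  # line index on the Page
    x0: float                  # left edge of the bounding box
    x1: float                  # right edge of the bounding box
    label: str
    confidence: float
    ends_line: bool = False    # True if the Span is the last text on its line
    starts_line: bool = False  # True if the Span is the first text on its line
    in_default_label_set: bool = True
    annotation_set_size: int = 1  # Annotations in the Span's AnnotationSet


def x_overlap(a: LineSpan, b: LineSpan) -> bool:
    """True if the horizontal extents of the two Spans intersect."""
    return a.x0 < b.x1 and b.x0 < a.x1


def can_merge_vertically(
    upper: LineSpan,
    lower: LineSpan,
    threshold: float,
    multiline_labels: Set[str],
) -> bool:
    """Return True if the two Spans satisfy all seven conditions."""
    if upper.page != lower.page:                              # condition 1
        return False
    if upper.label != lower.label:                            # condition 2
        return False
    if upper.label not in multiline_labels:                   # condition 3
        return False
    wraps = upper.ends_line and lower.starts_line
    if not (x_overlap(upper, lower) or wraps):                # condition 4
        return False
    if min(upper.confidence, lower.confidence) <= threshold:  # condition 5
        return False
    if lower.line - upper.line != 1:                          # condition 6
        return False
    if not (lower.in_default_label_set
            or lower.annotation_set_size == 1):               # condition 7
        return False
    return True
```

Condition 7 is checked only on the lower Span, following the wording of the list above.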

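+--------------------------+----------------+----------------+
| Input                    | Able to merge? | Reason         |
+==========================+================+================+
| Text                     | yes            | /              |
| Annotation               |                |                |
+--------------------------+----------------+----------------+
| Annotation               | no             |                |
| 42                       |                |                |
+--------------------------+----------------+----------------+
| Text more text           | no             |                |
|           Annotation     |                |                |
+--------------------------+----------------+----------------+
| Some random text Text    | yes            | /              |
| Annotation               |                |                |
+--------------------------+----------------+----------------+
| Some random text Text  . | no             |                |
| Annotation               |                |                |
+--------------------------+----------------+----------------+
| Text more text           | yes            | /              |
|     Annotation           |                |                |
|         42               |                |                |
+--------------------------+----------------+----------------+
| Text                     | no             |                |
|         more text        |                |                |
|    Annotation            |                |                |
+--------------------------+----------------+----------------+
| Annotation  Nb.          | yes            | *              |
|                       42 |                |                |
+--------------------------+----------------+----------------+
| Annotation  41           | no             | Condition 7 ** |
| Annotation  42           |                |                |
+--------------------------+----------------+----------------+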

* The bottom Annotation is alone in its AnnotationSet and therefore can be merged.
** The Annotations on each line have been grouped into their own AnnotationSets and are not merged.

Legend: Label 1, Label 2, NO LABEL/Below Label threshold