Categorize a Document using Categorization AI


Prerequisites:

  • Data Layer concepts of Konfuzio SDK: Document, Category, Project, Page

  • AI concepts of Konfuzio SDK: Extraction

  • Understanding of ML concepts: train-validation loop, optimizer, epochs

Difficulty: Medium

Goal: Learn how to categorize a Document using one of Categorization AIs pre-constructed by Konfuzio


Environment

You need to install the Konfuzio SDK before diving into the tutorial.
To get up and running quickly, you can use our Colab Quick Start notebook.
Open In Colab

As an alternative you can follow the installation section to install and initialize the Konfuzio SDK locally or on an environment of your choice.

Introduction

To categorize a Document with a Categorization AI constructed by Konfuzio, there are two main options: the Name-based Categorization AI and the more complex Model-based Categorization AI.

Name-based Categorization AI

The name-based Categorization AI is a simple logic that checks if a name of the Category appears in the Document. It can be used to categorize Documents when no model-based Categorization AI is available.

Let’s begin with making imports, initializing the Categorization model and calling the Document to categorize.

from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.document_categorization import NameBasedCategorizationAI

project = Project(id_=YOUR_PROJECT_ID)
categorization_model = NameBasedCategorizationAI(project.categories)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)

Then, we categorize the Document. The Categorization Model returns a copy of the SDK Document with Category attribute (use inplace=True to maintain the original Document instead). If the input Document is already categorized, the already present Category is used (use recategorize=True if you want to force a recategorization). Each Page is categorized individually.

result_doc = categorization_model.categorize(document=test_document)

for page in result_doc.pages():
    assert page.category == project.categories[0]
    print(f"Found category {page.category} for {page}")
Found category Category: Lohnabrechnung (63) for Page 0 in Virtual Document None (44865)

The Category of the Document is defined when all pages’ Categories are equal. If the Document contains mixed Categories, only the Page level Category will be defined, and the Document level Category will be NO_CATEGORY.

print(f"Found category {result_doc.category} for {result_doc}")
Found category Category: Lohnabrechnung (63) for Virtual Document None (44865)

Model-based Categorization AI

For better results you can build, train and test a Categorization AI using Image Models and Text Models to classify the image and text of each Page.

Let’s start with the imports and initializing the Project.

from konfuzio_sdk.data import Project, Document
from konfuzio_sdk.trainer.document_categorization import build_categorization_ai_pipeline
from konfuzio_sdk.trainer.document_categorization import ImageModel, TextModel, CategorizationAI

project = Project(id_=YOUR_PROJECT_ID)

Build the Categorization AI architecture using a template of pre-built Image and Text classification Models. In this tutorial, we use EfficientNetB0 and NBOWSelfAttention together.

categorization_pipeline = build_categorization_ai_pipeline(
    categories=project.categories,
    documents=project.documents,
    test_documents=project.test_documents,
    image_model=ImageModel.EfficientNetB0,
    text_model=TextModel.NBOWSelfAttention,
)

Train and evaluate the AI. You can specify parameters for training, for example, number of epochs and an optimizer.

categorization_pipeline.fit(n_epochs=1, optimizer={'name': 'Adam'})
data_quality = categorization_pipeline.evaluate(use_training_docs=True)
ai_quality = categorization_pipeline.evaluate()
assert data_quality.f1(None) == 1.0
assert ai_quality.f1(None) == 1.0

Categorize a Document using the newly trained model.

document = project.get_document_by_id(YOUR_DOCUMENT_ID)
categorization_result = categorization_pipeline.categorize(document=document)
assert isinstance(categorization_result, Document)
for page in categorization_result.pages():
    print(f"Found category {page.category} for {page}")
Found category Category: Lohnabrechnung (63) for Page 0 in Virtual Document None (44865)

Save the model and check that it can be loaded after that to ensure it could be uploaded to the Konfuzio app or to an on-prem installation.

pickle_ai_path = categorization_pipeline.save()
categorization_pipeline = CategorizationAI.load_model(pickle_ai_path)

To prepare the data for training and testing your AI, follow the data preparation tutorial.

For a list of available Models see all the available Categorization Models below.

Categorization AI Models

When using build_categorization_ai_pipeline, you can select which Image Module and/or Text Module to use for classification. At least one between the Image Model or the Text Model must be specified. Both can also be used at the same time.

The list of available Categorization Models is implemented as an Enum containing the following elements:

from konfuzio_sdk.trainer.document_categorization import ImageModel, TextModel

# Image Models
ImageModel.VGG11
ImageModel.VGG13
ImageModel.VGG16
ImageModel.VGG19
ImageModel.EfficientNetB0
ImageModel.EfficientNetB1
ImageModel.EfficientNetB2
ImageModel.EfficientNetB3
ImageModel.EfficientNetB4
ImageModel.EfficientNetB5
ImageModel.EfficientNetB6
ImageModel.EfficientNetB7
ImageModel.EfficientNetB8

# Text Models
TextModel.NBOW
TextModel.NBOWSelfAttention
TextModel.LSTM
TextModel.BERT

See more details about these Categorization Models under API Reference - Categorization AI.

Possible configurations

The following configurations of Categorization AI are tested. Tokenizer, text and image processors can be specified when building Categorization pipeline locally; text and image processors can be specified when building the pipeline either locally or on app/on-prem installation.

Each line stands for a single configuration. If a field is None, it requires specifying it as None; otherwise, the default value will be applied.

You can find more information on how to use these configurations, what are default values and where to specify them here.

Tokenizer

Text model class

Text model name

Image model class

Image model name

WhitespaceTokenizer

NBOWSelfAttention

nbowselfattention

EfficientNet

efficientnet_b0

WhitespaceTokenizer

NBOWSelfAttention

nbowselfattention

EfficientNet

efficientnet_b3

WhitespaceTokenizer

NBOW

nbow

VGG

vgg11

WhitespaceTokenizer

LSTM

lstm

VGG

vgg13

ConnectedTextTokenizer

NBOW

nbow

VGG

vgg11

ConnectedTextTokenizer

LSTM

lstm

VGG

vgg13

None

None

None

EfficientNet

efficientnet_b0

None

None

None

EfficientNet

efficientnet_b3

None

None

None

VGG

vgg11

None

None

None

VGG

vgg13

None

None

None

VGG

vgg16

None

None

None

VGG

vgg19

TransformersTokenizer

BERT *

bert-base-german-cased

None

None

*Note: In this table, we list a single BERT-based model (bert-base-german-cased). The following table lists the possible values for text model versions that can be passed as name argument when configuring BERT model for Categorization.

Models compatible with BERT class

Name

Embeddings dimension

Language

Number of parameters

bert-base-uncased

768

English

110 million

distilbert-base-uncased

768

English

66 million

google/mobilebert-uncased

512

English

25 million

albert-base-v2

768

English

12 million

german-nlp-group/electra-base-german-uncased

768

German

111 million

bert-base-german-cased

768

German

110 million

bert-base-german-uncased

768

German

110 million

distilbert-base-german-cased

768

German

66 million

bert-base-multilingual-cased

768

Multiple

110 million

Note: This list is not exhaustive. We only list the models that are fully tested. However, you can use the Huggingface hub to find other models that best suit your needs. To ensure a model is compatible with Categorization AI, initialize it with the SDK’s TransformersTokenizer class as presented in an example below, replacing the value of name to the name of your model of choice. If a model is compatible, the initialization will be successful; otherwise, an error about incompatibility will appear.

from konfuzio_sdk.trainer.tokenization import TransformersTokenizer

tokenizer = TransformersTokenizer(name='bert-base-chinese')

Configurable parameters

Every group of models/configuration you decide to use has manually configurable parameters. Follow this section to find out what parameters are configurable and which models accept them.

Some of the parameters are universally accepted for training regardless of the model.

  • n_epochs - number of times the entire training dataset is passed through the model during training. BERT models require lower values like 3-5, other models can require higher number, like 20+. Default value is 20.

  • patience - number of epochs to wait before early stopping if the model’s performance on the validation set does not improve. Default value is 3.

  • optimizer - algorithm used to update the model’s parameters during training. Default value is AdamW with learning rate of 1e-4.

  • lr_decay - rate at which the learning rate is reduced over time during training to help the model maximize training efficiency. Default value is 0.999.

Other parameters are configurable only for some of the models and might not have a unified default value.

  • input_dim - dimensionality of the input data, which represents the number of features or variables in the input.

  • dropout_rate - fraction of the input units to randomly set to 0 during training to prevent overfitting.

  • emb_dim - dimensionality of the embeddings (vector representation). Default value is 64.

  • n_heads - number of attention heads in multi-head attention mechanisms which enable the model to attend to different parts of the input simultaneously. Note that n_heads must be a factor of emb_dim, i.e. emb_dim % n_heads == 0.

  • hid_dim - dimensionality of the hidden states in the model. Default value is 256.

  • n_layers - number of layers in the model. Default value is 2.

  • bidirectional - whether to use bidirectional processing in LSTM, enabling the model to consider both past and future context. Default value is True.

  • name - a name or identifier for the model.

  • freeze - whether to freeze the weights of certain layers or parameters during training, preventing them from being updated.

Model

input_dim

dropout_rate

emb_dim

n_heads

hid_dim

n_layers

bidirectional

name

freeze

NBOW

NBOWSelfAttention

LSTM

BERT

VGG

EfficientNet

Conclusion

In this tutorial, we presented two different ways to categorize a Document using AIs constructed by Konfuzio and provided possible configurations that can be used in model-based Categorization.

What’s next?