Categorize a Document using Categorization AI¶
Prerequisites:
Data Layer concepts of Konfuzio SDK: Document, Category, Project, Page
AI concepts of Konfuzio SDK: Extraction
Understanding of ML concepts: train-validation loop, optimizer, epochs
Difficulty: Medium
Goal: Learn how to categorize a Document using one of the Categorization AIs pre-built by Konfuzio
Environment¶
You need to install the Konfuzio SDK before diving into the tutorial.
To get up and running quickly, you can use our Colab Quick Start notebook.
As an alternative you can follow the installation section to install and initialize the Konfuzio SDK locally or on an environment of your choice.
Introduction¶
To categorize a Document with a Categorization AI constructed by Konfuzio, there are two main options: the Name-based Categorization AI and the more complex Model-based Categorization AI.
Name-based Categorization AI¶
The name-based Categorization AI is a simple rule that checks whether the name of a Category appears in the Document. It can be used to categorize Documents when no model-based Categorization AI is available.
Let’s begin by importing the required classes, initializing the Categorization model and retrieving the Document to categorize.
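The matching rule can be illustrated without the SDK. The following is a minimal sketch, not the SDK's implementation: the function `categorize_page_by_name`, the `NO_CATEGORY` string and the case-insensitive comparison are assumptions made for illustration.

```python
from typing import List

NO_CATEGORY = "NO_CATEGORY"  # hypothetical placeholder label, not the SDK object


def categorize_page_by_name(page_text: str, category_names: List[str]) -> str:
    """Return the first Category name that occurs in the page text."""
    lowered = page_text.lower()
    for name in category_names:
        # Assumption: the comparison ignores letter case.
        if name.lower() in lowered:
            return name
    return NO_CATEGORY


pages = ["Invoice No. 2024-001\nTotal: 120 EUR", "Thank you for your purchase."]
names = ["Invoice", "Payslip"]
found = [categorize_page_by_name(text, names) for text in pages]
print(found)  # ['Invoice', 'NO_CATEGORY']
```

The second page contains neither Category name, which is why a name-based approach only works when the Category name is actually printed on the Page.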
from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.document_categorization import NameBasedCategorizationAI
project = Project(id_=YOUR_PROJECT_ID)
categorization_model = NameBasedCategorizationAI(project.categories)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)
Then, we categorize the Document. By default, the Categorization model returns a copy of the SDK Document with the Category attribute set (use inplace=True to modify the original Document instead). If the input Document is already categorized, the existing Category is kept (use recategorize=True to force a recategorization). Each Page is categorized individually.
result_doc = categorization_model.categorize(document=test_document)
for page in result_doc.pages():
    assert page.category == project.categories[0]
    print(f"Found category {page.category} for {page}")
The Category of the Document is defined when all pages’ Categories are equal. If the Document contains mixed Categories, only the Page level Category will be defined, and the Document level Category will be NO_CATEGORY.
print(f"Found category {result_doc.category} for {result_doc}")
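The Document-level rule described above can be sketched in plain Python. The helper `document_category` and the string label `NO_CATEGORY` are hypothetical stand-ins for the SDK objects, and treating a Document without Pages as uncategorized is an assumption.

```python
from typing import List

NO_CATEGORY = "NO_CATEGORY"  # hypothetical placeholder label, not the SDK object


def document_category(page_categories: List[str]) -> str:
    """Return the shared Category if all Pages agree, else NO_CATEGORY."""
    unique = set(page_categories)
    if len(unique) == 1:
        return next(iter(unique))
    # Mixed Categories (or no Pages at all): no Document-level Category.
    return NO_CATEGORY


uniform = document_category(["Invoice", "Invoice", "Invoice"])
mixed = document_category(["Invoice", "Payslip"])
print(uniform, mixed)  # Invoice NO_CATEGORY
```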
Model-based Categorization AI¶
For better results you can build, train and test a Categorization AI using Image Models and Text Models to classify the image and text of each Page.
Let’s start with the imports and initializing the Project.
from konfuzio_sdk.data import Project, Document
from konfuzio_sdk.trainer.document_categorization import build_categorization_ai_pipeline
from konfuzio_sdk.trainer.document_categorization import ImageModel, TextModel, CategorizationAI
project = Project(id_=YOUR_PROJECT_ID)
Build the Categorization AI architecture using a template of pre-built Image and Text classification Models. In this tutorial, we use EfficientNetB0 and NBOWSelfAttention together.
categorization_pipeline = build_categorization_ai_pipeline(
    categories=project.categories,
    documents=project.documents,
    test_documents=project.test_documents,
    image_model=ImageModel.EfficientNetB0,
    text_model=TextModel.NBOWSelfAttention,
)
Train and evaluate the AI. You can specify training parameters such as the number of epochs and the optimizer.
categorization_pipeline.fit(n_epochs=1, optimizer={'name': 'Adam'})
data_quality = categorization_pipeline.evaluate(use_training_docs=True)
ai_quality = categorization_pipeline.evaluate()
assert data_quality.f1(None) == 1.0
assert ai_quality.f1(None) == 1.0
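To make the asserted F1 value concrete, here is a minimal sketch of how a per-Category F1-score is computed from true and predicted labels. The function `f1_for_category` and the sample labels are hypothetical; the SDK's evaluation object may aggregate results differently.

```python
from typing import List


def f1_for_category(true: List[str], predicted: List[str], category: str) -> float:
    """Compute the F1-score of one Category from paired true/predicted labels."""
    pairs = list(zip(true, predicted))
    tp = sum(t == category and p == category for t, p in pairs)
    fp = sum(t != category and p == category for t, p in pairs)
    fn = sum(t == category and p != category for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct prediction for this Category
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


true_labels = ["Invoice", "Invoice", "Payslip", "Payslip"]
predicted_labels = ["Invoice", "Payslip", "Payslip", "Payslip"]
invoice_f1 = f1_for_category(true_labels, predicted_labels, "Invoice")
payslip_f1 = f1_for_category(true_labels, predicted_labels, "Payslip")
print(round(invoice_f1, 3), round(payslip_f1, 3))
```

A score of 1.0, as asserted in the tutorial, means that every Document was assigned its true Category.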
Categorize a Document using the newly trained model.
document = project.get_document_by_id(YOUR_DOCUMENT_ID)
categorization_result = categorization_pipeline.categorize(document=document)
assert isinstance(categorization_result, Document)
for page in categorization_result.pages():
    print(f"Found category {page.category} for {page}")
Save the model and check that it can be loaded again to ensure it can be uploaded to the Konfuzio app or to an on-prem installation.
pickle_ai_path = categorization_pipeline.save()
categorization_pipeline = CategorizationAI.load_model(pickle_ai_path)
To prepare the data for training and testing your AI, follow the data preparation tutorial.
For a list of available Models, see the Categorization AI Models section below.
Categorization AI Models¶
When using build_categorization_ai_pipeline, you can select which Image Model and/or Text Model to use for classification. At least one of the two must be specified; both can also be used at the same time.
The available Categorization Models are implemented as Enums containing the following elements:
from konfuzio_sdk.trainer.document_categorization import ImageModel, TextModel
# Image Models
ImageModel.VGG11
ImageModel.VGG13
ImageModel.VGG16
ImageModel.VGG19
ImageModel.EfficientNetB0
ImageModel.EfficientNetB1
ImageModel.EfficientNetB2
ImageModel.EfficientNetB3
ImageModel.EfficientNetB4
ImageModel.EfficientNetB5
ImageModel.EfficientNetB6
ImageModel.EfficientNetB7
ImageModel.EfficientNetB8
# Text Models
TextModel.NBOW
TextModel.NBOWSelfAttention
TextModel.LSTM
TextModel.BERT
See more details about these Categorization Models under API Reference - Categorization AI.
Possible configurations¶
The following configurations of Categorization AI are tested. The tokenizer can be specified only when building the Categorization pipeline locally; the text and image processors can be specified when building the pipeline either locally or on the app/on-prem installation.
Each row stands for a single configuration. If a field is None, it must be set to None explicitly; otherwise, the default value is applied.
You can find more information on how to use these configurations, what the default values are and where to specify them here.
| Tokenizer | Text model class | Text model name | Image model class | Image model name |
|---|---|---|---|---|
| WhitespaceTokenizer | NBOWSelfAttention | | EfficientNet | |
| WhitespaceTokenizer | NBOWSelfAttention | | EfficientNet | |
| WhitespaceTokenizer | NBOW | | VGG | |
| WhitespaceTokenizer | LSTM | | VGG | |
| ConnectedTextTokenizer | NBOW | | VGG | |
| ConnectedTextTokenizer | LSTM | | VGG | |
| None | None | None | EfficientNet | |
| None | None | None | EfficientNet | |
| None | None | None | VGG | |
| None | None | None | VGG | |
| None | None | None | VGG | |
| None | None | None | VGG | |
| TransformersTokenizer | BERT * | | None | None |
*Note: In this table, we list a single BERT-based model (bert-base-german-cased). The following table lists the possible values for text model versions that can be passed as the name argument when configuring a BERT model for Categorization.
Models compatible with BERT class¶
| Name | Embeddings dimension | Language | Number of parameters |
|---|---|---|---|
| | 768 | English | 110 million |
| | 768 | English | 66 million |
| | 512 | English | 25 million |
| | 768 | English | 12 million |
| | 768 | German | 111 million |
| | 768 | German | 110 million |
| | 768 | German | 110 million |
| | 768 | German | 66 million |
| | 768 | Multiple | 110 million |
Note: This list is not exhaustive. We only list the models that are fully tested. However, you can use the Huggingface hub to find other models that best suit your needs. To ensure a model is compatible with Categorization AI, initialize it with the SDK’s TransformersTokenizer class as shown in the example below, replacing the value of name with the name of your model of choice. If the model is compatible, the initialization succeeds; otherwise, an error about incompatibility is raised.
from konfuzio_sdk.trainer.tokenization import TransformersTokenizer
tokenizer = TransformersTokenizer(name='bert-base-chinese')
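When several candidate models need to be checked, the same try-and-see approach can be wrapped in a small helper. This is a sketch under assumptions: `check_compatibility` is a hypothetical helper, and `fake_tokenizer` is an invented stand-in so that the example runs without the SDK; its acceptance rule says nothing about real model compatibility. In practice you would pass something like `lambda n: TransformersTokenizer(name=n)` instead.

```python
from typing import Callable, Dict, Iterable


def check_compatibility(
    make_tokenizer: Callable[[str], object], names: Iterable[str]
) -> Dict[str, bool]:
    """Try to initialize a tokenizer for each name and record the outcome."""
    results = {}
    for name in names:
        try:
            make_tokenizer(name)
            results[name] = True
        except Exception as error:  # any failure counts as incompatible
            print(f"{name}: not compatible ({error})")
            results[name] = False
    return results


def fake_tokenizer(name: str) -> object:
    # Invented stand-in: accepts names starting with "bert-" only.
    if not name.startswith("bert-"):
        raise ValueError("unsupported model")
    return object()


report = check_compatibility(fake_tokenizer, ["bert-base-chinese", "some-other-model"])
print(report)
```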
Configurable parameters¶
Every group of models/configuration you decide to use has manually configurable parameters. Follow this section to find out what parameters are configurable and which models accept them.
Some of the parameters are universally accepted for training regardless of the model.
n_epochs - number of times the entire training dataset is passed through the model during training. BERT models typically need lower values such as 3-5, while other models can need higher values such as 20 or more. Default value is 20.
patience - number of epochs to wait before early stopping if the model’s performance on the validation set does not improve. Default value is 3.
optimizer - algorithm used to update the model’s parameters during training. Default value is AdamW with a learning rate of 1e-4.
lr_decay - rate at which the learning rate is reduced over time during training to help the model maximize training efficiency. Default value is 0.999.
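The interplay of patience and lr_decay can be illustrated with a short, SDK-independent sketch. The helper names `decayed_learning_rates` and `stopping_epoch` are hypothetical, and applying the decay as a multiplication once per step is an assumption about the schedule, not a statement about the SDK's training loop.

```python
from typing import List


def decayed_learning_rates(base_lr: float, lr_decay: float, n_steps: int) -> List[float]:
    """Learning rate per step, assuming multiplicative (exponential) decay."""
    return [base_lr * lr_decay ** step for step in range(n_steps)]


def stopping_epoch(val_losses: List[float], patience: int = 3) -> int:
    """Return the epoch index at which training stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # early stopping triggered
    return len(val_losses) - 1  # ran through all epochs


rates = decayed_learning_rates(base_lr=1e-4, lr_decay=0.999, n_steps=3)
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
stopped_at = stopping_epoch(losses, patience=3)
print(rates, stopped_at)
```

With a patience of 3, training stops at epoch index 5 because the validation loss has not improved for three consecutive epochs; the later improvement at index 6 is never reached.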
Other parameters are configurable only for some of the models and might not have a unified default value.
input_dim - dimensionality of the input data, which represents the number of features or variables in the input.
dropout_rate - fraction of the input units to randomly set to 0 during training to prevent overfitting.
emb_dim - dimensionality of the embeddings (vector representation). Default value is 64.
n_heads - number of attention heads in multi-head attention mechanisms, which enable the model to attend to different parts of the input simultaneously. Note that n_heads must be a factor of emb_dim, i.e. emb_dim % n_heads == 0.
hid_dim - dimensionality of the hidden states in the model. Default value is 256.
n_layers - number of layers in the model. Default value is 2.
bidirectional - whether to use bidirectional processing in LSTM, enabling the model to consider both past and future context. Default value is True.
name - a name or identifier for the model.
freeze - whether to freeze the weights of certain layers or parameters during training, preventing them from being updated.
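The divisibility constraint between n_heads and emb_dim can be checked before training starts. The following is a minimal sketch; `head_dimension` is a hypothetical helper, not part of the SDK.

```python
from typing import List


def head_dimension(emb_dim: int, n_heads: int) -> int:
    """Return the size of each attention head, or raise if the split is uneven."""
    if emb_dim % n_heads != 0:
        raise ValueError(f"n_heads={n_heads} does not divide emb_dim={emb_dim}")
    return emb_dim // n_heads


emb_dim = 64  # default value stated above
valid_n_heads: List[int] = [n for n in range(1, emb_dim + 1) if emb_dim % n == 0]
print(valid_n_heads)  # [1, 2, 4, 8, 16, 32, 64]
print(head_dimension(emb_dim, 8))  # 8
```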
| Model | input_dim | dropout_rate | emb_dim | n_heads | hid_dim | n_layers | bidirectional | name | freeze |
|---|---|---|---|---|---|---|---|---|---|
| NBOW | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| NBOWSelfAttention | ✔ | ✘ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ |
| LSTM | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✘ | ✘ |
| BERT | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✔ |
| VGG | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✔ |
| EfficientNet | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✔ |
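The table can also be encoded as a lookup to catch unsupported parameters early. This sketch assumes that the nine table columns correspond to the model-specific parameters in the order they are listed above (input_dim through freeze); the mapping `ACCEPTED_PARAMETERS` and the helper `unsupported_parameters` are hypothetical and not part of the SDK.

```python
from typing import Dict, List, Set

# Assumption: columns map to the parameter list above, in the same order.
ACCEPTED_PARAMETERS: Dict[str, Set[str]] = {
    "NBOW": {"input_dim", "dropout_rate", "emb_dim"},
    "NBOWSelfAttention": {"input_dim", "emb_dim", "n_heads"},
    "LSTM": {"input_dim", "dropout_rate", "emb_dim", "hid_dim", "n_layers", "bidirectional"},
    "BERT": {"name", "freeze"},
    "VGG": {"name", "freeze"},
    "EfficientNet": {"name", "freeze"},
}


def unsupported_parameters(model: str, config: Dict[str, object]) -> List[str]:
    """Return the configured parameter names that the model does not accept."""
    return sorted(set(config) - ACCEPTED_PARAMETERS[model])


rejected = unsupported_parameters("NBOW", {"emb_dim": 64, "n_heads": 8})
print(rejected)  # ['n_heads']
```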
Conclusion¶
In this tutorial, we presented two different ways to categorize a Document using AIs constructed by Konfuzio and provided possible configurations that can be used in model-based Categorization.