Document Categorization¶
When uploading a Document to Konfuzio, the first step is to assign it to a Category. This can be done manually, or automatically using a Categorization AI. Categorization always happens on a Project level.
Setting the Category of a Document and its individual Pages Manually¶
You can initialize a Document with a Category. You can also use Document.set_category to set a Document’s Category after it has been initialized. This counts as a manual revision by a human.
from konfuzio_sdk.data import Project, Document
project = Project(id_=YOUR_PROJECT_ID)
my_category = project.get_category_by_id(YOUR_CATEGORY_ID)
my_document = Document(text="My text.", project=project, category=my_category)
assert my_document.category == my_category
my_document.set_category(my_category)
assert my_document.category_is_revised is True
If a Document is initialized with no Category, it will automatically be set to NO_CATEGORY. Another Category can be manually set later.
document = project.get_document_by_id(YOUR_DOCUMENT_ID)
document.set_category(None)
assert document.category == project.no_category
document.set_category(my_category)
assert document.category == my_category
assert document.category_is_revised is True
# This will set it for all of its Pages as well.
for page in document.pages():
    assert page.category == my_category
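The behaviour described above can be mimicked with a small sketch that does not use the SDK. `SketchPage` and `SketchDocument` are hypothetical stand-ins, not the SDK's `Page` and `Document`; they only illustrate, under simplifying assumptions, how passing `None` falls back to a NO_CATEGORY placeholder, how a manual assignment is flagged as revised, and how it propagates to every Page.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NO_CATEGORY = "NO_CATEGORY"  # hypothetical placeholder for the Project's "no category" entry


@dataclass
class SketchPage:
    number: int
    category: str = NO_CATEGORY


@dataclass
class SketchDocument:
    pages: List[SketchPage] = field(default_factory=list)
    category: str = NO_CATEGORY
    category_is_revised: bool = False

    def set_category(self, category: Optional[str]) -> None:
        """Assign a Category manually and copy it to every Page."""
        self.category = NO_CATEGORY if category is None else category
        # Assumption of this sketch: only a real Category counts as a human revision.
        self.category_is_revised = category is not None
        for page in self.pages:
            page.category = self.category


sketch_document = SketchDocument(pages=[SketchPage(1), SketchPage(2)])
sketch_document.set_category(None)
category_after_reset = sketch_document.category
sketch_document.set_category("Invoice")
```

The real SDK objects carry much more state; the sketch keeps only the three attributes the example above asserts on.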
If you use a Categorization AI (such as the NameBasedCategorizationAI) to automatically assign a Category to a Document, each Page will be assigned a Category Annotation with predicted confidence information, and the following properties will be accessible. You can also find these documented under API Reference - Document, API Reference - Page and API Reference - Category Annotation.
| Property | Description |
|---|---|
| | The AI predicted Category of this Category |
| | The AI predicted confidence of this Category |
| | List of predicted Category Annotations at the |
| | Get the maximum confidence predicted Category |
| | Get the maximum confidence predicted Category |
| | Returns a Category only if all Pages have same |
| | List of predicted Category Annotations at the |
| | Get the maximum confidence predicted Category |
| | Get the maximum confidence predicted Category |
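To make the relationship between Category Annotations, Page Categories and the Document Category more concrete, here is a minimal sketch in plain Python. `SketchCategoryAnnotation`, `page_category` and `document_category` are hypothetical names, not SDK classes or functions, and the rule that a human-revised annotation takes precedence over confidence is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchCategoryAnnotation:
    category: str
    confidence: float
    revised: bool = False  # True if a human set this Category (assumed flag)


def page_category(annotations: List[SketchCategoryAnnotation]) -> Optional[str]:
    """Prefer a human-revised annotation, else take the highest confidence."""
    if not annotations:
        return None
    revised = [annotation for annotation in annotations if annotation.revised]
    if revised:
        return revised[0].category
    return max(annotations, key=lambda annotation: annotation.confidence).category


def document_category(page_categories: List[Optional[str]]) -> Optional[str]:
    """Return a Category only if every Page agrees on the same one."""
    distinct = set(page_categories)
    if len(distinct) == 1 and None not in distinct:
        return distinct.pop()
    return None


page_one = [SketchCategoryAnnotation("Invoice", 0.9), SketchCategoryAnnotation("Receipt", 0.1)]
page_two = [SketchCategoryAnnotation("Invoice", 0.4), SketchCategoryAnnotation("Receipt", 0.6)]
per_page = [page_category(page_one), page_category(page_two)]
mixed_result = document_category(per_page)
```

In this example the two Pages disagree, so the Document-level result is `None`, which mirrors the "only if all Pages have the same Category" rule in the table.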
To categorize a Document with a Categorization AI, we have two main options: the Name-based Categorization AI and the more complex Model-based Categorization AI.
Name-based Categorization AI¶
The name-based Categorization AI is a good fallback: it uses the name of each Category to categorize Documents when no model-based Categorization AI is available:
from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.document_categorization import NameBasedCategorizationAI
# Set up your Project.
project = Project(id_=YOUR_PROJECT_ID)
# Initialize the Categorization Model.
categorization_model = NameBasedCategorizationAI(project.categories)
# Retrieve a Document to categorize.
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)
# The Categorization Model returns a copy of the SDK Document with Category attribute
# (use inplace=True to maintain the original Document instead).
# If the input Document is already categorized, the already present Category is used
# (use recategorize=True if you want to force a recategorization).
result_doc = categorization_model.categorize(document=test_document)
# Each Page is categorized individually.
for page in result_doc.pages():
    assert page.category == project.categories[0]
    print(f"Found category {page.category} for {page}")
# The Category of the Document is defined when all pages' Categories are equal.
# If the Document contains mixed Categories, only the Page level Category will be defined,
# and the Document level Category will be NO_CATEGORY.
print(f"Found category {result_doc.category} for {result_doc}")
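The idea behind name-based categorization can be sketched without the SDK. The function `categorize_by_name` below is hypothetical and only assumes a case-insensitive substring match of the Category name in the Page text; the actual matching rules of NameBasedCategorizationAI may differ.

```python
from typing import List, Optional


def categorize_by_name(page_text: str, category_names: List[str]) -> Optional[str]:
    """Return the first Category whose name occurs in the text, ignoring case."""
    lowered = page_text.lower()
    for name in category_names:
        if name.lower() in lowered:
            return name
    # No Category name was found in the text.
    return None


names = ["Invoice", "Payslip"]
first_match = categorize_by_name("INVOICE no. 2024-001", names)
no_match = categorize_by_name("Dear customer, thank you.", names)
```

A Page whose text never mentions a Category name stays uncategorized in this sketch, which is why the approach works best as a fallback.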
Model-based Categorization AI¶
For better results you can build, train and test a Categorization AI using Image Models and Text Models to classify the image and text of each Page:
from konfuzio_sdk.data import Project, Document
from konfuzio_sdk.trainer.document_categorization import build_categorization_ai_pipeline
from konfuzio_sdk.trainer.document_categorization import ImageModel, TextModel, CategorizationAI
# Set up your Project.
project = Project(id_=YOUR_PROJECT_ID)
# Build the Categorization AI architecture using a template
# of pre-built Image and Text classification Models.
categorization_pipeline = build_categorization_ai_pipeline(
    categories=project.categories,
    documents=project.documents,
    test_documents=project.test_documents,
    image_model=ImageModel.EfficientNetB0,
    text_model=TextModel.NBOWSelfAttention,
)
# Train the AI.
categorization_pipeline.fit(n_epochs=1, optimizer={'name': 'Adam'})
# Evaluate the AI
data_quality = categorization_pipeline.evaluate(use_training_docs=True)
ai_quality = categorization_pipeline.evaluate()
assert data_quality.f1(None) == 1.0
assert ai_quality.f1(None) == 1.0
# Categorize a Document
document = project.get_document_by_id(YOUR_DOCUMENT_ID)
categorization_result = categorization_pipeline.categorize(document=document)
assert isinstance(categorization_result, Document)
for page in categorization_result.pages():
    print(f"Found category {page.category} for {page}")
# Save and load a pickle file for the AI
pickle_ai_path = categorization_pipeline.save()
categorization_pipeline = CategorizationAI.load_model(pickle_ai_path)
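The evaluation step compares predicted Categories with the ground truth. As a rough illustration of the metric involved, the sketch below computes the F1 score for a single Category from two lists of labels. `f1_for_category` is a hypothetical helper written for this example; it is not the SDK's evaluation code, and the exact aggregation the SDK applies for `f1(None)` is not shown here.

```python
from typing import List


def f1_for_category(truth: List[str], predicted: List[str], category: str) -> float:
    """F1 = harmonic mean of precision and recall for one Category."""
    pairs = list(zip(truth, predicted))
    true_positives = sum(t == category and p == category for t, p in pairs)
    false_positives = sum(t != category and p == category for t, p in pairs)
    false_negatives = sum(t == category and p != category for t, p in pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


truth = ["Invoice", "Invoice", "Receipt", "Receipt"]
predicted = ["Invoice", "Receipt", "Receipt", "Receipt"]
f1_invoice = f1_for_category(truth, predicted, "Invoice")
f1_receipt = f1_for_category(truth, predicted, "Receipt")
```

A score of 1.0, as asserted in the example above, means that every Page of that Category was found and nothing else was assigned to it.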
To prepare the data for training and testing your AI, follow the data preparation tutorial.
For a list of available Models see all the available Categorization Models below.
Categorization AI Models¶
When using build_categorization_ai_pipeline, you can select which Image Model and/or Text Model to use for classification. At least one of the two must be specified, and both can be used at the same time.
The available Categorization Models are implemented as Enums containing the following elements:
from konfuzio_sdk.trainer.document_categorization import ImageModel, TextModel
# Image Models
ImageModel.VGG11
ImageModel.VGG13
ImageModel.VGG16
ImageModel.VGG19
ImageModel.EfficientNetB0
ImageModel.EfficientNetB1
ImageModel.EfficientNetB2
ImageModel.EfficientNetB3
ImageModel.EfficientNetB4
ImageModel.EfficientNetB5
ImageModel.EfficientNetB6
ImageModel.EfficientNetB7
ImageModel.EfficientNetB8
# Text Models
TextModel.NBOW
TextModel.NBOWSelfAttention
TextModel.LSTM
TextModel.BERT
See more details about these Categorization Models under API Reference - Categorization AI.
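The "at least one Model" rule can be illustrated with a short sketch. The enums `SketchImageModel` and `SketchTextModel`, their string values, and the helper `check_model_selection` are all hypothetical; they only mirror the idea of selecting Models through Enum members and are not how the SDK validates its arguments.

```python
from enum import Enum
from typing import Optional


class SketchImageModel(Enum):
    VGG11 = "vgg11"  # values are placeholders chosen for this sketch
    EfficientNetB0 = "efficientnet_b0"


class SketchTextModel(Enum):
    NBOW = "nbow"
    BERT = "bert"


def check_model_selection(
    image_model: Optional[SketchImageModel] = None,
    text_model: Optional[SketchTextModel] = None,
) -> str:
    """Return which kind of pipeline the selection implies, or raise if empty."""
    if image_model is None and text_model is None:
        raise ValueError("Specify an Image Model, a Text Model, or both.")
    if image_model is not None and text_model is not None:
        return "multimodal"
    return "image" if image_model is not None else "text"


image_only = check_model_selection(image_model=SketchImageModel.EfficientNetB0)
both = check_model_selection(SketchImageModel.VGG11, SketchTextModel.BERT)
```

Using Enum members instead of free-form strings means a typo in a Model name fails immediately with an AttributeError rather than at training time.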
Create a custom Categorization AI¶
This section explains how to train a custom Categorization AI locally, how to save it and upload it to the Konfuzio Server. If you run this tutorial in Colab and experience any version compatibility issues when working with the SDK, restart the runtime and initialize the SDK once again; this will resolve the issue.
Note: you don’t necessarily need to create the AI from scratch if you already have some document-processing architecture. You just need to wrap it into a class that corresponds to our Categorization AI structure. Follow the steps in this tutorial to find out what the requirements are.
Note: currently, the Server supports AI models created using torch<2.0.0.
By default, any Categorization AI class should derive from the AbstractCategorizationAI class and implement the following methods:
from konfuzio_sdk.trainer.document_categorization import AbstractCategorizationAI
from konfuzio_sdk.data import Page, Category
from typing import List
class CustomCategorizationAI(AbstractCategorizationAI):
    def __init__(self, categories: List[Category], *args, **kwargs):
        # a list of Categories between which the AI will differentiate
        super().__init__(categories)
        # initialize key variables required by the custom AI:
        # for instance, self.documents and self.test_documents to train and test the AI on, self.categories to
        # determine which Categories the AI will be able to predict

    def fit(self):
        # Define the architecture and training that the model undergoes, e.g. a NN architecture or custom
        # hardcoded logic, for instance:
        #
        # self.classifier_iterator = build_document_classifier_iterator(
        #     self.documents,
        #     self.train_transforms,
        #     use_image=True,
        #     use_text=False,
        #     device='cpu',
        # )
        # self.classifier._fit_classifier(self.classifier_iterator, **kwargs)
        #
        # This method does not return anything; rather, it modifies self.model if you provide this attribute.
        #
        # This method may be implemented as a no-op if you provide the trained model in other ways.
        pass

    def _categorize_page(self, page: Page) -> Page:
        # Define how the model assigns a Category to a Page, for instance:
        #
        # predicted_category_id, predicted_confidence = self._predict(page_image)
        #
        # for category in self.categories:
        #     if category.id_ == predicted_category_id:
        #         _ = CategoryAnnotation(category=category, confidence=predicted_confidence, page=page)
        #
        # **NB:** The result of categorization must be the input Page with the added attribute `Page.category`.
        pass

    def save(self, path: str):
        # Define how to save a model in a .pt format – for example, the way it's defined in the CategorizationAI:
        #
        # data_to_save = {
        #     'tokenizer': self.tokenizer,
        #     'image_preprocessing': self.image_preprocessing,
        #     'image_augmentation': self.image_augmentation,
        #     'text_vocab': self.text_vocab,
        #     'category_vocab': self.category_vocab,
        #     'classifier': self.classifier,
        #     'eval_transforms': self.eval_transforms,
        #     'train_transforms': self.train_transforms,
        #     'model_type': 'CategorizationAI',
        # }
        # torch.save(data_to_save, path)
        pass
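The skeleton above leaves every method empty. To show the same shape with working bodies, here is a self-contained toy that does not depend on the SDK or on torch. Everything in it is an assumption made for illustration: `SketchCategorizationAI` is a hypothetical stand-in for the abstract base class, Categories are plain strings, Pages are plain text, and the state is saved as JSON rather than in the .pt format.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchCategorizationAI(ABC):
    """Hypothetical stand-in for the abstract base class (not the SDK class)."""

    def __init__(self, categories: List[str]):
        self.categories = categories

    @abstractmethod
    def fit(self) -> None:
        """Train the AI or load externally trained state."""

    @abstractmethod
    def categorize_page(self, page_text: str) -> str:
        """Return the predicted Category name for one Page."""

    @abstractmethod
    def save(self, path: str) -> None:
        """Persist everything needed to restore the AI."""


class KeywordCategorizationAI(SketchCategorizationAI):
    """Toy AI: picks the Category sharing the most words with the Page text."""

    def __init__(self, categories: List[str], training_texts: Dict[str, List[str]]):
        super().__init__(categories)
        self.training_texts = training_texts
        self.keywords: Dict[str, List[str]] = {}

    def fit(self) -> None:
        # "Training" here just collects the vocabulary seen per Category.
        for category in self.categories:
            words = set()
            for text in self.training_texts.get(category, []):
                words.update(text.lower().split())
            self.keywords[category] = sorted(words)

    def categorize_page(self, page_text: str) -> str:
        tokens = set(page_text.lower().split())
        return max(self.categories, key=lambda c: len(tokens & set(self.keywords[c])))

    def save(self, path: str) -> None:
        # JSON instead of the .pt format, to keep the sketch dependency-free.
        with open(path, "w", encoding="utf-8") as file:
            json.dump({"categories": self.categories, "keywords": self.keywords}, file)

    @classmethod
    def load(cls, path: str) -> "KeywordCategorizationAI":
        with open(path, encoding="utf-8") as file:
            state = json.load(file)
        loaded = cls(state["categories"], training_texts={})
        loaded.keywords = state["keywords"]
        return loaded


sketch_ai = KeywordCategorizationAI(
    ["Invoice", "Payslip"],
    {"Invoice": ["invoice total amount due"], "Payslip": ["payslip gross net salary"]},
)
sketch_ai.fit()
predicted_name = sketch_ai.categorize_page("Total amount due: 100 EUR")

with tempfile.TemporaryDirectory() as folder:
    saved_path = os.path.join(folder, "keyword_ai.json")
    sketch_ai.save(saved_path)
    restored_ai = KeywordCategorizationAI.load(saved_path)
```

The point of the sketch is the division of labour: `fit` builds the state, `categorize_page` only reads it, and `save` stores exactly what is needed to categorize again after loading.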
Example usage of your Custom Categorization AI:
import os
from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.document_categorization import (
    CategorizationAI,
    EfficientNet,
    PageImageCategorizationModel,
)
# Initialize Project and provide the AI training and test data
project = Project(id_=YOUR_PROJECT_ID) # see https://dev.konfuzio.com/sdk/get_started.html#example-usage
categorization_pipeline = CategorizationAI(project.categories)
categorization_pipeline.categories = project.categories
categorization_pipeline.documents = [
    document for category in categorization_pipeline.categories for document in category.documents()
]
categorization_pipeline.test_documents = [
    document for category in categorization_pipeline.categories for document in category.test_documents()
]
# initialize all necessary parts of the AI – in the example we run an AI that uses images and does not use text
categorization_pipeline.category_vocab = categorization_pipeline.build_template_category_vocab()
# image processing model
image_model = EfficientNet(name='efficientnet_b0')
# building a classifier for the page images
categorization_pipeline.classifier = PageImageCategorizationModel(
    image_model=image_model,
    output_dim=len(categorization_pipeline.category_vocab),
)
categorization_pipeline.build_preprocessing_pipeline(use_image=True)
# fit the AI
categorization_pipeline.fit(n_epochs=1, optimizer={'name': 'Adam'})
# evaluate the AI
data_quality = categorization_pipeline.evaluate(use_training_docs=True)
ai_quality = categorization_pipeline.evaluate(use_training_docs=False)
# Categorize a Document
document = project.get_document_by_id(YOUR_DOCUMENT_ID)
categorization_result = categorization_pipeline.categorize(document=document)
for page in categorization_result.pages():
    print(f"Found category {page.category} for {page}")
print(f"Found category {categorization_result.category} for {categorization_result}")
# Save and load a pickle file for the model
pickle_model_path = categorization_pipeline.save(reduce_weight=False)
categorization_pipeline_loaded = CategorizationAI.load_model(pickle_model_path)
After you have trained and saved your custom AI, you can upload it using the steps from the tutorial or using the method upload_ai_model() as described in Upload your AI, provided that you have Superuser rights.
Categorization AI Overview Diagram¶
In the first diagram, we show the class hierarchy of the available Categorization Models within the SDK. Note that the Multimodal Model simply consists of a Multilayer Perceptron that takes the concatenated feature outputs of a Text Model and an Image Model, so that the predictions from both Models are combined into a single Category prediction.
In the second diagram, we show how these models are contained within a Model-based Categorization AI. The Categorization AI class provides the high level interface to categorize Documents, as exemplified in the code examples above. It uses a Page Categorization Model to categorize each Page. The Page Categorization Model is a container for Categorization Models: it wraps the feature output layers of each contained Model with a Dropout Layer and a Fully Connected Layer.
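The data flow described by the two diagrams can be sketched in plain Python. This is only an illustration under simplifying assumptions: feature vectors are short lists of floats, a single linear layer stands in for the Multilayer Perceptron and the Fully Connected Layer, and the Dropout Layer is omitted because dropout is only active during training. The helper names `fully_connected`, `softmax` and `fuse_and_classify` are hypothetical.

```python
import math
from typing import List


def fully_connected(features: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One linear layer: a weighted sum per output Category plus a bias."""
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Turn logits into probabilities (shifted by the maximum for stability)."""
    peak = max(logits)
    exponentials = [math.exp(value - peak) for value in logits]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def fuse_and_classify(
    text_features: List[float],
    image_features: List[float],
    weights: List[List[float]],
    bias: List[float],
) -> List[float]:
    """Concatenate both feature vectors and classify them jointly."""
    combined = text_features + image_features  # list concatenation
    return softmax(fully_connected(combined, weights, bias))


text_features = [1.0, 0.0]   # made-up output of a Text Model
image_features = [0.0, 2.0]  # made-up output of an Image Model
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # two Categories, four inputs
bias = [0.0, 0.0]
probabilities = fuse_and_classify(text_features, image_features, weights, bias)
predicted_index = probabilities.index(max(probabilities))
```

Because both feature vectors feed the same output layer, the result is one probability per Category rather than separate text and image predictions.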