Document Categorization

When uploading a Document to Konfuzio, the first step is to assign it to a Category. This can be done manually, or automatically using a Categorization AI. Categorization always happens on a Project level.

Setting the Category of a Document and its individual Pages Manually

You can initialize a Document with a Category. You can also use Document.set_category to set a Document’s Category after it has been initialized. This will count as if a human manually revised it.

project = Project(id_=YOUR_PROJECT_ID)
my_category = project.get_category_by_id(YOUR_CATEGORY_ID)

my_document = Document(text="My text.", project=project, category=my_category)
assert my_document.category == my_category
assert my_document.category_is_revised is True

If a Document is initialized with no Category, it will automatically be set to NO_CATEGORY. Another Category can be manually set later.

document = project.get_document_by_id(YOUR_DOCUMENT_ID)
assert document.category == project.no_category
document.set_category(my_category)
assert document.category == my_category
assert document.category_is_revised is True
# This will set it for all of its Pages as well.
for page in document.pages():
    assert page.category == my_category

If you use a Categorization AI to automatically assign a Category to a Document (such as the NameBasedCategorizationAI), each Page will be assigned a Category Annotation with predicted confidence information, and the following properties will be accessible. You can also find these documented under API Reference - Document, API Reference - Page and API Reference - Category Annotation.




CategoryAnnotation.category: The AI-predicted Category of this Category Annotation.

CategoryAnnotation.confidence: The AI-predicted confidence of this Category Annotation.

Document.category_annotations: List of predicted Category Annotations at the Document level.

Document.maximum_confidence_category_annotation: Get the maximum confidence predicted Category Annotation, or the human revised one if present.

Document.maximum_confidence_category: Get the maximum confidence predicted Category, or the human revised one if present.

Document.category: Returns a Category only if all Pages have the same Category, otherwise None. In that case, it hints at the fact that the Document should probably be revised or split into Documents with consistently categorized Pages.

Page.category_annotations: List of predicted Category Annotations at the Page level.

Page.maximum_confidence_category_annotation: Get the maximum confidence predicted Category Annotation, or the one revised by the user, for this Page.

Page.category: Get the maximum confidence predicted Category, or the one revised by the user, for this Page.
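
As a rough illustration of how the Page-level and Document-level properties relate, the following sketch uses plain Python stand-ins rather than SDK objects. The class CategoryAnnotationStub and both helper functions are hypothetical names introduced here. The sketch only assumes the behaviour described above: a human-revised Category Annotation takes precedence, otherwise the most confident prediction wins, and the Document only receives a Category when all Pages agree.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CategoryAnnotationStub:
    """Hypothetical stand-in for a Category Annotation (not the SDK class)."""

    category: str
    confidence: float
    revised: bool = False  # True if a human set this Category


def page_category(annotations: List[CategoryAnnotationStub]) -> Optional[str]:
    """Return the human-revised Category if present, else the most confident one."""
    if not annotations:
        return None
    for annotation in annotations:
        if annotation.revised:
            return annotation.category
    return max(annotations, key=lambda a: a.confidence).category


def document_category(page_categories: List[Optional[str]]) -> Optional[str]:
    """Return a Category only if every Page has the same one, otherwise None."""
    if len(set(page_categories)) == 1:
        return page_categories[0]
    return None


page_1 = [CategoryAnnotationStub("Invoice", 0.9), CategoryAnnotationStub("Receipt", 0.1)]
page_2 = [CategoryAnnotationStub("Invoice", 0.4), CategoryAnnotationStub("Receipt", 0.6)]
per_page = [page_category(page_1), page_category(page_2)]
print(per_page, document_category(per_page))
```

Here the two Pages disagree, so the Document-level result is None, which mirrors the hint that such a Document may need to be revised or split.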

To categorize a Document with a Categorization AI, we have two main options: the Name-based Categorization AI and the more complex Model-based Categorization AI.

Name-based Categorization AI

The name-based Categorization AI is a simple fallback that uses the name of each Category to categorize Documents when no model-based Categorization AI is available:

from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.document_categorization import NameBasedCategorizationAI

# Set up your Project.
project = Project(id_=YOUR_PROJECT_ID)

# Initialize the Categorization Model.
categorization_model = NameBasedCategorizationAI(project.categories)

# Retrieve a Document to categorize.
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)

# The Categorization Model returns a copy of the SDK Document with Category attribute
# (use inplace=True to maintain the original Document instead).
# If the input Document is already categorized, the already present Category is used
# (use recategorize=True if you want to force a recategorization).
result_doc = categorization_model.categorize(document=test_document)

# Each Page is categorized individually.
for page in result_doc.pages():
    assert page.category == project.categories[0]
    print(f"Found category {page.category} for {page}")

# The Category of the Document is defined when all pages' Categories are equal.
# If the Document contains mixed Categories, only the Page level Category will be defined,
# and the Document level Category will be NO_CATEGORY.
print(f"Found category {result_doc.category} for {result_doc}")
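
The idea behind name-based categorization can be sketched without the SDK. The function below is a simplified illustration, not the implementation of NameBasedCategorizationAI: it assumes that a Page receives the first Category whose name occurs in the Page text (compared case-insensitively), and categorize_page_by_name is a hypothetical helper.

```python
from typing import List, Optional


def categorize_page_by_name(page_text: str, category_names: List[str]) -> Optional[str]:
    """Return the first Category name found in the Page text, or None if none matches."""
    lowered_text = page_text.lower()
    for name in category_names:
        if name.lower() in lowered_text:
            return name
    return None


category_names = ["Invoice", "Payslip"]
page_texts = ["INVOICE no. 123 for services rendered", "Payslip for March", "Unrelated text"]
results = [categorize_page_by_name(text, category_names) for text in page_texts]
print(results)
```

A rule like this needs no training data, which is why it works as a fallback, but it fails whenever the Category name does not literally appear on the Page.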

Model-based Categorization AI

For better results you can build, train and test a Categorization AI using Image Models and Text Models to classify the image and text of each Page:

from konfuzio_sdk.data import Project, Document
from konfuzio_sdk.trainer.document_categorization import build_categorization_ai_pipeline
from konfuzio_sdk.trainer.document_categorization import ImageModel, TextModel, CategorizationAI

# Set up your Project.
project = Project(id_=YOUR_PROJECT_ID)
# Build the Categorization AI architecture using a template
# of pre-built Image and Text classification Models.
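categorization_pipeline = build_categorization_ai_pipeline(
    categories=project.categories,
    documents=project.documents,
    test_documents=project.test_documents,
    # Illustrative choice of Models; any supported Image and/or Text Model can be used.
    image_model=ImageModel.EfficientNetB0,
    text_model=TextModel.NBOWSelfAttention,
)

# Train the AI (n_epochs=1 is an illustrative value).
categorization_pipeline.fit(n_epochs=1, optimizer={'name': 'Adam'})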

# Evaluate the AI
data_quality = categorization_pipeline.evaluate(use_training_docs=True)
ai_quality = categorization_pipeline.evaluate()
assert data_quality.f1(None) == 1.0
assert ai_quality.f1(None) == 1.0

# Categorize a Document
document = project.get_document_by_id(YOUR_DOCUMENT_ID)
categorization_result = categorization_pipeline.categorize(document=document)
assert isinstance(categorization_result, Document)
for page in categorization_result.pages():
    print(f"Found category {page.category} for {page}")

# Save and load a pickle file for the AI
pickle_ai_path = categorization_pipeline.save()
categorization_pipeline = CategorizationAI.load_model(pickle_ai_path)

To prepare the data for training and testing your AI, follow the data preparation tutorial.

For a list of the available Models, see the Categorization AI Models section below.

Categorization AI Models

When using build_categorization_ai_pipeline, you can select which Image Model and/or Text Model to use for classification. At least one of the two must be specified. Both can also be used at the same time.

The list of available Categorization Models is implemented as an Enum containing the following elements:

from konfuzio_sdk.trainer.document_categorization import ImageModel, TextModel

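# Image Models (both Enums can be iterated to list their members)
image_models = list(ImageModel)

# Text Models
text_models = list(TextModel)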

See more details about these Categorization Models under API Reference - Categorization AI.
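
The rule that at least one Model must be chosen can be pictured with a small stand-alone sketch. It does not use the SDK: ImageChoice, TextChoice and describe_pipeline are hypothetical names, and the Enum members are placeholders rather than real Categorization Models.

```python
from enum import Enum
from typing import Optional


class ImageChoice(Enum):
    """Hypothetical stand-in for an Enum of Image Models."""

    SMALL_CNN = "small_cnn"


class TextChoice(Enum):
    """Hypothetical stand-in for an Enum of Text Models."""

    BAG_OF_WORDS = "bag_of_words"


def describe_pipeline(
    image_model: Optional[ImageChoice] = None,
    text_model: Optional[TextChoice] = None,
) -> str:
    """Report which kind of pipeline a selection would lead to."""
    if image_model is None and text_model is None:
        raise ValueError("Specify at least one of image_model or text_model.")
    if image_model is not None and text_model is not None:
        return "multimodal"
    return "image-only" if image_model is not None else "text-only"


print(describe_pipeline(image_model=ImageChoice.SMALL_CNN))
print(describe_pipeline(text_model=TextChoice.BAG_OF_WORDS))
print(describe_pipeline(ImageChoice.SMALL_CNN, TextChoice.BAG_OF_WORDS))
```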

Create a custom Categorization AI

This section explains how to train a custom Categorization AI locally, how to save it and upload it to the Konfuzio Server. If you run this tutorial in Colab and experience any version compatibility issues when working with the SDK, restart the runtime and initialize the SDK once again; this will resolve the issue.

Note: you don’t necessarily need to create the AI from scratch if you already have some document-processing architecture. You just need to wrap it in a class that follows our Categorization AI structure. Follow the steps in this tutorial to find out what the requirements are.

Note: currently, the Server supports AI models created using torch<2.0.0.

By default, any Categorization AI class should derive from the AbstractCategorizationAI class and implement the following methods:

from konfuzio_sdk.trainer.document_categorization import AbstractCategorizationAI
from konfuzio_sdk.data import Page, Category
from typing import List

class CustomCategorizationAI(AbstractCategorizationAI):
    def __init__(self, categories: List[Category], *args, **kwargs):
        # a list of Categories between which the AI will differentiate
        super().__init__(categories)
        # initialize key variables required by the custom AI:
        # for instance, self.documents and self.test_documents to train and test the AI on, self.categories to determine
        # which Categories the AI will be able to predict

    def fit(self):
        # Define architecture and training that the model undergoes, e.g. a NN architecture or a custom hardcoded logic
        # for instance:
        # self.classifier_iterator = build_document_classifier_iterator(
        #             self.documents,
        #             self.train_transforms,
        #             use_image = True,
        #             use_text = False,
        #             device='cpu',
        #         )
        # self.classifier._fit_classifier(self.classifier_iterator, **kwargs)
        # This method does not return anything; rather, it modifies the self.model if you provide this attribute.
        # This method is allowed to be implemented as a no-op if you provide the trained model in other ways
        pass

    def _categorize_page(self, page: Page) -> Page:
        # define how the model assigns a Category to a Page.
        # for instance:
        # predicted_category_id, predicted_confidence = self._predict(page_image)
        # for category in self.categories:
        #     if category.id_ == predicted_category_id:
        #         _ = CategoryAnnotation(category=category, confidence=predicted_confidence, page=page)
        # **NB:** The result of categorization must be the input Page with added Categorization attribute `Page.category`
        pass

    def save(self, path: str):
        # define how to save a model in a .pt format, for example in the way it's defined in the CategorizationAI
        #  data_to_save = {
        #             'tokenizer': self.tokenizer,
        #             'image_preprocessing': self.image_preprocessing,
        #             'image_augmentation': self.image_augmentation,
        #             'text_vocab': self.text_vocab,
        #             'category_vocab': self.category_vocab,
        #             'classifier': self.classifier,
        #             'eval_transforms': self.eval_transforms,
        #             'train_transforms': self.train_transforms,
        #             'model_type': 'CategorizationAI',
        #         }
        # torch.save(data_to_save, path)
        pass
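
To make the save step more concrete, here is a minimal round-trip sketch. It is an assumption-laden illustration, not the SDK's save logic: it uses pickle from the standard library in place of torch.save so that it runs without extra dependencies, save_state and load_state are hypothetical helpers, and the dictionary content is made up for the example.

```python
import os
import pickle
import tempfile
from typing import Any, Dict


def save_state(data_to_save: Dict[str, Any], path: str) -> str:
    """Serialize everything the AI needs at prediction time and return the file path."""
    with open(path, "wb") as file:
        pickle.dump(data_to_save, file)
    return path


def load_state(path: str) -> Dict[str, Any]:
    """Restore the dictionary written by save_state."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Illustrative state only; a real AI would store its trained components here.
state = {"category_vocab": ["NO_CATEGORY", "Invoice"], "model_type": "CustomCategorizationAI"}

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_state(state, os.path.join(tmp_dir, "custom_ai.pkl"))
    restored = load_state(saved_path)

print(restored["model_type"])
```

Storing a model_type entry alongside the trained components lets a loader check that a file actually contains the kind of AI it expects before using it.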

Example usage of your Custom Categorization AI:

import os
from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.document_categorization import (
    CategorizationAI,
    EfficientNet,
    PageImageCategorizationModel,
)

# Initialize Project and provide the AI training and test data
project = Project(id_=YOUR_PROJECT_ID)  # see

categorization_pipeline = CategorizationAI(project.categories)
categorization_pipeline.categories = project.categories
categorization_pipeline.documents = [
    document for category in categorization_pipeline.categories for document in category.documents()
]
categorization_pipeline.test_documents = [
    document for category in categorization_pipeline.categories for document in category.test_documents()
]
# initialize all necessary parts of the AI – in the example we run an AI that uses images and does not use text
categorization_pipeline.category_vocab = categorization_pipeline.build_template_category_vocab()
# image processing model
image_model = EfficientNet(name='efficientnet_b0')
# building a classifier for the page images
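categorization_pipeline.classifier = PageImageCategorizationModel(
    # Assumed arguments: the image model and one output per Category in the vocabulary.
    image_model=image_model,
    output_dim=len(categorization_pipeline.category_vocab),
)
# Assumed step: set up image preprocessing, since this AI uses images only.
categorization_pipeline.build_preprocessing_pipeline(use_image=True)
# fit the AI (n_epochs=1 is an illustrative value)
categorization_pipeline.fit(n_epochs=1, optimizer={'name': 'Adam'})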

# evaluate the AI
data_quality = categorization_pipeline.evaluate(use_training_docs=True)
ai_quality = categorization_pipeline.evaluate(use_training_docs=False)

# Categorize a Document
document = project.get_document_by_id(YOUR_DOCUMENT_ID)
categorization_result = categorization_pipeline.categorize(document=document)
for page in categorization_result.pages():
    print(f"Found category {page.category} for {page}")
print(f"Found category {categorization_result.category} for {categorization_result}")

# Save and load a pickle file for the model
pickle_model_path = categorization_pipeline.save()
categorization_pipeline_loaded = CategorizationAI.load_model(pickle_model_path)

After you have trained and saved your custom AI, you can upload it using the steps from the tutorial or using the method upload_ai_model() as described in Upload your AI, provided that you have the Superuser rights.

Categorization AI Overview Diagram

In the first diagram, we show the class hierarchy of the available Categorization Models within the SDK. Note that the Multimodal Model simply consists of a Multilayer Perceptron that concatenates the feature outputs of a Text Model and an Image Model, so that the predictions from both Models are combined into a single Category prediction.

In the second diagram, we show how these models are contained within a Model-based Categorization AI. The Categorization AI class provides the high level interface to categorize Documents, as exemplified in the code examples above. It uses a Page Categorization Model to categorize each Page. The Page Categorization Model is a container for Categorization Models: it wraps the feature output layers of each contained Model with a Dropout Layer and a Fully Connected Layer.
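
The way feature outputs are combined into one prediction can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not the SDK's architecture code: the feature vectors and weights are hand-picked rather than trained, the function names are hypothetical, and the Dropout Layer is left out because dropout is inactive at prediction time.

```python
import math
from typing import List


def fully_connected(features: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: one output value per row of weights."""
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exponentials = [math.exp(value - highest) for value in logits]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def predict_category_index(
    text_features: List[float],
    image_features: List[float],
    weights: List[List[float]],
    bias: List[float],
) -> int:
    """Concatenate both feature vectors and return the index of the most likely Category."""
    fused = text_features + image_features  # list concatenation
    probabilities = softmax(fully_connected(fused, weights, bias))
    return max(range(len(probabilities)), key=probabilities.__getitem__)


text_features = [1.0, 0.0]   # made-up output of a text model
image_features = [0.0, 2.0]  # made-up output of an image model
weights = [
    [1.0, 0.0, 0.0, 0.0],  # Category 0 reacts to the first text feature
    [0.0, 0.0, 0.0, 1.0],  # Category 1 reacts to the second image feature
]
bias = [0.0, 0.0]
predicted = predict_category_index(text_features, image_features, weights, bias)
print(predicted)
```

With these numbers the image feature produces the larger score, so the fused prediction follows the image evidence even though the text feature points to the other Category.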