Create and save a custom Categorization AI


Prerequisites:

  • Data Layer concepts of Konfuzio: Project, Document, Category, Page

  • AI concepts of Konfuzio: Categorization

  • Understanding of OOP: Classes, inheritance

Difficulty: Medium

Goal: Learn how to create a custom Categorization AI with a manually defined architecture and save it in a Bento containerized format.


Environment

You need to install the Konfuzio SDK before diving into the tutorial.
To get up and running quickly, you can use our Colab Quick Start notebook.

Alternatively, you can follow the installation section to install and initialize the Konfuzio SDK locally or in an environment of your choice.

Introduction

This tutorial explains how to build and train a custom Categorization AI locally, how to save it, and how to upload it to the Konfuzio Server. If you run this tutorial in Colab and run into version compatibility issues when working with the SDK, restart the runtime and initialize the SDK again; this should resolve the issue.

Note: you don’t necessarily need to create the AI from scratch if you already have a Document-processing architecture. You only need to wrap it in a class that matches our Categorization AI structure. Follow the steps in this tutorial to find out what the requirements are.

By default, any Categorization AI class should derive from the AbstractCategorizationAI class and implement the methods __init__, fit, _categorize_page and build_bento, as well as the properties bento_metadata and entrypoint_methods.

Let’s make the necessary imports and define the class. In __init__, you can either leave the defaults unchanged or initialize key variables required by your custom AI.

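import json
import os
import pathlib
import shutil
import tempfile
from typing import List

import bentoml
import lz4
import torch
from bentoml import Bento
from konfuzio_sdk.data import Category, CategoryAnnotation, Page, Project
from konfuzio_sdk.trainer.document_categorization import (
    AbstractCategorizationAI,
    build_document_classifier_iterator,
)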

class CustomCategorizationAI(AbstractCategorizationAI):
    """Define a custom Categorization AI."""

    # specify which inputs the AI needs; this example categorizes Page images, not text
    requires_text = False
    requires_images = True
    requires_segmentation = False # always False because it is only required for existing segmentation AIs

    def __init__(self, categories: List[Category], *args, **kwargs) -> None:
        """
        Initialize a class.

        categories: A list of Categories between which the AI is going to distinguish.
        """
        # a list of Categories between which the AI will differentiate
        super().__init__(categories)

Then we need to define the fit method. It can contain any custom architecture, for instance, a multi-layered perceptron, or some hardcoded logic like Name-based Categorization. This method does not return anything; rather, it modifies self.model if you provide this attribute.

This method may be implemented as a no-op if you provide the trained model in another way.

    def fit(self, **kwargs) -> None:
        """Fit the classifier on training Documents; kwargs (e.g. n_epochs) are passed on to the classifier."""
        self.classifier_iterator = build_document_classifier_iterator(
            self.documents,
            self.train_transforms,
            use_image=True,
            use_text=False,
            device='cpu',
        )
        self.classifier._fit_classifier(self.classifier_iterator, **kwargs)
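
As an illustration of the no-op case, the sketch below shows hardcoded name-based logic that needs no training. It is not the SDK's implementation: StubCategory and NameBasedCategorizer are hypothetical stand-ins that run without konfuzio_sdk, and the matching rule (Category name appears in the Page text) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StubCategory:
    """Hypothetical stand-in for a Konfuzio Category (not the SDK class)."""

    id_: int
    name: str


class NameBasedCategorizer:
    """Pick the Category whose name occurs in the page text; needs no training."""

    def __init__(self, categories: List[StubCategory]) -> None:
        self.categories = categories

    def fit(self) -> None:
        """No-op: the rule in predict() has no trainable parameters."""
        return None

    def predict(self, page_text: str) -> Optional[StubCategory]:
        """Return the first Category whose name appears in the text, else None."""
        lowered = page_text.lower()
        for category in self.categories:
            if category.name.lower() in lowered:
                return category
        return None


categorizer = NameBasedCategorizer([StubCategory(1, 'Invoice'), StubCategory(2, 'Payslip')])
categorizer.fit()  # does nothing, but keeps the interface uniform
match = categorizer.predict('INVOICE no. 2024-001')
miss = categorizer.predict('Meeting notes')
```

In a real custom AI, the same rule would live inside _categorize_page, and fit would stay empty.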

Next, we need to define how the model assigns a Category to a Page inside the _categorize_page method. NB: The result of categorization must be the input Page with the added Categorization attribute Page.category.

    def _categorize_page(self, page: Page) -> Page:
        """
        Assign Category Annotations to the Document's Page.
        
        page: A Page to which Category Annotations are assigned.
        """
        page_image = page.get_image()
        predicted_category_id, predicted_confidence = self._predict(page_image)
        
        for category in self.categories:
            if category.id_ == predicted_category_id:
                _ = CategoryAnnotation(category=category, confidence=predicted_confidence, page=page)
        
        return page
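
The lookup step above can be tried in isolation. The following is a minimal sketch under stated assumptions: StubCategory and StubPage are hypothetical stand-ins for the SDK classes, and a list of (category, confidence) pairs plays the role of Category Annotations. It uses a dictionary lookup instead of a loop and leaves the Page unchanged when the predicted ID is unknown.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StubCategory:
    """Hypothetical stand-in for a Konfuzio Category."""

    id_: int
    name: str


@dataclass
class StubPage:
    """Hypothetical stand-in for a Konfuzio Page."""

    number: int
    # (category, confidence) pairs stand in for Category Annotations
    category_annotations: List[Tuple[StubCategory, float]] = field(default_factory=list)


def attach_prediction(
    page: StubPage,
    categories: List[StubCategory],
    predicted_category_id: int,
    confidence: float,
) -> StubPage:
    """Attach the predicted Category to the page and return the same page object."""
    by_id = {category.id_: category for category in categories}
    category = by_id.get(predicted_category_id)
    if category is not None:  # ignore IDs the AI was not configured with
        page.category_annotations.append((category, confidence))
    return page


categories = [StubCategory(1, 'Invoice'), StubCategory(2, 'Payslip')]
known = attach_prediction(StubPage(number=1), categories, predicted_category_id=2, confidence=0.9)
unknown = attach_prediction(StubPage(number=2), categories, predicted_category_id=99, confidence=0.5)
```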

BentoML lets you containerize AI models and run them independently. A custom Categorization AI class needs three members to support BentoML:

  • build_bento(), a method that builds the Bento archive, which can later be uploaded to the Konfuzio app or an on-prem installation, or served and used locally;

  • entrypoint_methods, a property that defines which methods are exposed in the resulting Bento model (typically categorize is enough);

  • bento_metadata, a property that defines which metadata is saved with the model: whether it requires images, text and/or segmentation, and the names of the expected request and response formats (Pydantic schemas).

    @property
    def entrypoint_methods(self) -> dict:
        """List the model's methods to make it accessible in a containerized instance of the AI."""
        return {
            'categorize': {'batchable': False}
        }

    @property
    def bento_metadata(self) -> dict:
        """List if the AI requires processing of images/text and segmentation, and specify formats of request and response. Needed for server support of the AI. """
        return {
            'requires_images': getattr(self, 'requires_images', False),
            'requires_segmentation': getattr(self, 'requires_segmentation', False),
            'requires_text': getattr(self, 'requires_text', False),
            'request': 'CategorizeRequest20240729', 
            'response': 'CategorizeResponse20240729', 
        }
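
The getattr calls with a False default mean that a flag the class does not declare is reported as not required. A minimal sketch of that behaviour, using a hypothetical ImageOnlyAI class and a hypothetical helper collect_requirements that are not part of the SDK:

```python
class ImageOnlyAI:
    """Hypothetical AI class that declares only one of the three flags."""

    requires_images = True


def collect_requirements(ai: object) -> dict:
    """Read the three requirement flags, falling back to False when a flag is missing."""
    return {
        'requires_images': getattr(ai, 'requires_images', False),
        'requires_segmentation': getattr(ai, 'requires_segmentation', False),
        'requires_text': getattr(ai, 'requires_text', False),
    }


requirements = collect_requirements(ImageOnlyAI())
```

Declaring all three flags explicitly on the class, as in the tutorial, avoids relying on these fallbacks.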

    def build_bento(self, bento_model) -> Bento:
        """Build an archived instance of the AI."""
        bento_base_dir = os.path.dirname(os.path.abspath(__file__)) + '/../bento' # specify your own path if the root folder where konfuzio_sdk is stored has a different name
        dict_metadata = self.project.create_project_metadata_dict()

        #  create a temporary directory for a future bento archive
        with tempfile.TemporaryDirectory() as temp_dir:
            # copy bento directories to temp_dir
            shutil.copytree(bento_base_dir + '/categorization', temp_dir + '/categorization') # specify a different directory name if yours differs from the default one
            shutil.copytree(bento_base_dir + '/base', temp_dir + '/base') 
            # copy __init__.py file
            shutil.copy(bento_base_dir + '/__init__.py', temp_dir + '/__init__.py')
            # include metadata
            with open(f'{temp_dir}/categories_and_label_data.json5', 'w') as f:
                json.dump(dict_metadata, f, indent=2, sort_keys=True)
            # include the AI model name so the service can load it correctly
            with open(f'{temp_dir}/AI_MODEL_NAME', 'w') as f:
                f.write(self._pkl_name)

            built_bento = bentoml.bentos.build(
                name=f"categorization_{self.category.id_ if self.category else '0'}",
                service='categorization/categorizationai_service.py:CategorizationService',
                include=[
                    '__init__.py',
                    'base/*.py',
                    'categorization/*.py', # specify a different directory name if yours differs from the default one
                    'categories_and_label_data.json5',
                    'AI_MODEL_NAME',
                ],
                labels=self.bento_metadata,
                python={
                    'packages': [
                        'https://github.com/konfuzio-ai/konfuzio-sdk/archive/refs/heads/master.zip#egg=konfuzio-sdk'
                        # specify any packages you might need aside from konfuzio_sdk, if you use them
                    ],
                    'lock_packages': True,
                },
                build_ctx=temp_dir,
                models=[str(bento_model.tag)],
            )

        return built_bento
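
The file-writing part of the build context can be checked without BentoML. The sketch below uses only the standard library; write_build_context, the metadata dictionary and the model name 'demo_model' are hypothetical values chosen for illustration. It also shows why os.path.join is a safer way to build paths than string concatenation, where a missing separator silently produces a path outside the build directory.

```python
import json
import os
import tempfile


def write_build_context(build_dir: str, metadata: dict, model_name: str) -> list:
    """Write the metadata file and the model-name file, then list the directory."""
    with open(os.path.join(build_dir, 'categories_and_label_data.json5'), 'w') as f:
        json.dump(metadata, f, indent=2, sort_keys=True)
    with open(os.path.join(build_dir, 'AI_MODEL_NAME'), 'w') as f:
        f.write(model_name)
    return sorted(os.listdir(build_dir))


with tempfile.TemporaryDirectory() as temp_dir:
    metadata = {'categories': [{'id': 1, 'name': 'Invoice'}]}  # illustrative metadata only
    created = write_build_context(temp_dir, metadata, 'demo_model')
    with open(os.path.join(temp_dir, 'AI_MODEL_NAME')) as f:
        stored_name = f.read()
    with open(os.path.join(temp_dir, 'categories_and_label_data.json5')) as f:
        stored_metadata = json.load(f)

    good_path = os.path.join(temp_dir, '__init__.py')  # inside the build directory
    bad_path = temp_dir + '__init__.py'  # missing separator: a sibling of the build directory
```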

After building the class, we need to test it to ensure it works. Let’s make the necessary imports and initialize the Project and the AI. You can run the AI over a small subset of Documents so that it does not take too much time.

from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.document_categorization import (
    EfficientNet,
    PageImageCategorizationModel,
)

project = Project(id_=YOUR_PROJECT_ID)
categorization_pipeline = CustomCategorizationAI(project.categories)

# keep only the first five training and test Documents to speed things up
categorization_pipeline.documents = [
    document for category in categorization_pipeline.categories for document in category.documents()
][:5]
categorization_pipeline.test_documents = [
    document for category in categorization_pipeline.categories for document in category.test_documents()
][:5]

Then, define all necessary components of the AI, train and evaluate it.

categorization_pipeline.category_vocab = categorization_pipeline.build_template_category_vocab()
image_model = EfficientNet(name='efficientnet_b0')
categorization_pipeline.classifier = PageImageCategorizationModel(
        image_model=image_model,
        output_dim=len(categorization_pipeline.category_vocab),
    )
categorization_pipeline.build_preprocessing_pipeline(use_image=True)
categorization_pipeline.fit(n_epochs=1)
data_quality = categorization_pipeline.evaluate(use_training_docs=True)
ai_quality = categorization_pipeline.evaluate(use_training_docs=False)
print(data_quality.f1(category=project.categories[0]))
print(ai_quality.f1(category=project.categories[0]))
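
For reference, the F1 score printed above is conventionally the harmonic mean of precision and recall. The sketch below is the standard textbook definition, not code taken from the SDK's evaluation classes; the function name f1_score and the counts are assumptions for illustration.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Compute F1 as the harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0  # also avoids division by zero when nothing was predicted correctly
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# 8 Pages correctly assigned to a Category, 2 wrongly assigned to it, 2 missed
example_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```

With only five Documents per split, as in this tutorial, such scores are coarse and mainly confirm that the pipeline runs end to end.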

Now you can categorize a Document.

document = project.get_document_by_id(YOUR_DOCUMENT_ID)
categorization_result = categorization_pipeline.categorize(document=document)
for page in categorization_result.pages():
    print(f"Found category {page.category} for {page}")
print(f"Found category {categorization_result.category} for {categorization_result}")

Finally, save the model and check that it is loadable. Saving can be done in two ways: as a compressed model or as a Bento containerized instance. Note: both methods are currently supported by Konfuzio Server, but the former will be deprecated once Bento components are developed for all AI types.

Saving and loading a model via compression:

pickle_model_path = categorization_pipeline.save(reduce_weight=False)
categorization_pipeline_loaded = CustomCategorizationAI.load_model(pickle_model_path)

Saving a model as a Bento instance:

bento, path_to_bento = categorization_pipeline.save_bento(output_dir=project.model_folder)

To test that a model works, you can find out the model’s name in BentoML’s local registry and serve it via the following command:

bento_name = bento.tag.name + ':' + bento.tag.version
print(bento_name)

%%bash
bentoml serve categorization_0:gmu2lrbyugahyasc # replace the name with the value of your bento_name variable
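
Outside a notebook, the serve command can be assembled in Python from the tag. This is a minimal sketch: StubTag is a hypothetical stand-in for the tag object returned with the Bento, and the version string is the placeholder from the command above.

```python
import shlex
from dataclasses import dataclass


@dataclass
class StubTag:
    """Hypothetical stand-in for a Bento tag with a name and a version."""

    name: str
    version: str


def serve_command(tag: StubTag) -> str:
    """Build the shell command that serves the Bento identified by the tag."""
    bento_name = f'{tag.name}:{tag.version}'
    return 'bentoml serve ' + shlex.quote(bento_name)


command = serve_command(StubTag(name='categorization_0', version='gmu2lrbyugahyasc'))
```

The resulting string can be pasted into a terminal or passed to a process runner of your choice.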

After you have trained and saved your custom AI, you can upload it using the steps from the tutorial or using the method upload_ai_model() as described in Upload your AI, provided that you have the Superuser rights.

Conclusion

In this tutorial, we have walked through constructing, training and testing a custom Categorization AI and preparing it for upload. Below is the full code to accomplish this task. Note that this class is purely demonstrative; to make it functional, you need to replace the contents of the defined methods with your own code.

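import json
import os
import pathlib
import shutil
import tempfile
from typing import List

import bentoml
import lz4
import torch
from bentoml import Bento
from konfuzio_sdk.data import Category, CategoryAnnotation, Page, Project
from konfuzio_sdk.trainer.document_categorization import (
    AbstractCategorizationAI,
    EfficientNet,
    PageImageCategorizationModel,
    build_document_classifier_iterator,
)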

class CustomCategorizationAI(AbstractCategorizationAI):
    """Define a custom Categorization AI."""

    # specify which inputs the AI needs; this example categorizes Page images, not text
    requires_text = False
    requires_images = True
    requires_segmentation = False # always False because it is only required for existing segmentation AIs

    def __init__(self, categories: List[Category], *args, **kwargs) -> None:
        """
        Initialize a class.

        categories: A list of Categories between which the AI is going to distinguish.
        """
        # a list of Categories between which the AI will differentiate
        super().__init__(categories)
    
    def fit(self, **kwargs) -> None:
        """Fit the classifier on training Documents; kwargs (e.g. n_epochs) are passed on to the classifier."""
        self.classifier_iterator = build_document_classifier_iterator(
            self.documents,
            self.train_transforms,
            use_image=True,
            use_text=False,
            device='cpu',
        )
        self.classifier._fit_classifier(self.classifier_iterator, **kwargs)
    
    def _categorize_page(self, page: Page) -> Page:
        """
        Assign Category Annotations to the Document's Page.
        
        page: A Page to which Category Annotations are assigned.
        """
        page_image = page.get_image()
        predicted_category_id, predicted_confidence = self._predict(page_image)
        
        for category in self.categories:
            if category.id_ == predicted_category_id:
                _ = CategoryAnnotation(category=category, confidence=predicted_confidence, page=page)
        
        return page
    
    @property
    def entrypoint_methods(self) -> dict:
        """List the model's methods to make it accessible in a containerized instance of the AI."""
        return {
            'categorize': {'batchable': False}
        }

    @property
    def bento_metadata(self) -> dict:
        """List if the AI requires processing of images/text and segmentation, and specify formats of request and response. Needed for server support of the AI. """
        return {
            'requires_images': getattr(self, 'requires_images', False),
            'requires_segmentation': getattr(self, 'requires_segmentation', False),
            'requires_text': getattr(self, 'requires_text', False),
            'request': 'CategorizeRequest20240729', 
            'response': 'CategorizeResponse20240729', 
        }

    def build_bento(self, bento_model) -> Bento:
        """Build an archived instance of the AI."""
        bento_base_dir = os.path.dirname(os.path.abspath(__file__)) + '/../bento' # specify your own path if the root folder where konfuzio_sdk is stored has a different name
        dict_metadata = self.project.create_project_metadata_dict()

        #  create a temporary directory for a future bento archive
        with tempfile.TemporaryDirectory() as temp_dir:
            # copy bento directories to temp_dir
            shutil.copytree(bento_base_dir + '/categorization', temp_dir + '/categorization') # specify a different directory name if yours differs from the default one
            shutil.copytree(bento_base_dir + '/base', temp_dir + '/base') 
            # copy __init__.py file
            shutil.copy(bento_base_dir + '/__init__.py', temp_dir + '/__init__.py')
            # include metadata
            with open(f'{temp_dir}/categories_and_label_data.json5', 'w') as f:
                json.dump(dict_metadata, f, indent=2, sort_keys=True)
            # include the AI model name so the service can load it correctly
            with open(f'{temp_dir}/AI_MODEL_NAME', 'w') as f:
                f.write(self._pkl_name)

            built_bento = bentoml.bentos.build(
                name=f"categorization_{self.category.id_ if self.category else '0'}",
                service='categorization/categorizationai_service.py:CategorizationService',
                include=[
                    '__init__.py',
                    'base/*.py',
                    'categorization/*.py', # specify a different directory name if yours differs from the default one
                    'categories_and_label_data.json5',
                    'AI_MODEL_NAME',
                ],
                labels=self.bento_metadata,
                python={
                    'packages': [
                        'https://github.com/konfuzio-ai/konfuzio-sdk/archive/refs/heads/master.zip#egg=konfuzio-sdk'
                        # specify any packages you might need aside from konfuzio_sdk, if you use them
                    ],
                    'lock_packages': True,
                },
                build_ctx=temp_dir,
                models=[str(bento_model.tag)],
            )

        return built_bento

project = Project(id_=YOUR_PROJECT_ID)
categorization_pipeline = CustomCategorizationAI(project.categories)

categorization_pipeline.documents = [
        document for category in categorization_pipeline.categories for document in category.documents()
    ][:5]
categorization_pipeline.test_documents = [
    document for category in categorization_pipeline.categories for document in category.test_documents()
][:5]

categorization_pipeline.category_vocab = categorization_pipeline.build_template_category_vocab()
image_model = EfficientNet(name='efficientnet_b0')
categorization_pipeline.classifier = PageImageCategorizationModel(
        image_model=image_model,
        output_dim=len(categorization_pipeline.category_vocab),
    )
categorization_pipeline.build_preprocessing_pipeline(use_image=True)
categorization_pipeline.fit(n_epochs=1)

data_quality = categorization_pipeline.evaluate(use_training_docs=True)
ai_quality = categorization_pipeline.evaluate(use_training_docs=False)
print(data_quality.f1(category=project.categories[0]))
print(ai_quality.f1(category=project.categories[0]))

document = project.get_document_by_id(YOUR_DOCUMENT_ID)
categorization_result = categorization_pipeline.categorize(document=document)
for page in categorization_result.pages():
    print(f"Found category {page.category} for {page}")
print(f"Found category {categorization_result.category} for {categorization_result}")

pickle_model_path = categorization_pipeline.save(reduce_weight=False)
categorization_pipeline_loaded = CustomCategorizationAI.load_model(pickle_model_path)
bento, path_to_bento = categorization_pipeline.save_bento(output_dir=project.model_folder)
bento_name = bento.tag.name + ':' + bento.tag.version

What’s next?