{ "cells": [ { "cell_type": "markdown", "id": "e69a26a4", "metadata": { "lines_to_next_cell": 0 }, "source": [ "## Categorize a Document using Categorization AI\n", "\n", "---\n", "\n", "**Prerequisites:**\n", "- Data Layer concepts of Konfuzio SDK: Document, Category, Project, Page\n", "- AI concepts of Konfuzio SDK: Extraction\n", "- Understanding of ML concepts: train-validation loop, optimizer, epochs\n", "\n", "**Difficulty:** Medium\n", "\n", "**Goal:** Learn how to categorize a Document using one of the Categorization AIs pre-built by Konfuzio\n", "\n", "---\n", "\n", "### Environment\n", "You need to install the Konfuzio SDK before diving into the tutorial. \\\n", "To get up and running quickly, you can use our Colab Quick Start notebook.\n", "\n", "As an alternative, you can follow the [installation section](../get_started.html#install-sdk) to install and initialize the Konfuzio SDK locally or in an environment of your choice.\n", "\n", "### Introduction\n", "\n", "To categorize a Document with a Categorization AI built by Konfuzio, there are two main options: the Name-based Categorization AI and the more complex Model-based Categorization AI.\n", "\n", "### Name-based Categorization AI\n", "\n", "The name-based Categorization AI is a simple rule that checks whether the name of a Category appears in the Document. It can be used to categorize Documents when no model-based Categorization AI is available.\n", "\n", "Let's begin by importing the required classes, initializing the Categorization model, and loading the Document to categorize." 
] }, { "cell_type": "code", "execution_count": null, "id": "ad86c9dc", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-cell" ], "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "import logging\n", "import os\n", "import konfuzio_sdk\n", "from konfuzio_sdk.api import get_project_list\n", "from konfuzio_sdk.data import Project\n", "from tests.variables import TEST_SNAPSHOT_ID\n", "\n", "logging.getLogger(\"konfuzio_sdk\").setLevel(logging.ERROR)\n", "projects = get_project_list()\n", "# we want to get the last instance of a project restored from a snapshot because creating a new one each time takes longer \n", "YOUR_PROJECT_ID = next(project['id'] for project in reversed(projects) if TEST_SNAPSHOT_ID in project['name'])\n", "project = Project(id_=YOUR_PROJECT_ID)\n", "YOUR_CATEGORY_ID = project.get_category_by_name('Lohnabrechnung').id_\n", "original_document_text = Project(id_=46).get_document_by_id(44823).text\n", "YOUR_DOCUMENT_ID = [document for document in project.documents if document.text == original_document_text][0].id_" ] }, { "cell_type": "code", "execution_count": null, "id": "84b259d0", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-output" ], "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project\n", "from konfuzio_sdk.trainer.document_categorization import NameBasedCategorizationAI\n", "\n", "project = Project(id_=YOUR_PROJECT_ID)\n", "categorization_model = NameBasedCategorizationAI(project.categories)\n", "test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)" ] }, { "cell_type": "markdown", "id": "e5e5702c", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Then, we categorize the Document. 
The Categorization Model returns a copy of the SDK Document with the Category attribute set (use `inplace=True` to categorize the original Document in place instead).\n", "If the input Document is already categorized, the existing Category is kept (use `recategorize=True` to force a recategorization). Each Page is categorized individually." ] }, { "cell_type": "code", "execution_count": null, "id": "8d4ac69b", "metadata": {}, "outputs": [], "source": [ "result_doc = categorization_model.categorize(document=test_document)\n", "\n", "for page in result_doc.pages():\n", "    assert page.category == project.categories[0]\n", "    print(f\"Found category {page.category} for {page}\")" ] }, { "cell_type": "markdown", "id": "87440bf2", "metadata": { "lines_to_next_cell": 0 }, "source": [ "The Document-level Category is set only when all Pages share the same Category. If the Document contains mixed Categories, only the Page-level Categories are set, and the Document-level Category is NO_CATEGORY." ] }, { "cell_type": "code", "execution_count": null, "id": "814aa175", "metadata": {}, "outputs": [], "source": [ "print(f\"Found category {result_doc.category} for {result_doc}\")" ] }, { "cell_type": "markdown", "id": "953825c5", "metadata": {}, "source": [ "### Model-based Categorization AI\n", "\n", "For better results, you can build, train, and test a Categorization AI using Image Models and Text Models to classify the image and text of each Page.\n", "\n", "Let's start by importing the required classes and initializing the Project." 
] }, { "cell_type": "code", "execution_count": null, "id": "384d3a66", "metadata": { "editable": true, "lines_to_next_cell": 0, "slideshow": { "slide_type": "" } }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project, Document\n", "from konfuzio_sdk.trainer.document_categorization import build_categorization_ai_pipeline\n", "from konfuzio_sdk.trainer.document_categorization import ImageModel, TextModel, CategorizationAI\n", "\n", "project = Project(id_=YOUR_PROJECT_ID)" ] }, { "cell_type": "code", "execution_count": null, "id": "48f05465", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "logging.getLogger(\"konfuzio_sdk\").setLevel(logging.CRITICAL)\n", "logging.getLogger(\"timm\").setLevel(logging.CRITICAL)\n", "for doc in project.documents + project.test_documents:\n", " doc.get_images()\n", "for document in project.documents[3:] + project.test_documents[1:]:\n", " document.dataset_status = 4\n", "original_document_text = Project(id_=46).get_document_by_id(44864).text\n", "cur_document_id = [document for document in project._documents if document.text == original_document_text][0].id_\n", "project.get_document_by_id(cur_document_id).dataset_status = 4" ] }, { "cell_type": "markdown", "id": "17298e06", "metadata": {}, "source": [ "Build the Categorization AI architecture using a template of pre-built Image and Text classification Models. In this tutorial, we use `EfficientNetB0` and `NBOWSelfAttention` together." 
] }, { "cell_type": "code", "execution_count": null, "id": "673461f2", "metadata": { "editable": true, "slideshow": { "slide_type": "" } }, "outputs": [], "source": [ "categorization_pipeline = build_categorization_ai_pipeline(\n", "    categories=project.categories,\n", "    documents=project.documents,\n", "    test_documents=project.test_documents,\n", "    image_model=ImageModel.EfficientNetB0,\n", "    text_model=TextModel.NBOWSelfAttention,\n", ")" ] }, { "cell_type": "markdown", "id": "310a6f22", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Train and evaluate the AI. You can specify training parameters, for example, the number of epochs and the optimizer." ] }, { "cell_type": "code", "execution_count": null, "id": "4a01b591", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "categorization_pipeline.fit(n_epochs=1, optimizer={'name': 'Adam'})\n", "# evaluate on the training Documents, then on the test Documents\n", "data_quality = categorization_pipeline.evaluate(use_training_docs=True)\n", "ai_quality = categorization_pipeline.evaluate()\n", "assert data_quality.f1(None) == 1.0\n", "assert ai_quality.f1(None) == 1.0" ] }, { "cell_type": "markdown", "id": "3d11f727", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Categorize a Document using the newly trained model." ] }, { "cell_type": "code", "execution_count": null, "id": "4b8dc832", "metadata": {}, "outputs": [], "source": [ "document = project.get_document_by_id(YOUR_DOCUMENT_ID)\n", "categorization_result = categorization_pipeline.categorize(document=document)\n", "assert isinstance(categorization_result, Document)\n", "for page in categorization_result.pages():\n", "    print(f\"Found category {page.category} for {page}\")" ] }, { "cell_type": "markdown", "id": "a99accdc", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Save the model and verify that it can be loaded again. This confirms that it can be uploaded to the Konfuzio app or to an on-prem installation." 
] }, { "cell_type": "code", "execution_count": null, "id": "3d983253", "metadata": {}, "outputs": [], "source": [ "pickle_ai_path = categorization_pipeline.save()\n", "categorization_pipeline = CategorizationAI.load_model(pickle_ai_path)" ] }, { "cell_type": "code", "execution_count": null, "id": "1cdae91e", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "os.remove(pickle_ai_path)" ] }, { "cell_type": "markdown", "id": "fb3eee3a", "metadata": {}, "source": [ "To prepare the data for training and testing your AI, follow the [data preparation tutorial](https://dev.konfuzio.com/sdk/tutorials/data-preparation/index.html).\n", "\n", "For a list of available Models, see [Categorization Models](#categorization-ai-models) below.\n", "\n", "### Categorization AI Models\n", "\n", "When using `build_categorization_ai_pipeline`, you can select which Image Model and/or Text Model to use for \n", "classification. At least one of the Image Model and the Text Model must be specified. 
Both can also be used \n", "at the same time.\n", "\n", "The list of available Categorization Models is implemented as an Enum containing the following elements:" ] }, { "cell_type": "code", "execution_count": null, "id": "d152be09", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-output" ] }, "outputs": [], "source": [ "from konfuzio_sdk.trainer.document_categorization import ImageModel, TextModel\n", "\n", "# Image Models\n", "ImageModel.VGG11\n", "ImageModel.VGG13\n", "ImageModel.VGG16\n", "ImageModel.VGG19\n", "ImageModel.EfficientNetB0\n", "ImageModel.EfficientNetB1\n", "ImageModel.EfficientNetB2\n", "ImageModel.EfficientNetB3\n", "ImageModel.EfficientNetB4\n", "ImageModel.EfficientNetB5\n", "ImageModel.EfficientNetB6\n", "ImageModel.EfficientNetB7\n", "ImageModel.EfficientNetB8\n", "\n", "# Text Models\n", "TextModel.NBOW\n", "TextModel.NBOWSelfAttention\n", "TextModel.LSTM\n", "TextModel.BERT" ] }, { "cell_type": "markdown", "id": "f5c353e3", "metadata": {}, "source": [ "See more details about these Categorization Models under [API Reference - Categorization AI](https://dev.konfuzio.com/sdk/sourcecode.html#categorization-ai).\n", "\n", "### Possible configurations\n", "\n", "The following configurations of Categorization AI are tested. The tokenizer and the text and image processors can be specified \n", "when building the Categorization pipeline locally; the text and image processors can be specified when building the pipeline \n", "either locally or in the app or an on-prem installation.\n", "\n", "Each line stands for a single configuration. 
If a field is None, you must explicitly set it to None; otherwise, the \n", "default value is applied.\n", "\n", "You can find more information on how to use these configurations, what the default values are, and where to specify\n", "them [here](https://help.konfuzio.com/modules/projects/index.html?highlight=efficientnet#categorization-ai-parameters).\n", "\n", "| Tokenizer | Text model class | Text model name | Image model class | Image model name |\n", "|-------------------------|-------------------|--------------------------|-------------------|-------------------|\n", "| WhitespaceTokenizer | NBOWSelfAttention | `nbowselfattention` | EfficientNet | `efficientnet_b0` |\n", "| WhitespaceTokenizer | NBOWSelfAttention | `nbowselfattention` | EfficientNet | `efficientnet_b3` |\n", "| WhitespaceTokenizer | NBOW | `nbow` | VGG | `vgg11` |\n", "| WhitespaceTokenizer | LSTM | `lstm` | VGG | `vgg13` |\n", "| ConnectedTextTokenizer | NBOW | `nbow` | VGG | `vgg11` |\n", "| ConnectedTextTokenizer | LSTM | `lstm` | VGG | `vgg13` |\n", "| None | None | None | EfficientNet | `efficientnet_b0` |\n", "| None | None | None | EfficientNet | `efficientnet_b3` |\n", "| None | None | None | VGG | `vgg11` |\n", "| None | None | None | VGG | `vgg13` |\n", "| None | None | None | VGG | `vgg16` |\n", "| None | None | None | VGG | `vgg19` |\n", "| TransformersTokenizer | *BERT* * | `bert-base-german-cased` | None | None |\n", "\n", "***Note**: In this table, we list a single BERT-based model (`bert-base-german-cased`). 
The following table lists the\n", "possible values for text model versions that can be passed as the `name` argument when configuring a BERT model for \n", "Categorization.\n", "\n", "### Models compatible with BERT class\n", "\n", "| Name | Embeddings dimension | Language | Number of parameters |\n", "|------------------------------------------------|----------------------|----------|----------------------|\n", "| `bert-base-uncased` | 768 | English | 110 million |\n", "| `distilbert-base-uncased` | 768 | English | 66 million |\n", "| `google/mobilebert-uncased` | 512 | English | 25 million |\n", "| `albert-base-v2` | 768 | English | 12 million |\n", "| `german-nlp-group/electra-base-german-uncased` | 768 | German | 111 million |\n", "| `bert-base-german-cased` | 768 | German | 110 million |\n", "| `dbmdz/bert-base-german-uncased` | 768 | German | 110 million |\n", "| `distilbert-base-german-cased` | 768 | German | 66 million | \n", "| `bert-base-multilingual-cased` | 768 | Multiple | 110 million |\n", "\n", "**Note:** This list is not exhaustive. We only list the models that are fully tested. However, you can use the \n", "[Huggingface hub](https://huggingface.co/models) to find other models that best suit your needs. To check whether a model is \n", "compatible with Categorization AI, initialize it with the SDK's `TransformersTokenizer` class as shown in the example\n", "below, replacing the value of `name` with the name of your model of choice. If the model is compatible, the initialization \n", "succeeds; otherwise, an incompatibility error is raised." 
] }, { "cell_type": "code", "execution_count": null, "id": "6ea98b7d", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "from konfuzio_sdk.trainer.tokenization import TransformersTokenizer\n", "\n", "tokenizer = TransformersTokenizer(name='bert-base-chinese')" ] }, { "cell_type": "markdown", "id": "303ef73d", "metadata": {}, "source": [ "### Configurable parameters\n", "\n", "Each group of models or configuration has parameters that you can set manually. This section describes \n", "which parameters are configurable and which models accept them.\n", "\n", "Some parameters are accepted for training regardless of the model. \n", "\n", "- `n_epochs` - number of times the entire training dataset is passed through the model during training. BERT models\n", "typically need fewer epochs, such as 3-5; other models may need more, such as 20 or above. Default value is 20.\n", "- `patience` - number of epochs to wait before early stopping if the model's performance on the validation set does \n", "not improve. Default value is 3.\n", "- `optimizer` - algorithm used to update the model's parameters during training. Default value is `AdamW` with a learning\n", "rate of `1e-4`.\n", "- `lr_decay` - rate at which the learning rate is reduced over time during training to improve training\n", "efficiency. Default value is 0.999.\n", "\n", "Other parameters are configurable only for some of the models and might not have a unified default value.\n", "\n", "- `input_dim` - dimensionality of the input data, which represents the number of features or variables in the input.\n", "- `dropout_rate` - fraction of the input units to randomly set to 0 during training to prevent overfitting.\n", "- `emb_dim` - dimensionality of the embeddings (vector representation). 
Default value is 64.\n", "- `n_heads` - number of attention heads in multi-head attention mechanisms which enable the model to attend to different\n", "parts of the input simultaneously. Note that `n_heads` must be a factor of `emb_dim`, i.e. `emb_dim % n_heads == 0`.\n", "- `hid_dim` - dimensionality of the hidden states in the model. Default value is 256.\n", "- `n_layers` - number of layers in the model. Default value is 2.\n", "- `bidirectional` - whether to use bidirectional processing in LSTM, enabling the model to consider both past and future\n", "context. Default value is True.\n", "- `name` - a name or identifier for the model.\n", "- `freeze` - whether to freeze the weights of certain layers or parameters during training, preventing them from being \n", "updated.\n", "\n", "| Model | `input_dim` | `dropout_rate` | `emb_dim` | `n_heads` | `hid_dim` | `n_layers` | `bidirectional` | `name` | `freeze` |\n", "|-------------------|-------------|----------------|-----------|-----------|-----------|------------|-----------------|--------|----------|\n", "| NBOW | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |\n", "| NBOWSelfAttention | ✔ | ✘ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ |\n", "| LSTM | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✘ | ✘ |\n", "| BERT | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✔ |\n", "| VGG | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✔ |\n", "| EfficientNet | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✔ |" ] }, { "cell_type": "markdown", "id": "6fe53b08", "metadata": {}, "source": [ "### Conclusion\n", "\n", "In this tutorial, we presented two different ways to categorize a Document using AIs constructed by Konfuzio and provided possible configurations that can be used in model-based Categorization.\n", "\n", "### What's next?\n", "\n", "- [Create your own custom Categorization AI](https://dev.konfuzio.com/sdk/tutorials/create-custom-categorization-ai/index.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" } }, "nbformat": 4, 
"nbformat_minor": 5 }