{ "cells": [ { "cell_type": "markdown", "id": "565ff9e8", "metadata": {}, "source": [ "## Train a Context-Aware File Splitting AI\n", "\n", "---\n", "\n", "**Prerequisites:**\n", "\n", "- Data Layer concepts of Konfuzio: Project, Category, Page, Document, Span\n", "- AI concepts of Konfuzio: File Splitting\n", "\n", "**Difficulty:** Medium\n", "\n", "**Goal:** Learn how to train a Context-Aware File Splitting AI and use it to split Documents.\n", "\n", "---\n", "\n", "### Environment\n", "You need to install the Konfuzio SDK before diving into the tutorial. \\\n", "To get up and running quickly, you can use our Colab Quick Start notebook.\n", "\n", "As an alternative, you can follow the [installation section](../get_started.html#install-sdk) to install and initialize the Konfuzio SDK locally or on an environment of your choice.\n", "\n", "### Introduction\n", "\n", "Konfuzio SDK offers several approaches for automatically splitting a multi-Document file into several Documents. One of them is the Context-Aware File Splitting AI, which uses a rule-based, context-aware logic: it looks for strings that are common to the first Pages of all Documents in a Category. To predict whether a Page is a potential splitting point (that is, whether it is a \n", "first Page), we compare the Page's contents to these common first-page strings; if at least one \n", "such string occurs, we mark the Page as first, which makes it a splitting point.\n", "\n", "#### Initialize and train Context-Aware File Splitting AI\n", "\n", "In this tutorial we will use the pre-built classes `ContextAwareFileSplittingModel` and `SplittingAI`. Let's start by making the necessary imports, initializing the Project and fetching the test Document."
] }, { "cell_type": "code", "execution_count": null, "id": "5fcae002", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import logging\n", "import os\n", "from konfuzio_sdk.samples import LocalTextProject\n", "logging.getLogger(\"konfuzio_sdk\").setLevel(logging.ERROR)\n", "logging.getLogger(\"tqdm\").setLevel(logging.ERROR)\n", "YOUR_PROJECT_ID = 46\n", "YOUR_DOCUMENT_ID = 44865" ] }, { "cell_type": "code", "execution_count": null, "id": "81a6e1e9", "metadata": { "editable": true, "lines_to_next_cell": 0, "slideshow": { "slide_type": "" }, "tags": [ "remove-output" ], "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "from konfuzio_sdk.data import Page, Category, Project\n", "from konfuzio_sdk.trainer.file_splitting import SplittingAI, ContextAwareFileSplittingModel\n", "from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer\n", "\n", "project = Project(id_=YOUR_PROJECT_ID)\n", "test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)" ] }, { "cell_type": "code", "execution_count": null, "id": "d0644859", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "from copy import deepcopy\n", "\n", "project = LocalTextProject()\n", "test_document = ConnectedTextTokenizer().tokenize(deepcopy(project.get_document_by_id(9)))\n", "project.categories = [project.get_category_by_id(3), project.get_category_by_id(4)]" ] }, { "cell_type": "markdown", "id": "67b24c85", "metadata": {}, "source": [ "Then, initialize a Context-Aware File Splitting Model and \"fit\" it on the Project's Categories. The tokenizer is needed to split the texts of the Documents in the Categories into the groups among which the algorithm will search for intersections.\n", "\n", "The `allow_empty_categories` parameter permits Categories whose Documents are so diverse that no intersections were found for them."
] }, { "cell_type": "code", "execution_count": null, "id": "9e796da1", "metadata": { "editable": true, "slideshow": { "slide_type": "" } }, "outputs": [], "source": [ "file_splitting_model = ContextAwareFileSplittingModel(\n", "    categories=project.categories, tokenizer=ConnectedTextTokenizer()\n", ")\n", "\n", "file_splitting_model.fit(allow_empty_categories=True)" ] }, { "cell_type": "markdown", "id": "2f797390", "metadata": {}, "source": [ "Save the model:" ] }, { "cell_type": "code", "execution_count": null, "id": "3a03c11b", "metadata": { "editable": true, "slideshow": { "slide_type": "" } }, "outputs": [], "source": [ "save_path = file_splitting_model.save(include_konfuzio=True)" ] }, { "cell_type": "markdown", "id": "69d9bd8d", "metadata": {}, "source": [ "Run the prediction to ensure it is able to predict the split points (first Pages) correctly:" ] }, { "cell_type": "code", "execution_count": null, "id": "55940601", "metadata": { "editable": true, "slideshow": { "slide_type": "" } }, "outputs": [], "source": [ "for page in test_document.pages():\n", "    pred = file_splitting_model.predict(page)\n", "    if pred.is_first_page:\n", "        print(\n", "            'Page {} is predicted as the first. Confidence: {}.'.format(page.number, page.is_first_page_confidence)\n", "        )\n", "    else:\n", "        print('Page {} is predicted as the non-first.'.format(page.number))" ] }, { "cell_type": "markdown", "id": "638ad1af", "metadata": {}, "source": [ "#### Use the model with the Splitting AI\n", "\n", "Splitting AI is a higher-level interface to the Context-Aware File Splitting Model and to any other models that can be developed for File Splitting purposes. It takes a Document as input, rather than individual Pages: it runs the page-level prediction of possible split points and returns a Document or Documents, with changes that depend on the prediction mode.\n", "\n", "You can load a pre-saved model or pass an initialized instance as the input. In this example, we load a previously saved one."
] }, { "cell_type": "code", "execution_count": null, "id": "b2ad9709", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "model = ContextAwareFileSplittingModel.load_model(save_path)\n", "\n", "splitting_ai = SplittingAI(model)" ] }, { "cell_type": "markdown", "id": "1c7bf9ff", "metadata": {}, "source": [ "Splitting AI can be run in two modes: it either returns a list of Sub-Documents that result from splitting the input Document, or returns a copy of the input Document in which the Pages predicted as first are marked with the attribute `is_first_page`. The flag `return_pages` has to be `True` for the latter; we use this mode in the example below." ] }, { "cell_type": "code", "execution_count": null, "id": "30dcc747", "metadata": { "editable": true, "slideshow": { "slide_type": "" } }, "outputs": [], "source": [ "new_document = splitting_ai.propose_split_documents(test_document, return_pages=True)\n", "\n", "for page in new_document[0].pages():\n", "    if page.is_first_page:\n", "        print(\n", "            'Page {} is predicted as the first. Confidence: {}.'.format(page.number, page.is_first_page_confidence)\n", "        )\n", "    else:\n", "        print('Page {} is predicted as the non-first.'.format(page.number))" ] }, { "cell_type": "markdown", "id": "53fb3ec2", "metadata": {}, "source": [ "### Conclusion\n", "\n", "In this tutorial, we have walked through the essential steps for training and using the Context-Aware File Splitting Model. 
Below is the full code to accomplish this task:" ] }, { "cell_type": "code", "execution_count": null, "id": "b56b6c88", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "from konfuzio_sdk.data import Page, Category, Project\n", "from konfuzio_sdk.trainer.file_splitting import SplittingAI, ContextAwareFileSplittingModel\n", "from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer\n", "\n", "project = Project(id_=YOUR_PROJECT_ID)\n", "test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)\n", "file_splitting_model = ContextAwareFileSplittingModel(\n", "    categories=project.categories, tokenizer=ConnectedTextTokenizer()\n", ")\n", "file_splitting_model.fit(allow_empty_categories=True)\n", "save_path = file_splitting_model.save(include_konfuzio=True)\n", "for page in test_document.pages():\n", "    pred = file_splitting_model.predict(page)\n", "    if pred.is_first_page:\n", "        print(\n", "            'Page {} is predicted as the first. Confidence: {}.'.format(page.number, page.is_first_page_confidence)\n", "        )\n", "    else:\n", "        print('Page {} is predicted as the non-first.'.format(page.number))\n", "model = ContextAwareFileSplittingModel.load_model(save_path)\n", "splitting_ai = SplittingAI(model)\n", "new_document = splitting_ai.propose_split_documents(test_document, return_pages=True)\n", "for page in new_document[0].pages():\n", "    if page.is_first_page:\n", "        print(\n", "            'Page {} is predicted as the first. Confidence: {}.'.format(page.number, page.is_first_page_confidence)\n", "        )\n", "    else:\n", "        print('Page {} is predicted as the non-first.'.format(page.number))" ] }, { "cell_type": "code", "execution_count": null, "id": "6341431c", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "os.remove(save_path)" ] }, { "cell_type": "markdown", "id": "7b406ed6", "metadata": {}, "source": [ "### What's next?\n", "\n", "- [Learn how to create a custom File Splitting AI](https://dev.konfuzio.com/sdk/tutorials/create-custom-splitting-ai/index.html)\n", "- [Find out how to evaluate a File Splitting AI's performance](https://dev.konfuzio.com/sdk/tutorials/file-splitting-evaluation/index.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }