Develop and save a Context-Aware File Splitting AI
Prerequisites:
Data Layer concepts of Konfuzio: Project, Category, Span, Document, Page
AI concepts of Konfuzio: File Splitting
Difficulty: Hard
Goal: Guide the user through the steps of constructing a Context-Aware File Splitting AI and explain the logic behind it.
Environment
You need to install the Konfuzio SDK before diving into the tutorial.
To get up and running quickly, you can use our Colab Quick Start notebook.
As an alternative, you can follow the installation section to install and initialize the Konfuzio SDK locally or in an environment of your choice.
Introduction
It’s common for multi-page files not to be perfectly organized, and in some cases a single file may contain multiple independent Documents. To ensure that these Documents are properly processed and separated, we will discuss a method for identifying and splitting them into individual, independent Sub-Documents that does not require any ML-based approach.
Multi-file Document Example
The Konfuzio SDK offers two ways to separate Documents that may be included in a single file. The first is to train an instance of the Textual File Splitting Model, which predicts whether a Page is a first Page or not, and to run the Splitting AI with it. The second is the context-aware file splitting logic provided by the Context Aware File Splitting Model. This approach analyzes the contents of each Page and identifies similarities to the first Pages of the Documents seen during training, which allows us to define splitting points and divide the Document into multiple Sub-Documents. It’s important to note that this approach is only effective for Documents written in the same language, and that the process must be repeated for each Category.
In this tutorial, we will walk you through the process of setting up the ContextAwareFileSplittingModel class, which can be found in the code block at the bottom of this page. This class is already implemented and can be imported using from konfuzio_sdk.trainer.file_splitting import ContextAwareFileSplittingModel.
Imports and initializing the class
Any custom File Splitting AI (derived from the AbstractFileSplittingModel class) requires the following methods to be implemented:
__init__ to initialize key variables required by the custom AI;
fit to define the architecture and training that the model undergoes, i.e. a certain NN architecture or custom hardcoded logic;
predict to define how the model classifies Pages as first or non-first. NB: the classification needs to be run on the Page level, not the Document level – the result of classification is reflected in the is_first_page attribute value, which is unique to the Page class and is not present in the Document class. Pages with is_first_page = True become splitting points; thus, each new Sub-Document has a Page predicted as first as its starting point.
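To make the required interface concrete, here is a toy sketch of that contract. Note that FileSplittingContract, DummyPage, and AlwaysFirstModel are invented stand-ins for illustration, not the real SDK classes:

```python
from abc import ABC, abstractmethod


# Simplified sketch of the contract; the real AbstractFileSplittingModel in
# konfuzio_sdk.trainer.file_splitting has more attributes and helpers.
class FileSplittingContract(ABC):
    def __init__(self, categories):
        self.categories = categories

    @abstractmethod
    def fit(self, *args, **kwargs):
        """Gather or train whatever the model needs for prediction."""

    @abstractmethod
    def predict(self, page):
        """Set page.is_first_page and return the Page."""


# Minimal stand-in for konfuzio_sdk.data.Page, invented for this sketch.
class DummyPage:
    def __init__(self, text):
        self.text = text
        self.is_first_page = False


class AlwaysFirstModel(FileSplittingContract):
    def fit(self, *args, **kwargs):
        pass  # nothing to learn for this toy rule

    def predict(self, page):
        page.is_first_page = True  # toy rule: every Page opens a Sub-Document
        return page


model = AlwaysFirstModel(categories=[])
print(model.predict(DummyPage('Hi all,')).is_first_page)  # True
```

Any subclass that fails to implement one of the abstract methods cannot be instantiated, which is how the base class enforces the contract.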
To begin, we will make all the necessary imports and initialize the class:
from typing import List

from konfuzio_sdk.data import Page, Category
from konfuzio_sdk.trainer.file_splitting import AbstractFileSplittingModel

class ContextAwareFileSplittingModel(AbstractFileSplittingModel):
    def __init__(self, categories: List[Category], tokenizer, *args, **kwargs):
        super().__init__(categories=categories)
        self.output_dir = self.project.model_folder
        self.tokenizer = tokenizer
        self.requires_text = True
        self.requires_images = False
The class inherits from AbstractFileSplittingModel, so we run super().__init__(categories=categories) to properly inherit its attributes. The tokenizer attribute will be used to process the text within the Document, separating it into Spans. This ensures that the text in all the Documents is split using the same logic (in particular, tokenization by separating on \n whitespaces by ConnectedTextTokenizer, which is used in the example at the end of the page), so that it is possible to find common Spans. The Tokenizer is applied to the training and testing Documents as well as to any Document that undergoes splitting. It’s important to note that if you run fitting with one Tokenizer and then reassign it within the same instance of the model, all previously gathered strings will be deleted and replaced by new ones. requires_images and requires_text determine whether these types of data are used for prediction; this is needed for distinguishing between preprocessing types once a model is passed into the Splitting AI.
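The requires_text and requires_images flags matter because the Splitting AI inspects them to decide which inputs to prepare for the model. A rough sketch of that dispatch, with invented names (prepare_inputs, TextOnlyModel) standing in for the SDK internals:

```python
# Hypothetical dispatcher illustrating how a Splitting AI could use the flags;
# the names and structure here are invented, not the SDK's actual internals.
def prepare_inputs(model, page_text, page_image):
    inputs = {}
    if getattr(model, 'requires_text', False):
        inputs['text'] = page_text    # text-based models receive the Page text
    if getattr(model, 'requires_images', False):
        inputs['image'] = page_image  # vision-based models receive the Page image
    return inputs


class TextOnlyModel:
    requires_text = True
    requires_images = False


inputs = prepare_inputs(TextOnlyModel(), 'Hi all,', b'<png bytes>')
print(inputs)  # {'text': 'Hi all,'}
```

Since ContextAwareFileSplittingModel sets requires_text = True and requires_images = False, only textual preprocessing would be triggered for it.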
ConnectedTextTokenizer explained
Here is an example of how ConnectedTextTokenizer works. At first, we have a Document with the untokenized text:
# before tokenization
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)
print(test_document.text)
Hi all,
I like bread.
I hope to get everything done soon.
Morning,
I'm glad to see you.
Morning,
If we print this Document’s Spans, we will see there are none.
test_document.spans()
[]
Let’s tokenize the Document and check the Spans after that:
test_document = tokenizer.tokenize(test_document)
test_document.spans()
[Span (0, 7): "Hi all,",
Span (8, 21): "I like bread.",
Span (22, 58): "
I hope to get[...]",
Span (59, 68): "
Morning,",
Span (69, 90): "
I'm glad to s[...]",
Span (91, 100): "
Morning,"]
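The exact whitespace handling above is specific to ConnectedTextTokenizer. As a rough approximation (a sketch, not the SDK implementation; line_spans is an invented helper), a regex can find the character offsets of each non-empty line, which is essentially what the Span offsets above encode:

```python
import re


def line_spans(text):
    """Rough stand-in for line-wise tokenization: return (start, end, string)
    for every non-empty line. The real ConnectedTextTokenizer additionally
    keeps surrounding newlines in its Spans, as seen in the output above."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r'[^\n]+', text)]


text = "Hi all,\nI like bread.\n\nI hope to get everything done soon."
for start, end, s in line_spans(text):
    print((start, end), repr(s))
# (0, 7) 'Hi all,'
# (8, 21) 'I like bread.'
# (23, 58) 'I hope to get everything done soon.'
```

The important property for this tutorial is that identical lines in different Documents yield identical offset strings, which is what makes it possible to compare Spans across Documents.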
Creating necessary methods of the class
The next method to define is the fit() method. For each Category, we call the exclusive_first_page_strings method, which allows us to gather the strings that appear on the first Page of each Document. allow_empty_categories allows returning empty lists for Categories that have no exclusive first-page strings across their Documents. Those Categories will not be used in the prediction process.
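Before looking at the actual method, the idea behind exclusive first-page strings can be pictured as a set operation: strings common to all first Pages, minus strings that also appear on later Pages. The sketch below is one plausible reading of that computation, with an invented data structure (a list of Page texts per Document); the SDK's actual implementation may differ in normalization details:

```python
def exclusive_first_page_strings(documents):
    """documents: one list of Page texts per Document (invented structure).
    Return strings common to all first Pages and absent from later Pages."""
    first_page_sets = []
    non_first = set()
    for pages in documents:
        for i, page_text in enumerate(pages):
            lines = {line for line in page_text.split('\n') if line}
            if i == 0:
                first_page_sets.append(lines)
            else:
                non_first.update(lines)
    # Strings on every first Page that never occur on a non-first Page.
    return set.intersection(*first_page_sets) - non_first


docs = [
    ['ACME Corp Invoice\nDate: 01.01', 'Total: 100\nDate: 01.01'],
    ['ACME Corp Invoice\nDate: 02.01', 'Total: 200'],
]
print(exclusive_first_page_strings(docs))  # {'ACME Corp Invoice'}
```

Here 'Date: 01.01' is excluded because it also occurs on a non-first Page, and the dates are excluded because they differ between first Pages; only the recurring header survives.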
def fit(self, allow_empty_categories: bool = False, *args, **kwargs):
    for category in self.categories:
        cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
        if not cur_first_page_strings:
            if allow_empty_categories:
                logger.warning(
                    f'No exclusive first-page strings were found for {category}, so it will not be used '
                    f'at prediction.'
                )
            else:
                raise ValueError(f'No exclusive first-page strings were found for {category}.')
Then, we define the predict() method. The method accepts a Page as input and checks whether its set of Spans contains the first-page strings of any of the Categories. If there is at least one intersection, the Page is predicted to be a first Page; if there are no intersections, it is predicted to be a non-first Page.
def predict(self, page: Page) -> Page:
    self.check_is_ready()
    page.is_first_page = False
    for category in self.categories:
        cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
        intersection = {span.offset_string.strip('\f').strip('\n') for span in page.spans()}.intersection(
            cur_first_page_strings
        )
        if len(intersection) > 0:
            page.is_first_page = True
            break
    page.is_first_page_confidence = 1
    return page
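The core of the method is a plain set intersection after normalizing away form-feed and newline characters. A self-contained illustration of that step, on invented sample strings:

```python
# Invented sample data: exclusive first-page strings learned for one Category.
first_page_strings = {'ACME Corp Invoice', 'Customer No.'}

# Raw offset strings of a Page's Spans; note the stray '\n' and '\f'.
page_strings = ['\nACME Corp Invoice', 'Total: 100\f']

# Normalize the same way predict() does: strip form feeds and newlines.
normalized = {s.strip('\f').strip('\n') for s in page_strings}

intersection = normalized.intersection(first_page_strings)
print(intersection)           # {'ACME Corp Invoice'}
print(len(intersection) > 0)  # True -> the Page would be marked as first
```

Without the stripping step, '\nACME Corp Invoice' would not match the learned string, so the normalization is what makes Spans comparable across Documents.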
Lastly, a check_is_ready() method is defined. This method ensures that the model is ready for prediction: it checks that the Tokenizer and a set of Categories are defined, and that at least one of the Categories has exclusive first-page strings.
def check_is_ready(self):
    if self.tokenizer is None:
        raise AttributeError(f'{self} missing Tokenizer.')
    if not self.categories:
        raise AttributeError(f'{self} requires Categories.')
    empty_first_page_strings = [
        category
        for category in self.categories
        if not category.exclusive_first_page_strings(tokenizer=self.tokenizer)
    ]
    if len(empty_first_page_strings) == len(self.categories):
        raise ValueError(
            f"Cannot run prediction as none of the Categories in {self.project} have "
            f"_exclusive_first_page_strings."
        )
Conclusion
In this tutorial, we have walked through the essential steps for constructing the Context-Aware File Splitting Model. Below is the full code of the class:
import logging
from typing import List

from konfuzio_sdk.data import Page, Category
from konfuzio_sdk.trainer.file_splitting import AbstractFileSplittingModel

logger = logging.getLogger(__name__)


class ContextAwareFileSplittingModel(AbstractFileSplittingModel):
    def __init__(self, categories: List[Category], tokenizer, *args, **kwargs):
        super().__init__(categories=categories)
        self.output_dir = self.project.model_folder
        self.tokenizer = tokenizer
        self.requires_text = True
        self.requires_images = False

    def fit(self, allow_empty_categories: bool = False, *args, **kwargs):
        for category in self.categories:
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            if not cur_first_page_strings:
                if allow_empty_categories:
                    logger.warning(
                        f'No exclusive first-page strings were found for {category}, so it will not be used '
                        f'at prediction.'
                    )
                else:
                    raise ValueError(f'No exclusive first-page strings were found for {category}.')

    def predict(self, page: Page) -> Page:
        self.check_is_ready()
        page.is_first_page = False
        for category in self.categories:
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            intersection = {span.offset_string.strip('\f').strip('\n') for span in page.spans()}.intersection(
                cur_first_page_strings
            )
            if len(intersection) > 0:
                page.is_first_page = True
                break
        page.is_first_page_confidence = 1
        return page

    def check_is_ready(self):
        if self.tokenizer is None:
            raise AttributeError(f'{self} missing Tokenizer.')
        if not self.categories:
            raise AttributeError(f'{self} requires Categories.')
        empty_first_page_strings = [
            category
            for category in self.categories
            if not category.exclusive_first_page_strings(tokenizer=self.tokenizer)
        ]
        if len(empty_first_page_strings) == len(self.categories):
            raise ValueError(
                f"Cannot run prediction as none of the Categories in {self.project} have "
                f"_exclusive_first_page_strings."
            )