File Splitting¶
PDFs often encapsulate multiple distinct Documents within a single file, making navigation and information retrieval difficult. Document splitting tackles this by separating such intertwined files into individual Documents. This guide introduces the tools and models that automate this process, streamlining your work with multi-Document PDFs. Note that File Splitting always happens at the Project level.
Overview¶
You can train your own File Splitting AI on the data from any Project of your choice (data preparation tutorial here). Note that the Pages in all Documents used for training and testing have to be ordered correctly – that is to say, not mixed up in order. The ground-truth first Page of each Document should come first in the file, the ground-truth second Page second, and so on. This is needed because the Splitting AI operates on the idea that the splitting points in a stream of Pages are the starting Pages of each Sub-Document in the stream.
For that purpose, the SDK provides several tools that enable processing Documents consisting of multiple files and proposing how to split them into Sub-Documents:
A Context Aware File Splitting Model uses simple rule-based logic: it scans a Category’s Documents and finds strings exclusive to the first Pages of all Documents within the Category. To predict whether a Page is a potential splitting point (that is, whether it is a first Page), we compare the Page’s contents to these exclusive first-page strings; if at least one such string occurs, we mark the Page as first (meaning it is a splitting point). An instance of the Context Aware File Splitting Model can be used to initially build a File Splitting pipeline and can later be replaced with more complex solutions.
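The rule itself is tiny. Here is a minimal sketch, assuming page is an already tokenized Page and exclusive_first_page_strings is the set of strings gathered for its Category during fitting (both names are placeholders for this illustration):
# illustrative only: the first-page check behind the Context Aware model
page_strings = {span.offset_string.strip('\f').strip('\n') for span in page.spans()}
if page_strings.intersection(exclusive_first_page_strings):
    page.is_first_page = True  # the Page is treated as a splitting point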
A Context Aware File Splitting Model instance can be used with the interface provided by the Splitting AI – this class accepts a whole Document instead of a single Page and either proposes splitting points or splits the original Document.
A Multimodal File Splitting Model takes both the visual and the textual parts of the Pages and processes them independently, via a simplified VGG19 architecture and LegalBERT respectively, passing the resulting outputs together into a Multi-Layer Perceptron. The model’s output is likewise a prediction of a Page being first or non-first.
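To make the fusion idea concrete, here is a minimal PyTorch sketch. It is not the SDK’s actual architecture – the real model’s backbones, layer sizes, and training loop differ – and the dimensions and names below are assumptions for illustration only:
import torch
from torch import nn

class FusionMLPSketch(nn.Module):
    """Toy fusion head: concatenate visual and textual features, classify with an MLP."""

    def __init__(self, image_dim: int = 512, text_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, image_features: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
        # image_features: e.g. pooled output of a simplified VGG19
        # text_features: e.g. pooled output of LegalBERT
        combined = torch.cat([image_features, text_features], dim=-1)
        return torch.sigmoid(self.mlp(combined))  # probability of the Page being a first Page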
For developing a custom File Splitting approach, we provide the abstract class AbstractFileSplittingModel.
Train a Context Aware File Splitting AI¶
Let’s see how to use the konfuzio_sdk package to automatically split a file into several Documents. We will be using the pre-built class SplittingAI and an instance of a trained ContextAwareFileSplittingModel. The latter uses context-aware logic: a rule-based approach that looks for strings shared by the first Pages of all of a Category’s Documents. To predict whether a Page is a potential splitting point (that is, whether it is a first Page), we compare the Page’s contents to these common first-page strings; if at least one such string occurs, we mark the Page as first (meaning it is a splitting point).
from konfuzio_sdk.data import Page, Category, Project
from konfuzio_sdk.trainer.file_splitting import SplittingAI
from konfuzio_sdk.trainer.file_splitting import ContextAwareFileSplittingModel
from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer
# initialize a Project and fetch a test Document of your choice
project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)
# initialize a Context Aware File Splitting Model and fit it
file_splitting_model = ContextAwareFileSplittingModel(
    categories=project.categories, tokenizer=ConnectedTextTokenizer()
)
# for an example run, you can take only a slice of training documents to make fitting faster
file_splitting_model.documents = file_splitting_model.documents[:10]
file_splitting_model.fit(allow_empty_categories=True)
# save the model
save_path = file_splitting_model.save(include_konfuzio=True)
# run the prediction and see its confidence
for page in test_document.pages():
    pred = file_splitting_model.predict(page)
    if pred.is_first_page:
        print(
            'Page {} is predicted as the first. Confidence: {}.'.format(page.number, page.is_first_page_confidence)
        )
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))
# usage with the Splitting AI – you can load a pre-saved model or pass an initialized instance as the input
# in this example, we load a previously saved one
model = ContextAwareFileSplittingModel.load_model(save_path)
# initialize the Splitting AI
splitting_ai = SplittingAI(model)
# The Splitting AI is a higher-level interface to the Context Aware File Splitting Model and any other models that can be
# developed for File Splitting purposes. It takes a Document as an input, rather than individual Pages, because it
# utilizes page-level prediction of possible split points and returns Document or Documents with changes depending
# on the prediction mode.
# Splitting AI can be run in two modes: returning a list of Sub-Documents as the result of the input Document
# splitting or returning a copy of the input Document with Pages predicted as first having an attribute
# "is_first_page". The flag "return_pages" has to be True for the latter; let's use it
new_document = splitting_ai.propose_split_documents(test_document, return_pages=True)
print(new_document)
# output: [predicted_document]
for page in new_document[0].pages():
    if page.is_first_page:
        print(
            'Page {} is predicted as the first. Confidence: {}.'.format(page.number, page.is_first_page_confidence)
        )
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))
After you have trained and saved your custom AI, you can upload it using the steps from the tutorial or using the method upload_ai_model().
For the first option, go to the Superuser AIs and select your locally stored pickle file, setting Model Type to Splitting and status to Training finished, then save the AI. After that, go to the Splitting AIs, choose your AI and select the “Activate Splitting AI” action.
For the second option, you can refer to Upload your AI.
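For the programmatic option, a minimal sketch could look like this; the SDK exposes upload_ai_model in konfuzio_sdk.api, but check the API reference for its exact signature, as the parameter shown here is an assumption:
from konfuzio_sdk.api import upload_ai_model

# assuming save_path is the pickle path returned by file_splitting_model.save()
upload_ai_model(ai_model_path=save_path)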
Train a Multimodal File Splitting AI¶
The above tutorial for the ContextAwareFileSplittingModel can also be used with the MultimodalFileSplittingModel. The only difference is that the MultimodalFileSplittingModel does not need to be initialized with a Tokenizer.
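A minimal sketch of the difference, reusing the Project setup from above (the exact constructor and fit parameters may differ; consult the API reference):
from konfuzio_sdk.trainer.file_splitting import MultimodalFileSplittingModel, SplittingAI

# no Tokenizer is passed, unlike with ContextAwareFileSplittingModel
multimodal_model = MultimodalFileSplittingModel(categories=project.categories)
multimodal_model.documents = multimodal_model.documents[:10]  # optional: fit on a slice for speed
multimodal_model.fit()
save_path = multimodal_model.save(include_konfuzio=True)
splitting_ai = SplittingAI(multimodal_model)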
Develop and save a Context-Aware File Splitting AI¶
If the solutions presented above do not meet your requirements, we also allow the training of custom File Splitting AIs on the data from a Project of your choice. You can then save your trained model and use it with Konfuzio.
Intro¶
It’s common for multi-page files not to be perfectly organized, and in some cases multiple independent Documents are included in a single file. To ensure that these Documents are properly processed and separated, this section describes a method for identifying them and splitting them into individual, independent Sub-Documents.

Multi-file Document Example
Konfuzio SDK offers two ways to separate Documents that may be included in a single file. One is to train an instance of the Multimodal File Splitting Model, which predicts whether a Page is first or non-first, and to run the Splitting AI with it. The Multimodal File Splitting Model is a combined approach whose architecture processes the textual and visual data of the Documents separately (using LegalBERT and a simplified VGG19 architecture, respectively) and then feeds the combined outputs into a Multi-Layer Perceptron. A more detailed scheme of the architecture can be found further below.
Another approach is the context-aware file splitting logic implemented by the Context Aware File Splitting Model. This approach analyzes the contents of each Page and identifies similarities to the first Pages of the Documents, which allows us to define splitting points and divide the Document into multiple Sub-Documents. It’s important to note that this approach is only effective for Documents written in the same language and that the process must be repeated for each Category. In this tutorial, we will explain how to implement the class for this model step by step.
If you are unfamiliar with the SDK’s main concepts (like Page or Span), you can get to know them at Data Layer Concepts.
Quick explanation¶
The first step in implementing this method is “training”: this involves tokenizing the Document by splitting its text into parts, specifically into strings without line breaks. We then gather the exclusive strings from the Spans (the parts of the text in a Page) of the first Pages of each Document in the training data.
Once we have identified these strings, we can use them to determine whether a Page in an input Document is a first Page or not. We do this by going through the strings in the Page and comparing them to the set of strings collected in the training stage. If at least one string from the current Page intersects with that set, we consider the Page a first Page.
Note that the more Documents we use in the training stage, the fewer intersecting strings we are likely to find. If you find that your set of first-page strings is empty, try using a smaller slice of the dataset instead of the whole set. Generally, when used on Documents within the same Category, this algorithm should not return an empty set; if it does, it’s worth checking whether your data is consistent, for example, not in different languages and not containing other Categories.
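To illustrate what “exclusive first-page strings” means, here is a rough sketch of how such a set could be derived for one Category. The SDK provides this via Category.exclusive_first_page_strings; the loop below illustrates the idea rather than the SDK’s internals, and training_documents is a hypothetical list of a Category’s training Documents:
first_page_string_sets = []
non_first_page_strings = set()
for document in training_documents:
    for page in document.pages():
        strings = {span.offset_string.strip('\f').strip('\n') for span in page.spans()}
        if page.number == 1:
            first_page_string_sets.append(strings)
        else:
            non_first_page_strings |= strings
# keep strings shared by all first Pages, dropping any that also occur on other Pages
exclusive_first_page_strings = set.intersection(*first_page_string_sets) - non_first_page_strings
This also explains the note above: every additional Document shrinks the intersection, so very large training sets can end up with an empty result.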
Step-by-step explanation¶
In this section, we will walk you through the process of setting up the ContextAwareFileSplittingModel class; a condensed sketch of it can be found further down this page. The class is already implemented and can be imported using from konfuzio_sdk.trainer.file_splitting import ContextAwareFileSplittingModel.
Note that any custom File Splitting AI (derived from the AbstractFileSplittingModel class) requires the following methods to be implemented:
- __init__ to initialize key variables required by the custom AI;
- fit to define the architecture and training that the model undergoes, i.e. a certain NN architecture or custom hardcoded logic;
- predict to define how the model classifies Pages as first or non-first. NB: the classification needs to be run on the Page level, not the Document level – the result of classification is reflected in the is_first_page attribute value, which is unique to the Page class and is not present in the Document class. Pages with is_first_page = True become splitting points; thus, each new Sub-Document has a Page predicted as first as its starting point.
To begin, we will make all the necessary imports:
from konfuzio_sdk.data import Page, Category, Project
from konfuzio_sdk.trainer.file_splitting import SplittingAI
from konfuzio_sdk.trainer.file_splitting import ContextAwareFileSplittingModel
from konfuzio_sdk.tokenizer.regex import ConnectedTextTokenizer
Then, let’s initialize the ContextAwareFileSplittingModel class. The class inherits from AbstractFileSplittingModel, so we run super().__init__(categories=categories) to properly inherit its attributes. The tokenizer attribute will be used to process the text within the Document, separating it into Spans. This is done to ensure that the text in all Documents is split using the same logic (in particular, splitting on \n line breaks, which is what the ConnectedTextTokenizer used in the example at the end of this page does), so that common Spans can be found. The Tokenizer is applied to training and testing Documents as well as to any Document that will undergo splitting. It’s important to note that if you run fitting with one Tokenizer and then reassign it within the same instance of the model, all previously gathered strings will be deleted and replaced by new ones. requires_images and requires_text determine whether these types of data are used for prediction; this is needed for distinguishing between preprocessing types once a model is passed into the Splitting AI.
An example of how ConnectedTextTokenizer works:
# before tokenization
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)
test_document.text
# output: "Hi all,\nI like bread.\n\fI hope to get everything done soon.\n\fMorning,\n\fI'm glad to see you."
# "\n\fMorning,"
test_document.spans()
# output: []
test_document = tokenizer.tokenize(test_document)
# after tokenization
test_document.spans()
# output: [Span (0, 7), Span (8, 21), Span (22, 58), Span (59, 68), Span (69, 90), Span (91, 100)]
test_document.spans()[0].offset_string
# output: "Hi all,"
The first method to define is the fit() method. For each Category, we call the exclusive_first_page_strings method, which gathers the strings that appear only on the first Page of each Document. allow_empty_categories allows returning empty lists for Categories for which no exclusive first-page strings were found across their Documents. This means that those Categories will not be used in the prediction process.
Next, we define the predict() method. The method accepts a Page as input and checks its Span set for the first-page strings of each Category. If there is at least one intersection, the Page is predicted to be a first Page; if there are no intersections, the Page is predicted to be a non-first Page.
Lastly, a check_is_ready() method is defined. This method ensures that a model is ready for prediction: the checks verify that the Tokenizer and a set of Categories are defined, and that at least one of the Categories has exclusive first-page strings.
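The actual implementation ships with the SDK and can be imported as shown above. The following condensed sketch reconstructs the three methods just described; it is illustrative, and details such as error handling, confidence values, and caching differ in the real class:
from typing import List

from konfuzio_sdk.data import Category, Page
from konfuzio_sdk.trainer.file_splitting import AbstractFileSplittingModel

class ContextAwareFileSplittingModelSketch(AbstractFileSplittingModel):
    """Condensed, illustrative version of ContextAwareFileSplittingModel."""

    def __init__(self, categories: List[Category], tokenizer, *args, **kwargs):
        super().__init__(categories=categories)
        self.tokenizer = tokenizer
        self.requires_text = True
        self.requires_images = False

    def fit(self, allow_empty_categories: bool = False, *args, **kwargs):
        # gather the strings exclusive to first Pages, per Category
        for category in self.categories:
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            if not cur_first_page_strings and not allow_empty_categories:
                raise ValueError(f'No exclusive first-page strings found for {category}.')

    def predict(self, page: Page) -> Page:
        page.is_first_page = False
        for category in self.categories:
            cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
            intersection = {span.offset_string.strip('\f').strip('\n') for span in page.spans()}.intersection(
                cur_first_page_strings
            )
            if len(intersection) > 0:
                page.is_first_page = True
                break
        return page

    def check_is_ready(self) -> bool:
        # a Tokenizer and a set of Categories must be defined, and at least one
        # Category must have exclusive first-page strings
        if self.tokenizer is None or not self.categories:
            return False
        return any(category.exclusive_first_page_strings(tokenizer=self.tokenizer) for category in self.categories)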
A quick example of the class’s usage:
# initialize a Project and fetch a test Document of your choice
project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)
# initialize a Context Aware File Splitting Model and fit it
file_splitting_model = ContextAwareFileSplittingModel(
    categories=project.categories, tokenizer=ConnectedTextTokenizer()
)
# for an example run, you can take only a slice of training documents to make fitting faster
file_splitting_model.documents = file_splitting_model.documents[:10]
file_splitting_model.fit(allow_empty_categories=True)
# save the model
save_path = file_splitting_model.save(include_konfuzio=True)
# run the prediction and see its confidence
for page in test_document.pages():
    pred = file_splitting_model.predict(page)
    if pred.is_first_page:
        print(
            'Page {} is predicted as the first. Confidence: {}.'.format(page.number, page.is_first_page_confidence)
        )
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))
# usage with the Splitting AI – you can load a pre-saved model or pass an initialized instance as the input
# in this example, we load a previously saved one
model = ContextAwareFileSplittingModel.load_model(save_path)
# initialize the Splitting AI
splitting_ai = SplittingAI(model)
# The Splitting AI is a higher-level interface to the Context Aware File Splitting Model and any other models that can be
# developed for File Splitting purposes. It takes a Document as an input, rather than individual Pages, because it
# utilizes page-level prediction of possible split points and returns Document or Documents with changes depending
# on the prediction mode.
# Splitting AI can be run in two modes: returning a list of Sub-Documents as the result of the input Document
# splitting or returning a copy of the input Document with Pages predicted as first having an attribute
# "is_first_page". The flag "return_pages" has to be True for the latter; let's use it
new_document = splitting_ai.propose_split_documents(test_document, return_pages=True)
print(new_document)
# output: [predicted_document]
for page in new_document[0].pages():
    if page.is_first_page:
        print(
            'Page {} is predicted as the first. Confidence: {}.'.format(page.number, page.is_first_page_confidence)
        )
    else:
        print('Page {} is predicted as the non-first.'.format(page.number))
Create a custom File Splitting AI¶
This section explains how to train a custom File Splitting AI locally, how to save it and upload it to the Konfuzio Server. If you run this tutorial in Colab and experience any version compatibility issues when working with the SDK, restart the runtime and initialize the SDK once again; this will resolve the issue.
Note: you don’t necessarily need to create the AI from scratch if you already have some document-processing architecture; you just need to wrap it into a class that corresponds to our File Splitting AI structure. Follow the steps in this tutorial to find out what the requirements for that are.
By default, any File Splitting AI class should derive from the AbstractFileSplittingModel class and implement the following methods:
from konfuzio_sdk.trainer.file_splitting import AbstractFileSplittingModel
from konfuzio_sdk.data import Page, Category
from typing import List
class CustomFileSplittingModel(AbstractFileSplittingModel):
    def __init__(self, categories: List[Category], *args, **kwargs):
        # we need Categories because we define the split points based on the Documents within certain Categories
        super().__init__(categories)
        # initialize key variables required by the custom AI:
        # for instance, self.categories to determine which Categories will be used for training the AI,
        # self.documents and self.test_documents to define training and testing Documents, and self.tokenizer
        # for a Tokenizer that will be used in processing the Documents
        pass

    def fit(self):
        # Define the architecture and training that the model undergoes, i.e. a NN architecture or custom
        # hardcoded logic, for instance, how it is done in ContextAwareFileSplittingModel:
        #
        # for category in self.categories:
        #     cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
        #
        # This method does not return anything; rather, it modifies self.model if you provide this attribute.
        #
        # This method is allowed to be implemented as a no-op if you provide the trained model in other ways.
        pass

    def predict(self, page: Page) -> Page:
        # Define how the model determines a split point for a Page, for instance, how it is implemented in
        # ContextAwareFileSplittingModel:
        #
        # for category in self.categories:
        #     cur_first_page_strings = category.exclusive_first_page_strings(tokenizer=self.tokenizer)
        #     intersection = {span.offset_string.strip('\f').strip('\n') for span in page.spans()}.intersection(
        #         cur_first_page_strings
        #     )
        #     if len(intersection) > 0:
        #         page.is_first_page = True
        #         break
        #
        # NB: The classification needs to be run on the Page level, not the Document level – the result of
        # classification is reflected in the `is_first_page` attribute value, which is unique to the Page class
        # and is not present in the Document class. Pages with `is_first_page = True` become potential splitting
        # points; thus, each new Sub-Document has a Page predicted as first as its starting point.
        pass

    def check_is_ready(self) -> bool:
        # Define whether all components needed for training/prediction are set, for instance, whether
        # self.tokenizer is set or whether all Categories are non-empty, i.e. contain training and
        # testing Documents.
        pass
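Once these methods are implemented, the custom model plugs into the same Splitting AI interface as the built-in models. A usage sketch, assuming a Project and test Document set up as in the earlier examples:
from konfuzio_sdk.data import Project
from konfuzio_sdk.trainer.file_splitting import SplittingAI

project = Project(id_=YOUR_PROJECT_ID)
test_document = project.get_document_by_id(YOUR_DOCUMENT_ID)

model = CustomFileSplittingModel(categories=project.categories)
model.fit()
if model.check_is_ready():
    splitting_ai = SplittingAI(model)
    split_documents = splitting_ai.propose_split_documents(test_document)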
Evaluate a File Splitting AI¶
The FileSplittingEvaluation class can be used to evaluate the performance of the Context-Aware File Splitting Model, returning a set of metrics that includes precision, recall, F1 measure, True Positives, False Positives, True Negatives, and False Negatives.
The class’s methods calculate() and calculate_by_category() are run at initialization. The class receives two lists of Documents as input: the first consists of ground-truth Documents in which all first Pages are marked as such; the second consists of Documents on whose Pages a File Splitting Model has predicted whether they are first or non-first.
The initialization would look like this:
evaluation = FileSplittingEvaluation(
    ground_truth_documents=YOUR_GROUND_TRUTH_LIST, prediction_documents=YOUR_PREDICTION_LIST
)
The class compares each pair of Pages. If a Page is labeled as first and the model also predicted it as first, it is considered a True Positive. If a Page is labeled as first but the model predicted it as non-first, it is considered a False Negative. If a Page is labeled as non-first but the model predicted it as first, it is considered a False Positive. If a Page is labeled as non-first and the model also predicted it as non-first, it is considered a True Negative.
|                | predicted correctly | predicted incorrectly |
|----------------|---------------------|-----------------------|
| first Page     | TP                  | FN                    |
| non-first Page | TN                  | FP                    |
After iterating through all Pages of all Documents, precision, recall and F1 measure are calculated. If you wish the metrics to be set to None in case of an attempted zero division, set allow_zero=True at initialization.
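The metrics follow the standard definitions. As a quick arithmetic sketch, using the counts from the example in the next section:
tp, fp, fn = 3, 0, 1
precision = tp / (tp + fp)  # 1.0
recall = tp / (tp + fn)     # 0.75
f1 = 2 * precision * recall / (precision + recall)  # ~0.857, reported as 0.85 below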
To see a certain metric after the class has been initialized, you can call a metric’s method:
print(evaluation.fn())
It is also possible to look at the metrics calculated for each Category independently. For this, pass search=YOUR_CATEGORY_HERE when calling the wanted metric’s method:
print(evaluation.fn(search=YOUR_CATEGORY))
For more details, see the Python API Documentation on Evaluation.
Example of evaluation input and output¶
Suppose in our test dataset we have 2 Documents of 2 Categories: one 3-paged Document, consisting of a single file (so it has only one ground-truth first Page), of the first Category, and one 5-paged Document, consisting of three files (two 2-paged and one 1-paged, so it has three ground-truth first Pages), of the second Category.

First document

Second document
from konfuzio_sdk.data import Document, Page
from konfuzio_sdk.evaluate import FileSplittingEvaluation, EvaluationCalculator
from konfuzio_sdk.trainer.file_splitting import SplittingAI
# This example builds the Documents from scratch and without uploading a Supported File.
# If you uploaded your Document to the Konfuzio Server, you can just retrieve it with:
# document_1 = project.get_document_by_id(YOUR_DOCUMENT_ID)
text_1 = "Hi all,\nI like bread.\nI hope to get everything done soon.\nHave you seen it?"
document_1 = Document(id_=20, project=YOUR_PROJECT, category=YOUR_CATEGORY_1, text=text_1, dataset_status=3)
_ = Page(
    id_=None, original_size=(320, 240), document=document_1, start_offset=0, end_offset=21, number=1, copy_of_id=29
)
_ = Page(
    id_=None, original_size=(320, 240), document=document_1, start_offset=22, end_offset=57, number=2, copy_of_id=30
)
_ = Page(
    id_=None, original_size=(320, 240), document=document_1, start_offset=58, end_offset=75, number=3, copy_of_id=31
)
# As with the previous example Document, you can just retrieve an online Document with
# document_2 = project.get_document_by_id(YOUR_DOCUMENT_ID)
text_2 = "Evening,\nthank you for coming.\nI like fish.\nI need it.\nEvening."
document_2 = Document(id_=21, project=YOUR_PROJECT, category=YOUR_CATEGORY_2, text=text_2, dataset_status=3)
_ = Page(
    id_=None, original_size=(320, 240), document=document_2, start_offset=0, end_offset=8, number=1, copy_of_id=32
)
_ = Page(
    id_=None, original_size=(320, 240), document=document_2, start_offset=9, end_offset=30, number=2, copy_of_id=33
)
_ = Page(
    id_=None, original_size=(320, 240), document=document_2, start_offset=31, end_offset=43, number=3, copy_of_id=34
)
# mark the ground-truth first Page of the second sub-file
_.is_first_page = True
_ = Page(
    id_=None, original_size=(320, 240), document=document_2, start_offset=44, end_offset=54, number=4, copy_of_id=35
)
_ = Page(
    id_=None, original_size=(320, 240), document=document_2, start_offset=55, end_offset=63, number=5, copy_of_id=36
)
# mark the ground-truth first Page of the third sub-file
_.is_first_page = True
We need to pass two lists of Documents into the FileSplittingEvaluation class, so before that we need to run each Page of the Documents through the model’s prediction.
Let’s say the evaluation gave good results, with only one first Page predicted as non-first and all other Pages predicted correctly. The evaluation could be implemented as follows:
splitting_ai = SplittingAI(YOUR_MODEL)
pred_1: Document = splitting_ai.propose_split_documents(document_1, return_pages=True)[0]
pred_2: Document = splitting_ai.propose_split_documents(document_2, return_pages=True)[0]
evaluation = FileSplittingEvaluation(
    ground_truth_documents=[document_1, document_2], prediction_documents=[pred_1, pred_2]
)
print(evaluation.tp())
# returns: 3
print(evaluation.tn())
# returns: 4
print(evaluation.fp())
# returns: 0
print(evaluation.fn())
# returns: 1
print(evaluation.precision())
# returns: 1
print(evaluation.recall())
# returns: 0.75
print(evaluation.f1())
# returns: 0.85
Our results can be summarized in the following table:
| TPs | TNs | FPs | FNs | precision | recall | F1   |
|-----|-----|-----|-----|-----------|--------|------|
| 3   | 4   | 0   | 1   | 1         | 0.75   | 0.85 |
If we want to see evaluation results by Category, the implementation of the Evaluation would look like this:
print(evaluation.tp(search=YOUR_CATEGORY_1), evaluation.tp(search=YOUR_CATEGORY_2))
# returns: 1 2
print(evaluation.tn(search=YOUR_CATEGORY_1), evaluation.tn(search=YOUR_CATEGORY_2))
# returns: 2 2
print(evaluation.fp(search=YOUR_CATEGORY_1), evaluation.fp(search=YOUR_CATEGORY_2))
# returns: 0 0
print(evaluation.fn(search=YOUR_CATEGORY_1), evaluation.fn(search=YOUR_CATEGORY_2))
# returns: 0 1
print(evaluation.precision(search=YOUR_CATEGORY_1), evaluation.precision(search=YOUR_CATEGORY_2))
# returns: 1 1
print(evaluation.recall(search=YOUR_CATEGORY_1), evaluation.recall(search=YOUR_CATEGORY_2))
# returns: 1 0.66
print(evaluation.f1(search=YOUR_CATEGORY_1), evaluation.f1(search=YOUR_CATEGORY_2))
# returns: 1 0.8
The output can be summarized in the following table:
| Category   | TPs | TNs | FPs | FNs | precision | recall | F1  |
|------------|-----|-----|-----|-----|-----------|--------|-----|
| Category 1 | 1   | 2   | 0   | 0   | 1         | 1      | 1   |
| Category 2 | 2   | 2   | 0   | 1   | 1         | 0.66   | 0.8 |
To log metrics after evaluation, you can call the EvaluationCalculator’s method metrics_logging (you will need to specify the metrics accordingly at the class’s initialization). Example usage:
EvaluationCalculator(tp=3, fp=0, fn=1, tn=4).metrics_logging()