The functionalities of the Trainer module are not yet available in the SDK.
LabelSectionModel Examples¶

A LabelSectionModel is a model that takes in a Document and predicts a Label per token and a SectionLabel for each line in the document.
Training our first LabelSectionModel¶
A LabelSectionModel contains both a LabelClassifier and a SectionClassifier. Both classifiers have set default modules and training is performed with a set of default hyperparameters. The build method returns metrics for each classifier: a dictionary of lists with the loss and accuracy values per batch for training/evaluation, which can be used, e.g., for visualization.
from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel
# load the project
project = Project(id_=YOUR_PROJECT_ID)
# create a default label section model from the project
model = LabelSectionModel(project)
# build (i.e. train) the label section model
label_classifier_metrics, section_classifier_metrics = model.build()
# save the trained label section model
model.save()
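The returned metrics can be inspected directly. As a minimal sketch, continuing from the example above and assuming each metrics object is a dictionary of per-batch lists (iterate the dictionary rather than relying on specific key names), they can be plotted with matplotlib:
import matplotlib.pyplot as plt
# plot every metric series in the label classifier metrics dictionary
for name, values in label_classifier_metrics.items():
    plt.plot(values, label=name)
plt.xlabel('batch')
plt.title('label classifier training metrics')
plt.legend()
plt.show()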
Customizing the LabelSectionModel¶
We can also control the tokenization method and the hyperparameters of the LabelSectionModel. We change the tokenization from the default whitespace tokenization to BPE (byte-pair encoding) tokenization. The LabelClassifier now has a dropout of 0.25 and contains a 2-layer unidirectional LSTM module. The SectionClassifier now has a dropout of 0.5 and contains an NBOW module with a 64-dimensional embedding layer. Training both classifiers is still performed with the default training hyperparameters.
from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel
from konfuzio.tokenizers import BPETokenizer
project = Project(id_=YOUR_PROJECT_ID)
# specify a different tokenizer
tokenizer = BPETokenizer()
# configuration dict for the label classifier
label_classifier_config = {'dropout_rate': 0.25,
                           'text_module': {'name': 'lstm',
                                           'n_layers': 2,
                                           'bidirectional': False}}
# configuration dict for the section classifier
section_classifier_config = {'dropout_rate': 0.5,
                             'text_module': {'name': 'nbow',
                                             'emb_dim': 64}}
# create label section model with chosen tokenizer and classifier configs
model = LabelSectionModel(project,
                          tokenizer=tokenizer,
                          label_classifier_config=label_classifier_config,
                          section_classifier_config=section_classifier_config)
# build the label section model with default training hyperparameters
model.build()
# save the trained label section model
model.save()
Setting the LabelSectionModel training hyperparameters¶
We’ll now use the default classifier hyperparameters but customize the training hyperparameters. Examples of hyperparameters that can be changed include the validation ratio, batch size, number of epochs, patience, optimizer, and learning rate decay.
from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel
# load the project
project = Project(id_=YOUR_PROJECT_ID)
# create a default label section model from the project
model = LabelSectionModel(project)
label_training_config = {'valid_ratio': 0.15,  # what percentage of training data should be used to create validation data
                         'batch_size': 64,  # size of batches used for training
                         'seq_len': 100,  # number of sequential tokens to predict over
                         'n_epochs': 100,  # number of epochs to train for
                         'patience': 0,  # number of epochs without improvement in validation loss before we stop training
                         'optimizer': {'name': 'Adam', 'lr': 1e-5}}  # optimizer hyperparameters
section_training_config = {'valid_ratio': 0.2,
                           'batch_size': 128,
                           'max_len': 100,  # maximum tokens per line to consider
                           'n_epochs': 50,
                           'patience': 3,
                           'optimizer': {'name': 'RMSprop', 'lr': 1e-3, 'momentum': 0.9},
                           'lr_decay': 0.9}  # if validation loss does not improve, multiply learning rate by this value
# build model with training configs
model.build(label_training_config=label_training_config,
            section_training_config=section_training_config)
# save the trained label section model
model.save()
Customizing the LabelSectionModel model and training hyperparameters¶
Here we customize both the LabelSectionModel hyperparameters and the training hyperparameters. This example combines the two examples above.
from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel
from konfuzio.tokenizers import BPETokenizer
# load the project
project = Project(id_=YOUR_PROJECT_ID)
# specify a different tokenizer
tokenizer = BPETokenizer()
# configuration dict for the label classifier
label_classifier_config = {'dropout_rate': 0.25,
                           'text_module': {'name': 'lstm',
                                           'n_layers': 2,
                                           'bidirectional': False}}
# configuration dict for the section classifier
section_classifier_config = {'dropout_rate': 0.5,
                             'text_module': {'name': 'nbow',
                                             'emb_dim': 64}}
# create label section model with chosen tokenizer and classifier configs
model = LabelSectionModel(project,
                          tokenizer=tokenizer,
                          label_classifier_config=label_classifier_config,
                          section_classifier_config=section_classifier_config)
label_training_config = {'valid_ratio': 0.15,  # what percentage of training data should be used to create validation data
                         'batch_size': 64,  # size of batches used for training
                         'seq_len': 100,  # number of sequential tokens to predict over
                         'n_epochs': 100,  # number of epochs to train for
                         'patience': 0,  # number of epochs without improvement in validation loss before we stop training
                         'optimizer': {'name': 'Adam', 'lr': 1e-5}}  # optimizer hyperparameters
section_training_config = {'valid_ratio': 0.2,
                           'batch_size': 128,
                           'max_len': 100,  # maximum tokens per line to consider
                           'n_epochs': 50,
                           'patience': 3,
                           'optimizer': {'name': 'RMSprop', 'lr': 1e-3, 'momentum': 0.9},
                           'lr_decay': 0.9}  # if validation loss does not improve, multiply learning rate by this value
# build model with training configs
model.build(label_training_config=label_training_config,
            section_training_config=section_training_config)
# save the trained label section model
model.save()
Implementing a custom LabelSectionModel training loop¶
The build method of LabelSectionModel calls self.build_label_classifier and self.build_section_classifier, both of which call a generic self.fit_classifier for the label and section classifiers. If we want to customize the fit_classifier method - e.g. change the way the classifiers are trained based on some specific criteria, such as using a custom loss function - then we can override the fit_classifier method. The custom fit_classifier method should take in the train, valid and test examples, the classifier, and any configuration arguments.
from typing import Dict, List
from torch.utils.data import DataLoader
from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel

class CustomerSpecificModel(LabelSectionModel):
    def fit_classifier(self, train_examples: DataLoader, valid_examples: DataLoader, test_examples: DataLoader, classifier: Classifier, **kwargs) -> Dict[str, List[float]]:
        # new code must take in train/valid/test examples, the classifier model, and any configuration arguments supplied as kwargs from the config dict
        # custom code goes here
        return metrics
# load the project
project = Project(id_=YOUR_PROJECT_ID)
# can also use a custom tokenizer and model config here
model = CustomerSpecificModel(project)
# example custom configuration dicts
label_training_config = {'custom_loss_function_arg': 123}
section_training_config = {'custom_loss_function_arg': 999}
# build model with training configs
label_classifier_metrics, section_classifier_metrics = model.build(label_training_config,
                                                                   section_training_config)
# save the trained label section model
model.save()
Implementing a custom LabelSectionModel classifier training loop¶
By default, both classifiers use the same generic fit_classifier function. If we want each to have its own custom fit_classifier function, we can overwrite the build_label_classifier/build_section_classifier functions and implement a custom fit_label_classifier/fit_section_classifier function within them.
We can use the existing functions to get the data iterators and then use our custom fit_label_classifier/fit_section_classifier functions in place of the generic fit_classifier function.
We could also customize the format of the training data by writing our own get_label_classifier_iterators/get_section_classifier_iterators functions. These must return a PyTorch DataLoader for the training, validation, and test sets, and must be compatible with the custom fit_label_classifier/fit_section_classifier functions.
Below is an example of how to implement our own fit_label_classifier function.
from typing import Dict, List
from torch.utils.data import DataLoader
from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel
from konfuzio.default_models.utils import get_label_classifier_iterators
from konfuzio.default_models.utils import get_section_classifier_iterators

class CustomerSpecificModel(LabelSectionModel):
    def fit_label_classifier(self, train_examples: DataLoader, valid_examples: DataLoader, test_examples: DataLoader, classifier: Classifier, **kwargs) -> Dict[str, List[float]]:
        # custom code goes here
        return metrics

    def build_label_classifier(self, label_training_config: dict = {}) -> Dict[str, List[float]]:
        # get the iterators over examples for the label classifier
        examples = get_label_classifier_iterators(self.projects,
                                                  self.tokenizer,
                                                  self.text_vocab,
                                                  self.label_vocab,
                                                  **label_training_config)
        # unpack the examples
        train_examples, valid_examples, test_examples = examples
        # place label classifier on device
        self.label_classifier = self.label_classifier.to(self.device)
        # now uses our custom fit_label_classifier instead of generic fit_classifier
        label_classifier_metrics = self.fit_label_classifier(train_examples,
                                                             valid_examples,
                                                             test_examples,
                                                             self.label_classifier,
                                                             **label_training_config)
        # put label classifier back on cpu to free up GPU memory
        self.label_classifier = self.label_classifier.to('cpu')
        return label_classifier_metrics
# load the project
project = Project(id_=YOUR_PROJECT_ID)
# define model with custom build_label_classifier function
model = CustomerSpecificModel(project)
# specify any custom training hyperparameters
label_training_config = {'custom_loss_function_arg': 123}
# the section classifier keeps the default training hyperparameters
section_training_config = {}
# build model with training configs
label_classifier_metrics, section_classifier_metrics = model.build(label_training_config,
                                                                   section_training_config)
# save the trained model
model.save()
DocumentModel Examples¶

A DocumentModel takes pages as input and predicts the “category” (the project ID) for each page. It can use both image features (from a .png image of the page) and text features (from the OCR text of the page).
Training our first DocumentModel¶
A DocumentModel contains a DocumentClassifier. Similar to the LabelSectionModel, it has a set of default model hyperparameters and default training hyperparameters.
from konfuzio.data import Project
from konfuzio.default_models import DocumentModel
# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]
# create default document model from a list of projects
model = DocumentModel(projects)
# build (i.e. train) the document model
document_classifier_metrics = model.build()
# save the document model
model.save()
Customizing the DocumentModel¶
Below we show how to implement a custom tokenizer, image preprocessing, image augmentation, and classifier. Image preprocessing is applied to images for training, evaluation, and inference. Image augmentation is only applied during training. We only need a multimodal_module when we use both text and image modules.
from konfuzio.data import Project
from konfuzio.default_models import DocumentModel
from konfuzio.tokenizers import BPETokenizer
# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]
# specify a custom tokenizer
tokenizer = BPETokenizer()
# specify how images should be preprocessed
image_preprocessing = {'target_size': (1000, 1000),
                       'grayscale': True}
# specify how images should be augmented during training
image_augmentation = {'rotate': 5}
# configuration dict for the document classifier
document_classifier_config = {'image_module': {'name': 'efficientnet_b0',
                                               'freeze': False},
                              'text_module': {'name': 'lstm',
                                              'n_layers': 2},
                              'multimodal_module': {'name': 'concatenate',
                                                    'hid_dim': 512}}
# create document model with chosen tokenizer and classifier configs
model = DocumentModel(projects,
                      tokenizer=tokenizer,
                      image_preprocessing=image_preprocessing,
                      image_augmentation=image_augmentation,
                      document_classifier_config=document_classifier_config)
# build (i.e. train) the document model
document_classifier_metrics = model.build()
# save the document model
model.save()
Classifying documents using image features only¶
To use a DocumentModel that only uses the image of the document, simply do not include a text_module or multimodal_module in the classifier config. Passing a tokenizer when we have no text_module will throw an error, as there is no text to tokenize, so we make sure to pass None to the tokenizer argument like so:
from konfuzio.data import Project
from konfuzio.default_models import DocumentModel
# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]
# ensure we do not use a tokenizer as we are not using a text_module
tokenizer = None
# specify how images should be preprocessed
image_preprocessing = {'target_size': (1000, 1000),
                       'grayscale': True}
# specify how images should be augmented during training
image_augmentation = {'rotate': 5}
# configuration dict for the document classifier
# no text_module AND no multimodal_module
document_classifier_config = {'image_module': {'name': 'efficientnet_b0',
                                               'freeze': False}}
# create document model with chosen tokenizer and classifier configs
model = DocumentModel(projects,
                      tokenizer=tokenizer,
                      image_preprocessing=image_preprocessing,
                      image_augmentation=image_augmentation,
                      document_classifier_config=document_classifier_config)
# build (i.e. train) the document model
document_classifier_metrics = model.build()
# save the document model
model.save()
Classifying documents using text features only¶
To use a DocumentModel that only uses the text of the document, simply do not include an image_module or multimodal_module in the classifier config. Passing an image_preprocessing or image_augmentation argument when we have no image_module will throw an error, so we need to ensure we pass None to the image_preprocessing and image_augmentation arguments like so:
from konfuzio.data import Project
from konfuzio.default_models import DocumentModel
from konfuzio.tokenizers import BPETokenizer
# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]
tokenizer = BPETokenizer()
# both should be None when not using an image_module
image_preprocessing = None
image_augmentation = None
# configuration dict for the document classifier
# no image_module AND no multimodal_module
document_classifier_config = {'text_module': {'name': 'lstm',
                                              'n_layers': 2}}
# create document model with chosen tokenizer and classifier configs
model = DocumentModel(projects,
                      tokenizer=tokenizer,
                      image_preprocessing=image_preprocessing,
                      image_augmentation=image_augmentation,
                      document_classifier_config=document_classifier_config)
# build (i.e. train) the document model
document_classifier_metrics = model.build()
# save the document model
model.save()
Setting the DocumentModel training hyperparameters¶
Similar to the LabelSectionModel, we can customize the training config, which will work with any classifier/module combination:
from konfuzio.data import Project
from konfuzio.default_models import DocumentModel
# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]
# create a default document model
model = DocumentModel(projects)
# define the custom training hyperparameters
document_training_config = {'valid_ratio': 0.2,
                            'batch_size': 128,
                            'max_len': 100,  # maximum tokens per page to consider, will do nothing if no text_module used
                            'n_epochs': 50,
                            'patience': 3,
                            'optimizer': {'name': 'RMSprop', 'lr': 1e-3, 'momentum': 0.9},
                            'lr_decay': 0.9}
# build (i.e. train) the document model with custom training hyperparameters
document_classifier_metrics = model.build(document_training_config=document_training_config)
# save the document model
model.save()
Implementing a custom DocumentModel training loop¶
We can also override the fit_classifier method to define our own way of training the document classifier. We can do a similar thing by overwriting build to define our own custom data processing.
from typing import Dict
from torch.utils.data import DataLoader
from konfuzio.data import Project
from konfuzio.default_models import DocumentModel

class CustomerSpecificModel(DocumentModel):
    def fit_classifier(self, train_examples: DataLoader, valid_examples: DataLoader, test_examples: DataLoader, classifier: Classifier, **kwargs) -> Dict[str, float]:
        # new code must take in train/valid/test examples, the classifier model, and kwargs from the config dict
        # custom code goes here
        return metrics

# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]
# can also use a custom tokenizer and model config here
model = CustomerSpecificModel(projects)
# custom fit_classifier config
custom_document_training_config = {'custom_loss_function_arg': 123}
# train document model with custom fit_classifier function
custom_document_classifier_metrics = model.build(document_training_config=custom_document_training_config)
# save a trained model
model.save()
ParagraphModel Examples¶

A ParagraphModel takes the text of the pages as input and predicts the “category” for each paragraph in that text. It uses text features (from the OCR text of the page).
Training our first ParagraphModel¶
A ParagraphModel contains a ParagraphClassifier. Similar to the DocumentModel, it has a set of default model hyperparameters and default training hyperparameters.
from konfuzio.data import Project
from konfuzio.default_models import ParagraphModel
# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]
# create default paragraph model from a list of projects
model = ParagraphModel(projects)
# build (i.e. train) the paragraph model
paragraph_classifier_metrics = model.build()
# save the paragraph model
model.save()
Customizing the ParagraphModel¶
Below we show how to implement a custom tokenizer and classifier. We only need a multimodal_module when we use both text and image modules.
from konfuzio.data import Project
from konfuzio.default_models import ParagraphModel
from konfuzio.tokenizers import BPETokenizer
# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]
# specify a custom tokenizer
tokenizer = BPETokenizer()
# configuration dict for the paragraph classifier
paragraph_classifier_config = {'text_module': {'name': 'lstm',
                                               'n_layers': 2}}
# create paragraph model with chosen tokenizer and classifier configs
model = ParagraphModel(projects,
                       tokenizer=tokenizer,
                       paragraph_classifier_config=paragraph_classifier_config)
# build (i.e. train) the paragraph model
paragraph_classifier_metrics = model.build()
# save the paragraph model
model.save()
Setting the ParagraphModel training hyperparameters¶
Similar to the DocumentModel, we can customize the training config, which will work with any classifier/module combination:
from konfuzio.data import Project
from konfuzio.default_models import ParagraphModel
# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]
# create a default paragraph model
model = ParagraphModel(projects)
# define the custom training hyperparameters
paragraph_training_config = {'valid_ratio': 0.2,
                             'batch_size': 128,
                             'max_len': 100,  # maximum tokens per page to consider, will do nothing if no text_module used
                             'n_epochs': 50,
                             'patience': 3,
                             'optimizer': {'name': 'RMSprop', 'lr': 1e-3, 'momentum': 0.9},
                             'lr_decay': 0.9,
                             'no_label_limit': 10}
# build (i.e. train) the paragraph model with custom training hyperparameters
paragraph_classifier_metrics = model.build(paragraph_training_config=paragraph_training_config)
# save the paragraph model
model.save()
Design Philosophy¶
A “model” encapsulates everything we need for training on a labeled dataset and then performing extraction (inference) on some real data.
Models contain “classifiers”, one for each task the model is performing. The LabelSectionModel has a label classifier for POS tagging of tokens within a document and a section classifier for labeling sections of a document. The DocumentModel has a document classifier for predicting the category of each page within a document. Each classifier is a PyTorch nn.Module.
Each classifier is made up of “modules”. These are the main core of the classifier and are usually some kind of neural network architecture. Modules are split into three different categories: image modules (e.g. VGG and EfficientNet), text modules (e.g. NBOW, LSTM, and Transformer), and multimodal modules (e.g. concatenation of image and text features). Each module is also a PyTorch nn.Module. Modules can contain other modules that can contain other modules - it’s modules all the way down!
The modules are designed to be agnostic to the classifier they are within, with the actual task-specific layers contained in the classifier itself. Every text module takes in a sequence of tokens and outputs a sequence of tensors. Every image module takes in an image and outputs a tensor of features.
Let’s look at an example of a label classifier containing an LSTM module. The classifier takes in text and feeds it to the LSTM module, which calculates a hidden state per token within the text and returns this sequence of hidden states to the classifier, which passes them through a linear layer to re-size them to the desired output dimensionality.
For a section classifier with an LSTM module: the classifier takes in the text and feeds it to the LSTM module, which returns the hidden states to the classifier; the classifier pools the hidden states and then passes them through a linear layer to make a prediction.
The document classifier with only a text module is the same as the section classifier. A document classifier with both a text and an image module will calculate the text features and the image features, then pass both to a multimodal module that combines the two, performs some calculations to get multimodal features, and then passes these multimodal features to the classifier, which makes a prediction.
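To make the split between modules and classifiers concrete, here is a minimal PyTorch sketch. These toy classes are illustrative only, not the SDK's classifiers: the text module is assumed to produce one vector per token, the label classifier adds a per-token linear layer, and the section classifier pools the hidden states before its linear layer.
import torch
import torch.nn as nn

class ToyLabelClassifier(nn.Module):
    # per-token classification: one prediction per token
    def __init__(self, text_module: nn.Module, n_features: int, n_labels: int):
        super().__init__()
        self.text_module = text_module
        self.fc = nn.Linear(n_features, n_labels)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        hidden = self.text_module(tokens)   # [batch, seq_len, n_features], one vector per token
        return self.fc(hidden)              # [batch, seq_len, n_labels]

class ToySectionClassifier(nn.Module):
    # per-line classification: pool the hidden states, then predict once
    def __init__(self, text_module: nn.Module, n_features: int, n_section_labels: int):
        super().__init__()
        self.text_module = text_module
        self.fc = nn.Linear(n_features, n_section_labels)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        hidden = self.text_module(tokens)   # [batch, seq_len, n_features]
        pooled = hidden.mean(dim=1)         # pool over the tokens of the line
        return self.fc(pooled)              # [batch, n_section_labels]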
Models are the glue holding everything together in a nice package. The most important model attributes are:
- tokenizer - a tokenizer that tokenizes text by converting a string to a list of strings.
- Vocabularies, which contain a mapping between a token (string) and an index (int), and vice versa. The types of vocabularies are:
  - text_vocab - the vocabulary for the document text
  - label_vocab - the vocabulary for the annotation labels (only if using a label classifier)
  - section_vocab - the vocabulary for the section labels (only if using a section classifier)
  - category_vocab - the vocabulary for the document categories (only if using a document classifier)
- Classifiers, which are specific to the desired task. Each classifier contains module(s). The types of classifiers are:
  - label_classifier - an instance of a LabelClassifier used to predict the annotation label for each token (only used in a LabelSectionModel)
  - section_classifier - an instance of a SectionClassifier used to predict the section label for each line in the document (only in a LabelSectionModel)
  - document_classifier - used to predict the category of each page in the document (only in a DocumentModel) and can be one of three types:
    - a DocumentTextClassifier, which classifies a document’s page using only the text on that page
    - a DocumentImageClassifier, which classifies a document’s page using only an image of the page
    - a DocumentMultiModalClassifier, which classifies a document’s page using both the text and image of that page
- Configuration dictionaries, with one configuration dictionary per classifier used in the model, e.g. label_classifier_config. They are used to define the hyperparameters of the classifier and also to load the correct classifier when loading the model. Training configuration dictionaries are not saved as a model attribute, as the model should operate independently of how it was trained.
- image_preprocessing - a dictionary that states how images should be pre-processed before being classified (only in a DocumentModel)
- image_augmentation - a dictionary that states how images should be augmented during the training of the classifier (only in a DocumentModel)
Tokenizers¶
A tokenizer is a function that defines how a string should be separated into tokens (a list of strings), e.g.:
tokenizer.get_tokens('hello world') -> ['hello', 'world']
The konfuzio.tokenizers module contains a few tokenizers which can either be directly imported, i.e. from konfuzio.tokenizers import WhitespaceTokenizer, or obtained using the get_tokenizer function, i.e.:
from konfuzio.tokenizers import get_tokenizer
tokenizer = get_tokenizer('whitespace') # gets a WhitespaceTokenizer
from konfuzio.tokenizers import WhitespaceTokenizer
tokenizer = WhitespaceTokenizer() # same as above
All tokenizers have the following methods:
- get_tokens takes in a string and returns a list of tokens (strings)
- get_entities is the same as get_tokens but also contains the start and end character offsets for each token (represented as a list of dicts). This is usually slower than get_tokens, so should only be used if we explicitly need the character offsets.
- get_annotations is the same as get_entities but converts each entity to an Annotation object and returns a list of annotations.
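A short usage example of the first two methods (the exact keys in each entity dict are not specified here, so print one to inspect them):
from konfuzio.tokenizers import WhitespaceTokenizer
tokenizer = WhitespaceTokenizer()
tokens = tokenizer.get_tokens('hello world')      # ['hello', 'world']
entities = tokenizer.get_entities('hello world')  # tokens plus start/end character offsets
print(tokens)
print(entities[0])  # a dict for the first token; inspect it to see the exact keys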
Currently available tokenizers:
WhitespaceTokenizer¶
from konfuzio.tokenizers import WhitespaceTokenizer
from konfuzio.tokenizers import get_tokenizer
tokenizer = get_tokenizer('whitespace')
Uses regular expressions to split a string based on whitespace. Very fast but naive method of tokenization.
SpacyTokenizer¶
from konfuzio.tokenizers import SpacyTokenizer
from konfuzio.tokenizers import get_tokenizer
tokenizer = get_tokenizer('spacy')
Tokenizes using the de_core_news_sm spaCy model. Relatively slow.
PhraseMatcherTokenizer¶
from konfuzio.data import Project
from konfuzio.tokenizers import PhraseMatcherTokenizer
from konfuzio.tokenizers import get_tokenizer
project = Project(id_=YOUR_PROJECT_ID)
tokenizer = get_tokenizer('phrasematcher', project)
Note that this tokenizer also has to take in a Project, or a list of Projects, to build the PhraseMatcher.
This builds a spaCy de_core_news_sm phrase matcher using the annotation labels for each project and then uses this learned matching to tokenize data. We can think of this as learning a simple regex pattern matcher from the data. This is relatively slow, especially when the dataset is large.
BPETokenizer¶
from konfuzio.data import Project
from konfuzio.tokenizers import BPETokenizer
from konfuzio.tokenizers import get_tokenizer
tokenizer = get_tokenizer('bert-base-german-cased')
Gets a pre-trained byte-pair encoding tokenizer from the HuggingFace Transformers library. Officially we support four different variants of the BPETokenizer (other variants should work, however they are not tested):
tokenizer = BPETokenizer('bert-base-german-cased')
tokenizer = BPETokenizer('bert-base-german-dbmdz-cased')
tokenizer = BPETokenizer('bert-base-german-dbmdz-uncased')
tokenizer = BPETokenizer('distilbert-base-german-cased')
By default, BPETokenizer gets the bert-base-german-cased variant.
# both of these get the same tokenizer
tokenizer = BPETokenizer()
tokenizer = BPETokenizer('bert-base-german-cased')
These tokenizers are special as they have their own custom vocabulary - accessed via tokenizer.vocab - because they are designed to be used with a pre-trained Transformer model that must use the vocabulary it was trained with. Initializing a model with a BPETokenizer automatically sets the text_vocab to the tokenizer.vocab. These tokenizers still perform very well with non-Transformer models and are also the fastest tokenizers.
Vocabularies¶
A vocabulary is a mapping between tokens and integers.
A vocab is usually initialized with a collections.Counter - a dictionary where the keys are the tokens and the values are how many times that token appears in the training set. It can also be initialized with a dict where the keys are the tokens and the values are the integer representations; this is used to create a Vocab object from an existing vocabulary already represented as a dictionary.
Two vocab arguments are max_size and min_freq. A max_size of 30,000 means that only the 30,000 most common tokens are used to create the vocabulary. A min_freq of 2 means that only tokens that appear at least twice are used to create the vocabulary.
Two other arguments are unk_token and pad_token. When trying to convert a token to an integer and the token is NOT in the vocabulary, the token is replaced by the unk_token. If the unk_token is set to None, then the vocabulary will throw an error when it tries to convert a token that is not in the vocabulary to an integer - this is usually used when creating a vocabulary over the labels. A pad_token is a token used for padding sequences, effectively a no-op, and can also be None. If the unk_token and pad_token are not None, then the vocab will also have unk_idx and pad_idx attributes which give the integer values of the unk_token and pad_token - this is more for convenience than anything else.
The final argument is special_tokens, a list of tokens that are guaranteed to appear in the vocabulary.
To convert from a token to an integer, use the stoi (string to int) method, i.e. vocab.stoi('hello'). To convert from an integer to a token, use the itos (int to string) method, i.e. vocab.itos(123).
We can get the list of all the tokens within the vocab with vocab.get_tokens, and a list of all integers with vocab.get_indexes.
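The import path and exact constructor signature of the SDK's Vocab class are not shown in this document, so treat the following as a hedged sketch of how the arguments described above fit together (the import path and keyword names below are assumptions):
from collections import Counter
# hypothetical import path -- check the SDK for the actual location of Vocab
from konfuzio.vocab import Vocab

# count token frequencies over the training data
counter = Counter(['invoice', 'total', 'total', 'date', 'date', 'date'])
# hypothetical constructor call mirroring the arguments described above
vocab = Vocab(counter,
              max_size=30_000,       # keep only the 30,000 most frequent tokens
              min_freq=2,            # drop tokens appearing fewer than 2 times
              unk_token='<unk>',     # out-of-vocabulary tokens map to this
              pad_token='<pad>',     # used for padding sequences
              special_tokens=['<unk>', '<pad>'])
index = vocab.stoi('total')  # token -> integer
token = vocab.itos(index)    # integer -> token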
The LabelSectionModel and DocumentModel will create a text_vocab with the provided tokenizer unless:
- a text_vocab is provided, at which point the model will use that vocab and not create one from the data. This is usually done when loading a trained model.
- the tokenizer used has a vocab, in which case we also use that vocab and do not create one. This is usually the case when using a BPE tokenizer.
For the label_vocab, section_vocab, and category_vocab, one is created from the data unless one is provided. Again, a provided vocab usually means we are loading a saved model.
Text Modules¶
There are currently four text modules available. Each module takes a sequence of tokens as input and outputs a sequence of “hidden states”, i.e. one vector per input token. The size of each of the hidden states can be found with the module’s n_features parameter.
NBOW¶
The neural bag-of-words (NBOW) model is the simplest of models: it simply passes each token through an embedding layer. As shown in the fastText paper, this model is still able to achieve comparable performance to some deep learning models whilst being considerably faster.
One downside of this model is that tokens are embedded without regard to the surrounding context in which they appear, e.g. the embedding for “May” in the two sentences “May I speak to you?” and “I am leaving on the 1st of May” is identical, even though they have different semantics.
Important arguments:
- emb_dim - the dimensions of the embedding vector
- dropout_rate - the amount of dropout applied to the embedding vectors
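A minimal PyTorch sketch of the NBOW idea (illustrative only, not the SDK's module):
import torch
import torch.nn as nn

class ToyNBOW(nn.Module):
    # each token is embedded independently, with no surrounding context
    def __init__(self, vocab_size: int, emb_dim: int = 64, dropout_rate: float = 0.0):
        super().__init__()
        self.n_features = emb_dim
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # [batch, seq_len] -> [batch, seq_len, emb_dim], one vector per token
        return self.dropout(self.embedding(tokens))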
NBOWSelfAttention¶
This is an NBOW model with a multi-headed self-attention layer, detailed here, added after the embedding layer. This effectively contextualizes the output, as each hidden state is now calculated from the embedding vector of a token and the embedding vectors of all other tokens within the sequence.
Important arguments:
- emb_dim - the dimensions of the embedding vector
- dropout_rate - the amount of dropout applied to the embedding vectors
- n_heads - the number of attention heads to use in the multi-headed self-attention layer. Note that n_heads must be a factor of emb_dim, i.e. emb_dim % n_heads == 0.
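A quick way to see the constraint, using PyTorch's own multi-headed attention as an analogy rather than the SDK module:
import torch.nn as nn

emb_dim, n_heads = 64, 8
assert emb_dim % n_heads == 0  # n_heads must be a factor of emb_dim

# PyTorch's multi-headed self-attention enforces the same requirement
self_attention = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)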
LSTM¶
The LSTM (long short-term memory) is a variant of an RNN (recurrent neural network). It feeds the input tokens through an embedding layer and then processes them sequentially with the LSTM, outputting a hidden state for each token. If the LSTM is bi-directional, it trains a forward and a backward LSTM per layer and concatenates the forward and backward hidden states for each token.
Important arguments:
- emb_dim - the dimensions of the embedding vector
- hid_dim - the dimensions of the hidden states
- n_layers - how many LSTM layers to use
- bidirectional - if the LSTM should be bidirectional
- dropout_rate - the amount of dropout applied to the embedding vectors and between LSTM layers if n_layers > 1
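A quick check in plain PyTorch (not the SDK module) showing that a bidirectional LSTM concatenates the forward and backward hidden states, doubling the per-token feature size:
import torch
import torch.nn as nn

emb_dim, hid_dim, n_layers = 64, 128, 2
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers,
               bidirectional=True, batch_first=True,
               dropout=0.25)  # applied between layers since n_layers > 1

embedded = torch.randn(8, 50, emb_dim)  # [batch, seq_len, emb_dim]
hidden, _ = lstm(embedded)
print(hidden.shape)                     # torch.Size([8, 50, 256]) == hid_dim * 2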
BERT¶
BERT (bi-directional encoder representations from Transformers) is a family of large Transformer models. The available BERT variants are all pre-trained models provided by the transformers library. It is usually infeasible to train a BERT model from scratch due to the significant amount of computation required. However, the pre-trained models can be easily fine-tuned on desired data.
Important arguments:
- name - the name of the pre-trained BERT variant to use
- freeze - should the BERT model be frozen, i.e. the pre-trained parameters are not updated
The BERT variants, i.e. name arguments, that are covered by internal tests are:
- 'bert-base-german-cased'
- 'bert-base-german-dbmdz-cased'
- 'bert-base-german-dbmdz-uncased'
- 'distilbert-base-german-cased'
In theory, all variants beginning with bert-base-* and distilbert-* should work out of the box. Other BERT variants come with no guarantees.
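To make the freeze argument concrete, here is a hedged sketch using the transformers library directly (not the SDK's BERT module):
from transformers import AutoModel

bert = AutoModel.from_pretrained('bert-base-german-cased')

freeze = True
if freeze:
    for parameter in bert.parameters():
        parameter.requires_grad = False  # pre-trained weights will not be updated during training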
Image Modules¶
We currently have two image modules available, each with several variants. The image models each have their classification heads removed and generally return the output of the final pooling layer within the model, flattened to a [batch_size, n_features] tensor, where n_features is an attribute of the model (see the sketch at the end of the VGG section below).
VGG¶
The VGG family of models are image classification models designed for the ImageNet dataset. They are usually used as a baseline in image classification tasks; however, they are considerably larger - in terms of the number of parameters - than modern architectures.
Important arguments:
- name - the name of the VGG variant to use
- pretrained - if pre-trained weights for the VGG variant should be used
- freeze - if the parameters of the VGG variant should be frozen
Available variants are: vgg11, vgg13, vgg16, vgg19, vgg11_bn, vgg13_bn, vgg16_bn, vgg19_bn. The number generally indicates the number of layers in the model; higher does not always mean better. The _bn suffix means that the VGG model uses batch normalization layers, which generally leads to better results.
The pre-trained weights are taken from the torchvision library and are weights from a model that has been trained as an image classifier on ImageNet. Ideally, this means the images should be 3-channel color images that are at least 224x224 pixels and should be normalized with:
from torchvision import transforms
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
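To make the earlier point about removed classification heads concrete, here is a hedged torchvision sketch (not the SDK's VGG module) that keeps only the convolutional features and pooling, then flattens to [batch_size, n_features]:
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg11_bn(pretrained=True)
# drop the classification head; keep the convolutional features and pooling
feature_extractor = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten())

images = torch.randn(4, 3, 224, 224)   # [batch_size, channels, height, width]
features = feature_extractor(images)   # [batch_size, n_features]
print(features.shape)                  # torch.Size([4, 25088]) for vgg11_bn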
EfficientNet¶
EfficientNet is a family of convolutional neural network based models that are designed to be more efficient - in terms of the number of parameters and FLOPS - than previous computer vision models whilst maintaining equivalent image classification performance.
Important arguments:
- name - the name of the EfficientNet variant to use
- pretrained - if pre-trained weights for the EfficientNet variant should be used
- freeze - if the parameters of the EfficientNet variant should be frozen
Available variants are: efficientnet_b0, efficientnet_b1, …, efficientnet_b7, with b0 having the fewest parameters and b7 having the most.
The pre-trained weights are taken from the timm library and have been trained on ImageNet, thus the same tips, i.e. normalization, that apply to the VGG models also apply here.
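As a hedged sketch using the timm library directly (not the SDK's EfficientNet module), passing num_classes=0 removes the classification head so only pooled features are returned:
import timm
import torch

model = timm.create_model('efficientnet_b0', pretrained=True, num_classes=0)

images = torch.randn(4, 3, 224, 224)
features = model(images)   # [batch_size, n_features]
print(features.shape)      # torch.Size([4, 1280]) for efficientnet_b0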
Loading Pre-trained Modules¶
All modules - not classifiers - can load parameters from a saved state using the load argument, which can either be a path to a saved PyTorch nn.Module state_dict or the state_dict itself, e.g.:
import torch
from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel
# load a project
project = Project(id_=YOUR_PROJECT_ID)
# load the label classifier text module parameters from a given path
label_classifier_config = {'dropout_rate': 0.25,
                           'text_module': {'name': 'lstm',
                                           'n_layers': 2,
                                           'bidirectional': False,
                                           'load': 'saved_modules/label_text_module.pt'}}
# load a saved state dict from a path
section_text_module_state_dict = torch.load('saved_modules/section_text_module.pt')
# load the section classifier text module parameters from a state_dict directly
section_classifier_config = {'dropout_rate': 0.5,
                             'text_module': {'name': 'nbow',
                                             'emb_dim': 64,
                                             'load': section_text_module_state_dict}}
# create label section model with classifiers that contain pre-trained text modules
model = LabelSectionModel(project,
                          label_classifier_config=label_classifier_config,
                          section_classifier_config=section_classifier_config)
Extraction¶
Each model has an extract method that gets the predictions from that model.
OCR¶
The ability to do OCR tasks is bundled into the FileScanner class. The FileScanner supports multiple OCR solutions and takes text embeddings into account.
The following example runs OCR on a PDF with the default settings.
from konfuzio.ocr import FileScanner
path = 'example.pdf' # Path to a pdf or image file
with FileScanner(path) as f:
    document_text: str = f.ocr()
In a first step, the FileScanner checks whether the file has text embeddings and whether it is likely that the detected text embeddings cover the whole document. This is done by checking the frequency of specific characters like ‘e’, the ratio of ASCII characters, and the number of characters on the pages and in the overall document.
If it is likely that some characters are missing from the embeddings, the OCR process is started. The OCR text is then returned, unless the number of OCR characters is less than the number of text embedding characters, in which case the text embeddings are used.
The default OCR process is based on Tesseract with presets for images and scans. If the document contains some text embeddings, the scan preset is always used. If there are no text embeddings present, the FileScanner uses a blurriness score to decide which preset should be used.
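The exact checks live inside the FileScanner; the sketch below only illustrates the kind of heuristic described above, with a made-up helper name and made-up threshold values:
def embeddings_look_complete(text: str,
                             min_e_ratio: float = 0.05,
                             min_ascii_ratio: float = 0.9,
                             min_chars_per_page: int = 200,
                             n_pages: int = 1) -> bool:
    # hypothetical helper, not part of the SDK: checks whether extracted text
    # embeddings plausibly cover the whole document
    if not text:
        return False
    e_ratio = text.lower().count('e') / len(text)
    ascii_ratio = sum(character.isascii() for character in text) / len(text)
    enough_text = len(text) >= min_chars_per_page * n_pages
    return e_ratio >= min_e_ratio and ascii_ratio >= min_ascii_ratio and enough_text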
OCR with the Azure Read API¶
In order to use the Azure Read API, you need to set the credentials of an appropriate Azure account via environment variables or the .env file.
AZURE_OCR_BASE_URL = https://****.api.cognitive.microsoft.com
AZURE_OCR_KEY = **********************
from konfuzio.ocr import FileScanner
path = 'example.pdf' # Path to a pdf or image file
with FileScanner(path, ocr_method='read_v3_fast') as f:
    document_text: str = f.ocr()
The way text embeddings are used does not differ from the default OCR; however, no blurriness score is calculated for the Azure Read API.
The Azure Read API has some limitations regarding file size, page numbers, and rate limits. These limits are updated over time and can be found here: https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text
Additional FileScanner OCR results¶
The FileScanner provides the OCR results as text, bounding boxes, and a sandwich PDF (a PDF with text embeddings).
from konfuzio.ocr import FileScanner
with FileScanner('example.jpeg') as f_scanner:
    f_scanner.ocr()
    f_scanner.text  # str: string representation of the document or image
    f_scanner.bbox  # dict: bounding boxes on a character level
    f_scanner.sandwich_file  # BytesIO: when using Azure you need to pass 'read_v3' as ocr_method to get the sandwich file
    f_scanner.is_blurry_image  # boolean: whether the image was blurry (only set for default OCR)
    f_scanner.used_ocr_method  # str: the OCR method used
Further usages of the FileScanner¶
- use_text_embedding_only: in order to rely on text embeddings only, you can pass use_text_embedding_only=True to the ocr() method call.
- file: a file-like object can be used to initialize the FileScanner (instead of the path argument).
LabelSectionModel extraction¶
from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel
# load the project
project = Project(id_=YOUR_PROJECT_ID)
# create a default label section model from the project
model = LabelSectionModel(project)
# build (i.e. train) the label section model
label_classifier_metrics, section_classifier_metrics = model.build()
# save the label section model
model.save('saved_label_section_model.pt')
# ... later on in another file
from konfuzio.default_models import load_label_section_model
# load the saved section label model
model = load_label_section_model('saved_label_section_model.pt')
pdf_texts: List[str] = [pdf1_text, pdf2_text, ...] # the text of each pdf, extracted via ocr
# list of extraction results for each document
results = [model.extract(pdf_text) for pdf_text in pdf_texts]
Each element of results is a Dict[str, Union[List[Dict[str, pd.DataFrame]], pd.DataFrame]], where the keys are the label and section names.
If the key is a label, the value is a DataFrame with columns Label, Accuracy, Candidate, Translated Candidate, Start, and End. Label is the name of the label, Accuracy is the confidence, Candidate and Translated Candidate are the actual token string, and Start and End are the start and end offsets (number of characters from the beginning of the document).
If the key is a section, the value is a list with one element for each detected instance of that section. Each element is a dictionary with the same format as the results dictionary - i.e. keys are either labels or sections, values are DataFrames or lists of dictionaries - with information about the labels and sections within that section. This recursive format allows nested sections.
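As a hedged sketch of how this nested structure can be traversed (assuming the layout described above and continuing from the results list in the example):
import pandas as pd

def walk_results(extraction: dict, depth: int = 0) -> None:
    # recursively print extracted labels and sections
    for name, value in extraction.items():
        if isinstance(value, pd.DataFrame):
            # a label: one row per extracted candidate
            print('  ' * depth + f"label '{name}': {len(value)} candidate(s)")
        else:
            # a section: a list with one dict per detected instance
            for instance in value:
                print('  ' * depth + f"section '{name}':")
                walk_results(instance, depth + 1)

for document_result in results:
    walk_results(document_result)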
DocumentModel extraction¶
from konfuzio.data import Project
from konfuzio.default_models import DocumentModel
# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]
# create default document model from a list of projects
model = DocumentModel(projects)
# build (i.e. train) the document model
document_classifier_metrics = model.build()
# save the document model
model.save('saved_document_model.pt')
# ... later on in another file
from konfuzio.default_models import load_document_model
# load the saved document model
model = load_document_model('saved_document_model.pt')
pdf_paths: List[str] = ['data/1.pdf', 'data/2.pdf', ...] # path to the pdf file
pdf_texts: List[str] = [pdf1_text, pdf2_text, ...] # the text of each pdf, extracted via ocr
# list of extraction results for each document
results = [model.extract(pdf_path, pdf_text) for (pdf_path, pdf_text) in zip(pdf_paths, pdf_texts)]
Each element of results is a Tuple[Tuple[str, float], pandas.DataFrame]. The first element of the inner tuple is the predicted label as a string and the second element is the confidence of that prediction, e.g. ('insurance_contract', 0.6). The DataFrame has a category and a confidence column: category is the predicted label as a string for each class and confidence is the confidence of each of the predictions.
Note: when a PDF has multiple pages, the DocumentModel makes a prediction on each page individually and then averages the predictions together.
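Assuming the tuple layout described above, the results can be read like this (a sketch continuing from the example, not SDK code):
for (predicted_category, confidence), per_category_df in results:
    print(f'predicted category: {predicted_category} (confidence {confidence:.2f})')
    # one row per possible category with its confidence
    print(per_category_df.sort_values('confidence', ascending=False))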