The functionalities of the Trainer module are not yet available in the SDK.

LabelSectionModel Examples

LabelSectionModel diagram

A LabelSectionModel is a model that takes in a Document and predicts a Label per token and a SectionLabel for each line in the document.

Training our first LabelSectionModel

A LabelSectionModel contains both a LabelClassifier and a SectionClassifier. Both classifiers have set default modules, and training is performed with a set of default hyperparameters. The build method returns metrics for each classifier: a dictionary of lists with the loss and accuracy values per batch for training/evaluation, which can be used, e.g., for visualization.

from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel

# load the project
project = Project(id_=YOUR_PROJECT_ID)

# create a default label section model from the project
model = LabelSectionModel(project)

# build (i.e. train) the label section model
label_classifier_metrics, section_classifier_metrics = model.build()

# save the trained label section model
model.save()

Customizing the LabelSectionModel

We can also control the tokenization method and the hyperparameters of the LabelSectionModel. We change the tokenization from the default whitespace tokenization to BPE (byte-pair encoding) tokenization. The LabelClassifier now has a dropout of 0.25 and contains a 2-layer unidirectional LSTM module. The SectionClassifier now has a dropout of 0.5 and contains an NBOW module with a 64-dimensional embedding layer. Training both classifiers is still performed with the default training hyperparameters.

from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel
from konfuzio.tokenizers import BPETokenizer

project = Project(id_=YOUR_PROJECT_ID)

# specify a different tokenizer
tokenizer = BPETokenizer()

# configuration dict for the label classifier
label_classifier_config = {'dropout_rate': 0.25,
                           'text_module': {'name': 'lstm',
                                           'n_layers': 2,
                                           'bidirectional': False}}

# configuration dict for the section classifier
section_classifier_config = {'dropout_rate': 0.5,
                             'text_module': {'name': 'nbow',
                                             'emb_dim': 64,}}

# create label section model with chosen tokenizer and classifier configs
model = LabelSectionModel(project,
                          tokenizer=tokenizer,
                          label_classifier_config=label_classifier_config,
                          section_classifier_config=section_classifier_config)

# build the label section model with default training hyperparameters
model.build()

# save the trained label section model
model.save()

Setting the LabelSectionModel training hyperparameters

We’ll now use the default classifier hyperparameters but customize the training hyperparameters. Examples of hyperparameters that can be changed: validation ratio, batch size, number of epochs, patience, optimizer, learning rate decay.

from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel

# load the project
project = Project(id_=YOUR_PROJECT_ID)

# create a default label section model from the project
model = LabelSectionModel(project)

label_training_config = {'valid_ratio': 0.15,  # what percentage of training data should be used to create validation data
                         'batch_size': 64,  # size of batches used for training
                         'seq_len': 100,  # number of sequential tokens to predict over
                         'n_epochs': 100,  # number of epochs to train for
                         'patience': 0,  # number of epochs without improvement in validation loss before we stop training
                         'optimizer': {'name': 'Adam', 'lr': 1e-5}}  # optimizer hyperparameters

section_training_config = {'valid_ratio': 0.2,
                           'batch_size': 128,
                           'max_len': 100,  # maximum tokens per line to consider
                           'n_epochs': 50,
                           'patience': 3,
                           'optimizer': {'name': 'RMSprop', 'lr': 1e-3, 'momentum': 0.9},
                           'lr_decay': 0.9}  # if validation loss does not improve, multiply learning rate by this value

# build model with training configs
model.build(label_training_config=label_training_config,
            section_training_config=section_training_config)

# save the trained label section model
model.save()

Customizing the LabelSectionModel model and training hyperparameters

Here we customize both the LabelSectionModel hyperparameters and the training hyperparameters. This example combines the two examples above.

from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel
from konfuzio.tokenizers import BPETokenizer

# load the project
project = Project(id_=YOUR_PROJECT_ID)

# specify a different tokenizer
tokenizer = BPETokenizer()

# configuration dict for the label classifier
label_classifier_config = {'dropout_rate': 0.25,
                           'text_module': {'name': 'lstm',
                                           'n_layers': 2,
                                           'bidirectional': False}}

# configuration dict for the section classifier
section_classifier_config = {'dropout_rate': 0.5,
                             'text_module': {'name': 'nbow',
                                             'emb_dim': 64,}}

# create label section model with chosen tokenizer and classifier configs
model = LabelSectionModel(project,
                          tokenizer=tokenizer,
                          label_classifier_config=label_classifier_config,
                          section_classifier_config=section_classifier_config)

label_training_config = {'valid_ratio': 0.15,  # what percentage of training data should be used to create validation data
                         'batch_size': 64,  # size of batches used for training
                         'seq_len': 100,  # number of sequential tokens to predict over
                         'n_epochs': 100,  # number of epochs to train for
                         'patience': 0,  # number of epochs without improvement in validation loss before we stop training
                         'optimizer': {'name': 'Adam', 'lr': 1e-5}}  # optimizer hyperparameters

section_training_config = {'valid_ratio': 0.2,
                           'batch_size': 128,
                           'max_len': 100,  # maximum tokens per line to consider
                           'n_epochs': 50,
                           'patience': 3,
                           'optimizer': {'name': 'RMSprop', 'lr': 1e-3, 'momentum': 0.9},
                           'lr_decay': 0.9}  # if validation loss does not improve, multiply learning rate by this value

# build model with training configs
model.build(label_training_config=label_training_config,
            section_training_config=section_training_config)

# save the trained label section model
model.save()

Implementing a custom LabelSectionModel training loop

The build method of LabelSectionModel calls self.build_label_classifier and self.build_section_classifier, both of which call a generic self.fit_classifier for the label and section classifiers. If we want to customize the fit_classifier method - e.g. change the way the classifiers are trained based on some specific criteria, such as using a custom loss function - then we can override it. The custom fit_classifier method should take in the train, valid and test examples, the classifier, and any configuration arguments.

from typing import Dict, List

from torch.utils.data import DataLoader

from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel

class CustomerSpecificModel(LabelSectionModel):
    def fit_classifier(self, train_examples: DataLoader, valid_examples: DataLoader, test_examples: DataLoader, classifier: Classifier, **kwargs) -> Dict[str, List[float]]:
        # new code must take in train/valid/test examples, the classifier model, and any configuration arguments supplied as kwargs from the config dict
        # custom code goes here
        return metrics

# load the project
project = Project(id_=YOUR_PROJECT_ID)

# can also use a custom tokenizer and model config here
model = CustomerSpecificModel(project)

# example custom configuration dicts
label_training_config = {'custom_loss_function_arg': 123}
section_training_config = {'custom_loss_function_arg': 999}

# build model with training configs
label_classifier_metrics, section_classifier_metrics = model.build(label_training_config,
                                                                   section_training_config)

# save the trained label section model
model.save()

Implementing a custom LabelSectionModel classifier training loop

By default, both classifiers use the same generic fit_classifier function. If we want each to have its own custom fit_classifier function, we can do so by overriding the build_label_classifier/build_section_classifier functions and implementing a custom fit_label_classifier/fit_section_classifier function within them.

We can use the existing functions to get the data iterators and then use our custom fit_label_classifier/fit_section_classifier functions in place of the generic fit_classifier function.

We could also customize the format of the training data by writing our own get_label_classifier_iterators/get_section_classifier_iterators functions. These must return a PyTorch DataLoader for the training, validation, and test sets, and must be compatible with the custom fit_label_classifier/fit_section_classifier functions.

Below is an example of how to implement our own fit_label_classifier function.

from typing import Dict, List

from torch.utils.data import DataLoader

from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel
from konfuzio.default_models.utils import get_label_classifier_iterators
from konfuzio.default_models.utils import get_section_classifier_iterators

class CustomerSpecificModel(LabelSectionModel):
    def fit_label_classifier(self, train_examples: DataLoader, valid_examples: DataLoader, test_examples: DataLoader, classifier: Classifier, **kwargs) -> Dict[str, List[float]]:
        # custom code goes here.
        return metrics

    def build_label_classifier(self, label_training_config: dict = {}) -> Dict[str, List[float]]:

        # get the iterators over examples for the label classifier
        examples = get_label_classifier_iterators(self.projects,
                                                  self.tokenizer,
                                                  self.text_vocab,
                                                  self.label_vocab,
                                                  **label_training_config)

        # unpack the examples
        train_examples, valid_examples, test_examples = examples

        # place label classifier on device
        self.label_classifier = self.label_classifier.to(self.device)

        # now uses our custom fit_label_classifier instead of generic fit_classifier
        label_classifier_metrics = self.fit_label_classifier(train_examples,
                                                             valid_examples,
                                                             test_examples,
                                                             self.label_classifier,
                                                             **label_training_config)

        # put label classifier back on cpu to free up GPU memory
        self.label_classifier = self.label_classifier.to('cpu')

        return label_classifier_metrics

# load the project
project = Project(id_=YOUR_PROJECT_ID)

# define model with custom build_label_classifier function
model = CustomerSpecificModel(project)

# specify any custom training hyperparameters
label_training_config = {'custom_loss_function_arg': 123}

# build model with the custom label classifier training config
label_classifier_metrics, section_classifier_metrics = model.build(label_training_config=label_training_config)

# save the trained model
model.save()

DocumentModel Examples

DocumentModel Diagram

A DocumentModel takes pages as input and predicts the “category” (the project ID) for that page. It can use both image features (from a .png image of the page) and text features (from the OCR text from the page).

Training our first DocumentModel

A DocumentModel contains a DocumentClassifier. Similar to the LabelSectionModel, it has a set of default model hyperparameters and default training hyperparameters.

from konfuzio.data import Project
from konfuzio.default_models import DocumentModel

# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]

# create default document model from a list of projects
model = DocumentModel(projects)

# build (i.e. train) the document model
document_classifier_metrics = model.build()

# save the document model
model.save()

Customizing the DocumentModel

Below we show how to implement a custom tokenizer, image preprocessing, image augmentation, and classifier. Image preprocessing is applied to images for training, evaluation, and inference. Image augmentation is only applied during training. We only need a multimodal_module when we use both text and image modules.

from konfuzio.data import Project
from konfuzio.default_models import DocumentModel
from konfuzio.tokenizers import BPETokenizer

# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]

# specify a custom tokenizer
tokenizer = BPETokenizer()

# specify how images should be preprocessed
image_preprocessing = {'target_size': (1000, 1000),
                       'grayscale': True}

# specify how images should be augmented during training
image_augmentation = {'rotate': 5}

# configuration dict for the document classifier
document_classifier_config = {'image_module': {'name': 'efficientnet_b0',
                                               'freeze': False},
                              'text_module': {'name': 'lstm',
                                              'n_layers': 2},
                              'multimodal_module': {'name': 'concatenate',
                                                    'hid_dim': 512}}

# create document model with chosen tokenizer and classifier configs
model = DocumentModel(projects,
                      tokenizer=tokenizer,
                      image_preprocessing=image_preprocessing,
                      image_augmentation=image_augmentation,
                      document_classifier_config=document_classifier_config)


# build (i.e. train) the document model
document_classifier_metrics = model.build()

# save the document model
model.save()

Classifying documents using image features only

To use a DocumentModel that only uses the image of the document, simply do not include a text_module or multimodal_module in the classifier config. Passing a tokenizer when we have no text_module will throw an error, as there is no text to tokenize, so we make sure to pass None as the tokenizer argument like so:

from konfuzio.data import Project
from konfuzio.default_models import DocumentModel

# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]

# need to ensure we do not use a tokenizer as we are not using a text_module
tokenizer = None

# specify how images should be preprocessed
image_preprocessing = {'target_size': (1000, 1000),
                       'grayscale': True}

# specify how images should be augmented during training
image_augmentation = {'rotate': 5}

# configuration dict for the document classifier
# no text_module AND no multimodal_module
document_classifier_config = {'image_module': {'name': 'efficientnet_b0',
                                               'freeze': False}}

# create document model with chosen tokenizer and classifier configs
model = DocumentModel(projects,
                      tokenizer=tokenizer,
                      image_preprocessing=image_preprocessing,
                      image_augmentation=image_augmentation,
                      document_classifier_config=document_classifier_config)

# build (i.e. train) the document model
document_classifier_metrics = model.build()

# save the document model
model.save()

Classifying documents using text features only

To use a DocumentModel that only uses the text of the document, simply do not include an image_module or multimodal_module in the classifier config. Passing an image_preprocessing or image_augmentation argument when we have no image_module will throw an error, so we need to ensure we pass None to the image_preprocessing and image_augmentation arguments like so:

from konfuzio.data import Project
from konfuzio.default_models import DocumentModel
from konfuzio.tokenizers import BPETokenizer

# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]

tokenizer = BPETokenizer()

# both should be None when not using an image_module
image_preprocessing = None
image_augmentation = None

# configuration dict for the document classifier
# no image_module AND no multimodal_module
document_classifier_config = {'text_module': {'name': 'lstm',
                                              'n_layers': 2}}

# create document model with chosen tokenizer and classifier configs
model = DocumentModel(projects,
                      tokenizer=tokenizer,
                      image_preprocessing=image_preprocessing,
                      image_augmentation=image_augmentation,
                      document_classifier_config=document_classifier_config)

# build (i.e. train) the document model
document_classifier_metrics = model.build()

# save the document model
model.save()

Setting the DocumentModel training hyperparameters

Similar to the LabelSectionModel, we can customize the training config, which will work with ANY classifier/module combination:

from konfuzio.data import Project
from konfuzio.default_models import DocumentModel

# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]

# create a default document model
model = DocumentModel(projects)

# define the custom training hyperparameters
document_training_config = {'valid_ratio': 0.2,
                            'batch_size': 128,
                            'max_len': 100,  # maximum tokens per page to consider, will do nothing if no text_module used
                            'n_epochs': 50,
                            'patience': 3,
                            'optimizer': {'name': 'RMSprop', 'lr': 1e-3, 'momentum': 0.9},
                            'lr_decay': 0.9}

# build (i.e. train) the document model with custom training hyperparameters
document_classifier_metrics = model.build(document_training_config=document_training_config)

# save the document model
model.save()

Implementing a custom DocumentModel training loop

We can also override the fit_classifier method to define our own method of training the document classifier. We can do a similar thing by overriding build to define our own custom data processing.

from typing import Dict

from torch.utils.data import DataLoader

from konfuzio.data import Project
from konfuzio.default_models import DocumentModel

class CustomerSpecificModel(DocumentModel):
    def fit_classifier(self, train_examples: DataLoader, valid_examples: DataLoader, test_examples: DataLoader, classifier: Classifier, **kwargs) -> Dict[str, float]:
        # new code must take in train/valid/test examples, the classifier model, and kwargs from the config dict
        # custom code goes here
        return metrics

# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]

# can also use a custom tokenizer and model config here
model = CustomerSpecificModel(projects)

# custom fit_classifier config
custom_document_training_config = {'custom_loss_function_arg': 123}

# train document model with custom fit_classifier function
custom_document_classifier_metrics = model.build(document_training_config=custom_document_training_config)

# save a trained model
model.save()

ParagraphModel Examples

ParagraphModel Diagram

A ParagraphModel takes the text of the pages as input and predicts the “category” for each paragraph in that text. It uses text features (from the OCR text from the page).

Training our first ParagraphModel

A ParagraphModel contains a ParagraphClassifier. Similar to the DocumentModel, it has a set of default model hyperparameters and default training hyperparameters.

from konfuzio.data import Project
from konfuzio.default_models import ParagraphModel

# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]

# create default paragraph model from a list of projects
model = ParagraphModel(projects)

# build (i.e. train) the paragraph model
paragraph_classifier_metrics = model.build()

# save the paragraph model
model.save()

Customizing the ParagraphModel

Below we show how to implement a custom tokenizer and classifier. We only need a multimodal_module when we use both text and image modules.

from konfuzio.data import Project
from konfuzio.default_models import ParagraphModel
from konfuzio.tokenizers import BPETokenizer

# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]

# specify a custom tokenizer
tokenizer = BPETokenizer()

# configuration dict for the paragraph classifier
paragraph_classifier_config = {'text_module': {'name': 'lstm',
                                               'n_layers': 2}}

# create paragraph model with chosen tokenizer and classifier configs
model = ParagraphModel(projects,
                       tokenizer=tokenizer,
                       paragraph_classifier_config=paragraph_classifier_config)

# build (i.e. train) the paragraph model
paragraph_classifier_metrics = model.build()

# save the paragraph model
model.save()

Setting the ParagraphModel training hyperparameters

Similar to the DocumentModel, we can customize the training config, which will work with ANY classifier/module combination:

from konfuzio.data import Project
from konfuzio.default_models import ParagraphModel

# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]

# create a default paragraph model
model = ParagraphModel(projects)

# define the custom training hyperparameters
paragraph_training_config = {'valid_ratio': 0.2,
                             'batch_size': 128,
                             'max_len': 100,  # maximum tokens per page to consider, will do nothing if no text_module used
                             'n_epochs': 50,
                             'patience': 3,
                             'optimizer': {'name': 'RMSprop', 'lr': 1e-3, 'momentum': 0.9},
                             'lr_decay': 0.9,  # if validation loss does not improve, multiply learning rate by this value
                             'no_label_limit': 10}

# build (i.e. train) the paragraph model with custom training hyperparameters
paragraph_classifier_metrics = model.build(paragraph_training_config=paragraph_training_config)

# save the paragraph model
model.save()

Design Philosophy

  • A “model” encapsulates everything we need for training on a labeled dataset and then performing extraction (inference) on some real data.

  • Models contain “classifiers”, one for each task the model is performing. The LabelSectionModel has a label classifier for POS tagging of tokens within a document and a section classifier for labeling sections of a document. The DocumentModel has a document classifier for predicting the category of each page within a document. Each classifier is a PyTorch nn.Module.

  • Each classifier is made up of “modules”. These are the main core of the classifier and are usually some kind of neural network architecture. Modules are split into three different categories: image modules (e.g. VGG and EfficientNet), text (e.g. NBOW, LSTM, and Transformer), and multimodal (e.g. concatenation of image and text features). Each module is also a PyTorch nn.Module.

  • Modules can contain other modules that can contain other modules - it’s modules all the way down!

  • The modules are designed to be agnostic to the classifier they are within, with the actual task-specific layers contained in the classifier itself. Every text module takes in a sequence of tokens and outputs a sequence of tensors. Every image module takes in an image and outputs a tensor of features. (A minimal sketch of this split follows the list below.)

    • Let’s look at an example of a label classifier containing an LSTM module. The classifier takes in text, feeds it to the LSTM module which then calculates a hidden state per token within the text, and then returns this sequence of hidden states to the classifier which passes them through a linear layer to re-size them to the desired output dimensionality.

    • For a section classifier with an LSTM module: the classifier takes in the text, feeds it to the LSTM module which then returns the hidden states to the classifier, the classifier pools the hidden states and then passes them through a linear layer to make a prediction.

    • The document classifier with only a text module is the same as the section classifier. A document classifier with both a text and image module will calculate the text features and the image features, then pass both to a multimodal module that combines the two, performs some calculations to get multimodal features, and then passes these multimodal features to the classifier which makes a prediction.
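
The classifier/module split described above can be illustrated with a minimal PyTorch sketch. This is a simplified illustration, not the SDK's actual classes; the class names, layer sizes, and constructor arguments here are assumptions.

import torch
import torch.nn as nn

class ToyLSTMTextModule(nn.Module):
    """Toy text module: embeds tokens and returns one hidden state per token."""
    def __init__(self, vocab_size: int, emb_dim: int = 64, hid_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.n_features = hid_dim  # size of each output hidden state

    def forward(self, tokens: torch.LongTensor) -> torch.Tensor:
        # tokens: [batch_size, seq_len] -> hidden states: [batch_size, seq_len, n_features]
        embedded = self.embedding(tokens)
        hidden, _ = self.lstm(embedded)
        return hidden

class ToyLabelClassifier(nn.Module):
    """Toy label classifier: a task-specific linear layer on top of any text module."""
    def __init__(self, text_module: nn.Module, n_labels: int):
        super().__init__()
        self.text_module = text_module
        self.fc = nn.Linear(text_module.n_features, n_labels)

    def forward(self, tokens: torch.LongTensor) -> torch.Tensor:
        hidden = self.text_module(tokens)  # one vector per token, module is task-agnostic
        return self.fc(hidden)             # one prediction per token, task-specific layer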

Models are the glue holding everything together in a nice package. The most important model attributes are:

  • tokenizer - a tokenizer that tokenizes text by converting a string to a list of strings.

  • Vocabularies, which contain a mapping between a token (string) to an index (int), and vice versa. The types of vocabularies are:

    • text_vocab - the vocabulary for the document text

    • label_vocab - the vocabulary for the annotation labels (only if using a label classifier)

    • section_vocab - the vocabulary for the section labels (only if using a section classifier)

    • category_vocab - the vocabulary for the document categories (only if using a document classifier)

  • Classifiers, which are specific to the desired task. Each classifier contains module(s). The types of classifiers are:

    • label_classifier - an instance of a LabelClassifier used to predict the annotation label for each token (only used in a LabelSectionModel)

    • section_classifier - an instance of a SectionClassifier used to predict the section label for each line in the document (only in a LabelSectionModel)

    • document_classifier - used to predict the category of each page in the document (only in DocumentModel) and can be one of three types:

      • a DocumentTextClassifier which classifies a document’s page using only the text on that page

      • a DocumentImageClassifier which classifies a document’s page using only an image of the page

      • a DocumentMultiModalClassifier which classifies a document’s page using both the text and image of that page

  • Configuration dictionaries, with one configuration dictionary per classifier used in the model, e.g. label_classifier_config. They are used to define the hyperparameters of the classifier and also to load the correct classifier when loading the model. Training configuration dictionaries are not saved as a model attribute, as the model should operate independently of how it was trained.

  • image_preprocessing - a dictionary that states how images should be pre-processed before being classified (only in DocumentModel)

  • image_augmentation - a dictionary that states how images should be augmented during the training of the classifier (only in DocumentModel)

Tokenizers

A tokenizer is a function that defines how a string should be separated into tokens (a list of strings), e.g.:

tokenizer.get_tokens('hello world') -> ['hello', 'world']

The konfuzio.tokenizers module contains a few tokenizers which can either be directly imported, i.e. from konfuzio.tokenizers import WhitespaceTokenizer or obtained using the get_tokenizer function, i.e.:

from konfuzio.tokenizers import get_tokenizer
tokenizer = get_tokenizer('whitespace')  # gets a WhitespaceTokenizer
from konfuzio.tokenizers import WhitespaceTokenizer
tokenizer = WhitespaceTokenizer()  # same as above

All tokenizers have the following methods:

  • get_tokens, takes in a string and returns a list of tokens (strings)

  • get_entities, same as get_tokens but also contains the start and end character offsets for each token (represented as a list of dicts). This is usually slower than get_tokens, so should only be used if we explicitly need the character offsets.

  • get_annotations, same as get_entities but converts each entity to an Annotation object and returns a list of annotations.
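
For example, a small sketch of the difference between get_tokens and get_entities (the exact dictionary keys returned by get_entities are an assumption):

from konfuzio.tokenizers import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()

tokenizer.get_tokens('hello world')
# -> ['hello', 'world']

tokenizer.get_entities('hello world')
# -> something like [{'offset_string': 'hello', 'start_offset': 0, 'end_offset': 5},
#                    {'offset_string': 'world', 'start_offset': 6, 'end_offset': 11}]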

Currently available tokenizers:

WhitespaceTokenizer

from konfuzio.tokenizers import WhitespaceTokenizer
from konfuzio.tokenizers import get_tokenizer
tokenizer = get_tokenizer('whitespace')

Uses regular expressions to split a string based on whitespace. Very fast but naive method of tokenization.

SpacyTokenizer

from konfuzio.tokenizers import SpacyTokenizer
from konfuzio.tokenizers import get_tokenizer
tokenizer = get_tokenizer('spacy')

Tokenizes using the de_core_news_sm spaCy model. Relatively slow.

PhraseMatcherTokenizer

from konfuzio.data import Project
from konfuzio.tokenizers import PhraseMatcherTokenizer
from konfuzio.tokenizers import get_tokenizer
project = Project(id_=YOUR_PROJECT_ID)
tokenizer = get_tokenizer('phrasematcher', project)

Note that this tokenizer also has to take in a Project, or a list of Projects, to build the PhraseMatcher.

This builds a spaCy de_core_news_sm phrase matcher using the annotation labels for each project and then uses this learned matching to tokenize data. We can think of this as learning a simple regex pattern matcher from the data. This is relatively slow, especially when the dataset is large.

BPETokenizer

from konfuzio.tokenizers import BPETokenizer
from konfuzio.tokenizers import get_tokenizer
tokenizer = get_tokenizer('bert-base-german-cased')

Gets a pre-trained byte-pair encoding tokenizer from the HuggingFace Transformers library. Officially we support four different variants of the BPETokenizer (other variants should work, however they are not tested):

tokenizer = BPETokenizer('bert-base-german-cased')
tokenizer = BPETokenizer('bert-base-german-dbmdz-cased')
tokenizer = BPETokenizer('bert-base-german-dbmdz-uncased')
tokenizer = BPETokenizer('distilbert-base-german-cased')

By default, BPETokenizer gets the bert-base-german-cased variant.

# both of these get the same tokenizer
tokenizer = BPETokenizer()
tokenizer = BPETokenizer('bert-base-german-cased')

These tokenizers are special as they have their own custom vocabulary - accessed via tokenizer.vocab - because they are designed to be used with a pre-trained Transformer model that must use the vocabulary it was trained with. Initializing a model with a BPETokenizer automatically sets the text_vocab to the tokenizer.vocab. These tokenizers still perform very well with non-Transformer models and are also the fastest tokenizers.

Vocabularies

A vocabulary is a mapping between tokens and integers.

A vocab is usually initialized with a collections.Counter - a dictionary where the keys are the tokens and the values are how many times that token appears in the training set. It can also be initialized with a dict whose keys are the tokens and whose values are the integer representations; this is used to create a Vocab object from an existing vocabulary already represented as a dictionary.

Two vocab arguments are max_size and min_freq. A max_size of 30,000 means that only the 30,000 most common tokens are used to create the vocabulary. A min_freq of 2 means that only tokens that appear at least twice are used to create the vocabulary.

Two other arguments are unk_token and pad_token. When converting a token to an integer and the token is NOT in the vocabulary, the token is replaced by the unk_token. If the unk_token is set to None, the vocabulary will throw an error when it tries to convert a token that is not in the vocabulary to an integer - this is usually used when creating a vocabulary over the labels. A pad_token is a token used for padding sequences, effectively a no-op, and can also be None. If the unk_token and pad_token are not None, the vocab will also have unk_idx and pad_idx attributes, which give the integer values of the unk_token and pad_token - this is more for convenience than anything else.

The final argument is special_tokens, a list of tokens that are guaranteed to appear in the vocabulary.

To convert from a token to an integer, use the stoi (string to int) method, i.e. vocab.stoi('hello'). To convert from an integer to a token, use the itos (int to string) method, i.e. vocab.itos(123).

We can get the list of all the tokens within the vocab with vocab.get_tokens, and a list of all integers with vocab.get_indexes.
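
As a rough sketch of how these pieces fit together (the import path and constructor signature of Vocab are assumptions; only the arguments and the stoi/itos methods described above are taken from this document):

from collections import Counter

# NOTE: the import path below is an assumption; adjust to the SDK's actual Vocab location
from konfuzio.vocab import Vocab

# count token frequencies over the training data (toy example)
counter = Counter(['hello', 'world', 'hello', 'hello', 'foo'])

vocab = Vocab(counter,
              max_size=30_000,    # keep only the 30,000 most common tokens
              min_freq=2,         # drop tokens that appear fewer than 2 times
              unk_token='<unk>',  # replacement for out-of-vocabulary tokens
              pad_token='<pad>')  # token used for padding sequences

index = vocab.stoi('hello')  # string to int
token = vocab.itos(index)    # int back to string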

The LabelSectionModel and DocumentModel will create a text_vocab with the provided tokenizer unless:

  • a text_vocab is provided, at which point the model will use that vocab and not create one from the data. This is usually done when loading a trained model.

  • the tokenizer used has a vocab, in which case we also use that vocab and do not create one. This is usually the case when using a BPE tokenizer.

For the label_vocab, section_vocab, and category_vocab, one is created from the data unless one is provided. Again, a provided vocab usually means we are loading a saved model.

Text Modules

There are currently four text modules available. Each module takes a sequence of tokens as input and outputs a sequence of “hidden states”, i.e. one vector per input token. The size of each of the hidden states can be found with the module’s n_features parameter.

NBOW

The neural bag-of-words (NBOW) model is the simplest of the models: it simply passes each token through an embedding layer. As shown in the fastText paper, this model is still able to achieve performance comparable to some deep learning models whilst being considerably faster.

One downside of this model is that tokens are embedded without regard to the surrounding context in which they appear, e.g. the embedding for “May” in the two sentences “May I speak to you?” and “I am leaving on the 1st of May” is identical, even though the two have different semantics.

Important arguments:

  • emb_dim the dimensions of the embedding vector

  • dropout_rate the amount of dropout applied to the embedding vectors

NBOWSelfAttention

This is an NBOW model with a multi-headed self-attention layer, detailed here, added after the embedding layer. This effectively contextualizes the output, as each hidden state is now calculated from the embedding vector of a token and the embedding vectors of all other tokens within the sequence.

Important arguments:

  • emb_dim the dimensions of the embedding vector

  • dropout_rate the amount of dropout applied to the embedding vectors

  • n_heads the number of attention heads to use in the multi-headed self-attention layer. Note that n_heads must be a factor of emb_dim, i.e. emb_dim % n_heads == 0.
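
A hedged config sketch using the config-dict convention from the earlier examples (the module 'name' string is an assumption; the constraint emb_dim % n_heads == 0 is the important part):

# label classifier with an NBOWSelfAttention text module
label_classifier_config = {'dropout_rate': 0.25,
                           'text_module': {'name': 'nbowselfattention',  # assumed name string
                                           'emb_dim': 64,
                                           'n_heads': 8}}  # valid because 64 % 8 == 0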

LSTM

The LSTM (long short-term memory) is a variant of an RNN (recurrent neural network). It feeds the input tokens through an embedding layer and then processes them sequentially with the LSTM, outputting a hidden state for each token. If the LSTM is bi-directional then it trains a forward and backward LSTM per layer and concatenates the forward and backward hidden states for each token.

Important arguments:

  • emb_dim the dimensions of the embedding vector

  • hid_dim the dimensions of the hidden states

  • n_layers how many LSTM layers to use

  • bidirectional if the LSTM should be bidirectional

  • dropout_rate the amount of dropout applied to the embedding vectors and between LSTM layers if n_layers > 1

BERT

BERT (bi-directional encoder representations from Transformers) is a family of large Transformer models. The available BERT variants are all pre-trained models provided by the transformers library. It is usually infeasible to train a BERT model from scratch due to the significant amount of computation required. However, the pre-trained models can be easily fine-tuned on desired data.

Important arguments:

  • name the name of the pre-trained BERT variant to use

  • freeze should the BERT model be frozen, i.e. the pre-trained parameters are not updated

The BERT variants, i.e. name arguments, that are covered by internal tests are:

  • 'bert-base-german-cased'

  • 'bert-base-german-dbmdz-cased'

  • 'bert-base-german-dbmdz-uncased'

  • 'distilbert-base-german-cased'

In theory, all variants beginning with bert-base-* and distilbert-* should work out of the box. Other BERT variants come with no guarantees.
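
A hedged config sketch, assuming the BERT text module is selected via the same 'name'/'freeze' keys used by the other modules:

# label classifier with a frozen pre-trained BERT text module
label_classifier_config = {'text_module': {'name': 'bert-base-german-cased',
                                           'freeze': True}}  # keep the pre-trained parameters fixed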

Image Modules

We currently have two image modules available, each with several variants. The image models each have their classification heads removed and generally return the output of the final pooling layer within the model, flattened to a [batch_size, n_features] tensor, where n_features is an attribute of the model.

VGG

The VGG family of models are image classification models designed for the ImageNet dataset. They are usually used as a baseline in image classification tasks, however they are considerably larger - in terms of the number of parameters - than modern architectures.

Important arguments:

  • name the name of the VGG variant to use

  • pretrained if pre-trained weights for the VGG variant should be used

  • freeze if the parameters of the VGG variant should be frozen

Available variants are: vgg11, vgg13, vgg16, vgg19, vgg11_bn, vgg13_bn, vgg16_bn, vgg19_bn. The number generally indicates the number of layers in the model; higher does not always mean better. The _bn suffix means that the VGG model uses Batch Normalization layers, which generally leads to better results.

The pre-trained weights are taken from the torchvision library and are weights from a model that has been trained as an image classifier on ImageNet. Ideally, this means the images should be 3-channel color images that are at least 224x224 pixels and should be normalized with:

from torchvision import transforms
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

EfficientNet

EfficientNet is a family of convolutional neural network based models that are designed to be more efficient - in terms of the number of parameters and FLOPS - than previous computer vision models whilst maintaining equivalent image classification performance.

Important arguments:

  • name the name of the EfficientNet variant to use

  • pretrained if pre-trained weights for the EfficientNet variant should be used

  • freeze if the parameters of the EfficientNet variant should be frozen

Available variants are: efficientnet_b0, efficientnet_b1, …, efficientnet_b7, with b0 having the fewest parameters and b7 having the most.

The pre-trained weights are taken from the timm library and have been trained on ImageNet, thus the same tips, i.e. normalization, that apply to the VGG models also apply here.
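
As a sketch, an image-only classifier config selecting a mid-sized variant (the 'name' and 'freeze' keys appear in the earlier examples; passing 'pretrained' through the config dict is an assumption based on the argument list above):

# document classifier with a pre-trained, frozen EfficientNet image module
document_classifier_config = {'image_module': {'name': 'efficientnet_b3',
                                               'pretrained': True,
                                               'freeze': True}}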

Loading Pre-trained Modules

All modules - not classifiers - can load parameters from a saved state using the load argument, which can either be a path to a saved PyTorch nn.Module state_dict or the state_dict itself, e.g.:

import torch

from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel
from konfuzio.tokenizers import BPETokenizer

# load a project
project = Project(id_=YOUR_PROJECT_ID)

# specify the tokenizer for the model
tokenizer = BPETokenizer()

# load the label classifier text module parameters from a given path
label_classifier_config = {'dropout_rate': 0.25,
                           'text_module': {'name': 'lstm',
                                           'n_layers': 2,
                                           'bidirectional': False,
                                           'load': 'saved_modules/label_text_module.pt'}}

# load a saved state dict from a path
section_text_module_state_dict = torch.load('saved_modules/section_text_module.pt')

# load the section classifier text module parameters from a state_dict directly
section_classifier_config = {'dropout_rate': 0.5,
                             'text_module': {'name': 'nbow',
                                             'emb_dim': 64,
                                             'load': section_text_module_state_dict}}

# create label section model with classifiers that contain pre-trained text modules
model = LabelSectionModel(project,
                          tokenizer=tokenizer,
                          label_classifier_config=label_classifier_config,
                          section_classifier_config=section_classifier_config)

Extraction

Each model has an extract method that gets the predictions from that model.

OCR

The ability to do OCR tasks is bundled into the FileScanner class. The FileScanner supports multiple OCR solutions and takes text embeddings into account.

The following example runs OCR on a PDF with the default settings.

from konfuzio.ocr import FileScanner

path = 'example.pdf'  # Path to a pdf or image file

with FileScanner(path) as f:
  document_text: str = f.ocr()

In a first step the FileScanner checks if the file has some text embeddings and whether it is likely that the detected text embeddings cover the whole document. This is done by checking the frequency of specific characters like ‘e’, the ratio of ASCII characters, and the number of characters on the pages and in the overall document.

If it is likely that some characters are missing in the embeddings, the OCR process is started. The OCR text is then returned, unless the number of OCR characters is less than the number of text embedding characters, in which case the text embeddings are used.

The default OCR process is based on tesseract with presets for images and scans. In case the document contains some text embeddings, the scan preset is always used. If there are no text embeddings present, the FileScanner uses a blurriness score to decide which preset should be used.

OCR with the Azure Read API

In order to use the Azure Read API you need to set the credentials of an appropriate Azure account via environment variables or the .env file.

AZURE_OCR_BASE_URL = https://****.api.cognitive.microsoft.com
AZURE_OCR_KEY = **********************

from konfuzio.ocr import FileScanner

path = 'example.pdf'  # Path to a pdf or image file

with FileScanner(path, ocr_method='read_v3_fast') as f:
  document_text: str = f.ocr()

The way text embeddings are used does not differ from the default OCR; however, no blurriness score is calculated for the Azure Read API.

The Azure Read API has some limitations regarding file size, page numbers, and rate limits. These limits are updated over time and can be found here: https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text

Additional FileScanner OCR results

The FileScanner provides the OCR results as text, bounding boxes, and a sandwich PDF (a PDF with text embeddings).

from konfuzio.ocr import FileScanner

with FileScanner('example.jpeg') as f_scanner:
  f_scanner.ocr()

f_scanner.text  # str: String representation of the document or image
f_scanner.bbox  # dict: Bounding boxes on a character level
f_scanner.sandwich_file  # BytesIO: When using Azure you need to pass 'read_v3' as ocr_method to get the sandwich file.
f_scanner.is_blurry_image  # boolean: Whether the image was blurry (only set for default OCR)
f_scanner.used_ocr_method  # str: the OCR method used.

Further usages of the FileScanner

use_text_embedding_only: In order to rely on text embeddings only, you can pass use_text_embedding_only=True to the ocr() method call.

file: A file-like object can be used to initialize the FileScanner (instead of the path argument).
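
For example, a sketch combining both options described above:

from konfuzio.ocr import FileScanner

# rely on the existing text embeddings only, skipping OCR entirely
with FileScanner('example.pdf') as f_scanner:
    text_from_embeddings: str = f_scanner.ocr(use_text_embedding_only=True)

# initialize the FileScanner from a file-like object instead of a path
with open('example.pdf', 'rb') as pdf_file:
    with FileScanner(file=pdf_file) as f_scanner:
        document_text: str = f_scanner.ocr()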

LabelSectionModel extraction

from konfuzio.data import Project
from konfuzio.default_models import LabelSectionModel

# load the project
project = Project(id_=YOUR_PROJECT_ID)

# create a default label section model from the project
model = LabelSectionModel(project)

# build (i.e. train) the label section model
label_classifier_metrics, section_classifier_metrics = model.build()

# save the label section model
model.save('saved_label_section_model.pt')

# ... later on in another file

from konfuzio.default_models import load_label_section_model

# load the saved section label model
model = load_label_section_model('saved_label_section_model.pt')

pdf_texts: List[str] = [pdf1_text, pdf2_text, ...]  # the text of each pdf, extracted via ocr

# list of extraction results for each document
results = [model.extract(pdf_text) for pdf_text in pdf_texts]

Each element of results is a Dict[str, Union[List[Dict[str, pd.DataFrame]], pd.DataFrame]], where the keys are the label and section names.

If the key is a label, the value is a DataFrame with columns Label, Accuracy, Candidate, Translated Candidate, Start, and End. Label is the name of the label, Accuracy is the confidence, Candidate and Translated Candidate are the actual token string, and Start and End are the start and end offsets (number of characters from the beginning of the document).

If the key is a section, the value is a list with one element for each detected instance of that section. Each element is a dictionary with the same format as the results dictionary - i.e. keys are either labels or sections, values are DataFrames or lists of dictionaries - with information about the labels and sections within that section. This recursive format allows nested sections.
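
A sketch of how such a result could be traversed (the traversal relies only on the structure described above; no specific label or section names are implied):

import pandas as pd

def print_extraction(result: dict, indent: int = 0) -> None:
    """Recursively walk an extraction result, printing labels and nested sections."""
    for key, value in result.items():
        if isinstance(value, pd.DataFrame):
            # key is a label: a DataFrame of candidate annotations
            print(' ' * indent + f'label {key}: {len(value)} candidate(s)')
        else:
            # key is a section: one dict per detected instance, in the same recursive format
            for instance in value:
                print(' ' * indent + f'section {key}:')
                print_extraction(instance, indent + 2)

for result in results:
    print_extraction(result)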

DocumentModel extraction

from konfuzio.data import Project
from konfuzio.default_models import DocumentModel

# need to write `get_project` ourselves
projects = [get_project(project_id) for project_id in project_ids]

# create default document model from a list of projects
model = DocumentModel(projects)

# build (i.e. train) the document model
document_classifier_metrics = model.build()

# save the document model
model.save('saved_document_model.pt')

# ... later on in another file

from konfuzio.default_models import load_document_model

# load the saved document model
model = load_document_model('saved_document_model.pt')

pdf_paths: List[str] = ['data/1.pdf', 'data/2.pdf', ...]  # path to the pdf file
pdf_texts: List[str] = [pdf1_text, pdf2_text, ...]  # the text of each pdf, extracted via ocr

# list of extraction results for each document
results = [model.extract(pdf_path, pdf_text) for (pdf_path, pdf_text) in zip(pdf_paths, pdf_texts)]

Each element of results is a Tuple[Tuple[str, float], pandas.DataFrame]. The first element is a tuple containing the predicted label as a string and the confidence of that prediction, i.e. ('insurance_contract', 0.6). The DataFrame has a category and a confidence column: category is the predicted label as a string for each class and confidence is the confidence of each of the predictions.
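
A sketch of unpacking a single result under the format described above:

# unpack one extraction result: a (category, confidence) tuple and a per-class DataFrame
(predicted_category, confidence), per_category_df = results[0]

print(f'predicted category: {predicted_category} ({confidence:.2f})')

# one row per class, with 'category' and 'confidence' columns
print(per_category_df.sort_values('confidence', ascending=False))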

Note: when a pdf has multiple pages the DocumentModel makes a prediction on each page individually and then averages the predictions together.