Prepare Training and Testing Data


Prerequisites:

  • Basic knowledge of what training and testing mean in the context of AI models

  • Be familiar with uploading Documents using the Konfuzio UI

  • Have access to a Konfuzio Project or have the permissions to create one

Difficulty: Easy

Goal: Be able to programmatically upload Documents within a Konfuzio Project.


Environment

You need to install the Konfuzio SDK before diving into the tutorial.
To get up and running quickly, you can use our Colab Quick Start notebook.

Alternatively, you can follow the installation section to install and initialize the Konfuzio SDK locally or in an environment of your choice.

Introduction

Before training an AI model, Documents for training and testing need to be uploaded to a Konfuzio Project. This can be done in two ways. The first is the server interface; detailed instructions for uploading Documents this way are available here. The second, which reduces the manual work when many Documents have to be uploaded at once, is to upload them programmatically as described in this tutorial.

Upload Documents

We start by defining the paths of the PDF files we want to upload.

FILE_PATH_1 = 'path/to/pdf_file1.pdf'
FILE_PATH_2 = 'path/to/pdf_file2.pdf'
FILE_PATH_3 = 'path/to/pdf_file3.pdf'

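We import the classes we need. The code below also uses TEST_PROJECT_ID, the id of the default test Project; it must be defined before the code is run, so set it to the id of your own Project: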

from konfuzio_sdk.data import Project, Document

We now initialize a Konfuzio Project object and a list of paths with the PDFs of interest:

# TEST_PROJECT_ID must hold the integer id of your Konfuzio Project
project = Project(id_=TEST_PROJECT_ID)
file_paths = [FILE_PATH_1, FILE_PATH_2, FILE_PATH_3]
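
When all PDFs sit in one folder, listing each path by hand becomes tedious. As a minimal sketch, the list could instead be built with Python's standard pathlib module. The helper name collect_pdf_paths and the idea of a single flat folder are assumptions made for illustration; they are not part of the Konfuzio SDK.

```python
from pathlib import Path
from typing import List


def collect_pdf_paths(folder: str) -> List[str]:
    """Return the sorted paths of all PDF files directly inside `folder`."""
    # Compare the suffix case-insensitively so that 'scan.PDF' is included too.
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == '.pdf'
    )
```

A call such as file_paths = collect_pdf_paths('path/to') would then take the place of the hand-written list.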

Then we create a new Document from each PDF in file_paths:

for document_path in file_paths:
    document = Document.from_file(document_path, project=project, sync=False)
    print(f'Document {document.id_} successfully created.')

This prints:

Document 5885494 successfully created.
Document 5885496 successfully created.
Document 5885497 successfully created.

The Document.from_file method uploads a new Document to the Konfuzio server. With sync=False, the call returns without waiting for the server to finish processing the Document. The documentation provides an overview of the returned values and optional parameters for this method.
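
When many files are uploaded in one loop, a single unreadable file or network error would otherwise stop the whole run. The sketch below shows one possible way to keep going and report the failures at the end. The helper upload_all and its upload argument are hypothetical and not part of the Konfuzio SDK; upload is assumed to be any function that takes a file path and returns the id of the new Document.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def upload_all(
    paths: Iterable[str], upload: Callable[[str], int]
) -> Tuple[List[int], Dict[str, str]]:
    """Call `upload` for every path and collect Document ids and errors.

    `upload` is assumed to take a file path and return the new Document's id.
    """
    created: List[int] = []
    failed: Dict[str, str] = {}
    for path in paths:
        try:
            created.append(upload(path))
        except Exception as error:  # keep going; report the failure afterwards
            failed[path] = str(error)
    return created, failed
```

With the SDK, upload could be a small wrapper around the call used in this tutorial, for example lambda path: Document.from_file(path, project=project, sync=False).id_.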

Optional: Alter Document Status

A Document on the Konfuzio Server can be assigned a Status that defines whether it belongs to the Training or the Test dataset. Setting a Status is necessary to determine which Documents will be used for Training (status 2) and which will be used for Testing (status 3). More information can be found in the Konfuzio Manual, and an example can be found here.
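
Which files end up in which dataset is often decided before uploading. As a minimal sketch, the helper below splits a list of paths into a training group and a test group in a reproducible way. The name split_paths, the 80/20 default, and the fixed seed are illustrative assumptions and not part of the Konfuzio SDK.

```python
import random
from typing import List, Sequence, Tuple


def split_paths(
    paths: Sequence[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle `paths` reproducibly and split them into (training, test)."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError('test_fraction must be between 0 and 1')
    shuffled = list(paths)
    # A local random generator keeps the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

The Documents created from each group can then be given the corresponding Status on the server.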

Conclusion

In this tutorial, we have walked through the essential steps for programmatically uploading PDFs to the Konfuzio Server. Below is the full code to accomplish this task:

from konfuzio_sdk.data import Project, Document

FILE_PATH_1 = 'path/to/pdf_file1.pdf'
FILE_PATH_2 = 'path/to/pdf_file2.pdf'
FILE_PATH_3 = 'path/to/pdf_file3.pdf'

# TEST_PROJECT_ID must hold the integer id of your Konfuzio Project
project = Project(id_=TEST_PROJECT_ID)
file_paths = [FILE_PATH_1, FILE_PATH_2, FILE_PATH_3]

for document_path in file_paths:
    document = Document.from_file(document_path, project=project, sync=False)
    print(f'Document {document.id_} successfully created.')