Prepare Training and Testing Data
Prerequisites:
Basic knowledge of what training and testing mean in the context of AI models
Be familiar with uploading Documents using the Konfuzio UI
Have access to a Konfuzio Project or have the permissions to create one
Difficulty: Easy
Goal: Be able to programmatically upload Documents within a Konfuzio Project.
Environment
You need to install the Konfuzio SDK before diving into the tutorial.
To get up and running quickly, you can use our Colab Quick Start notebook.
As an alternative you can follow the installation section to install and initialize the Konfuzio SDK locally or on an environment of your choice.
Introduction
Before training an AI model, Documents for training and testing need to be uploaded to a Konfuzio Project. This can be done in two ways. The first is the server interface; detailed instructions for uploading Documents are here. The second, which reduces the manual work of uploading many Documents at once, is to follow the instructions in this tutorial.
Upload Documents
We start by defining the paths to the PDF files we want to upload.
FILE_PATH_1 = 'path/to/pdf_file1.pdf'
FILE_PATH_2 = 'path/to/pdf_file2.pdf'
FILE_PATH_3 = 'path/to/pdf_file3.pdf'
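When many files are involved, writing out each path by hand becomes tedious. As a minimal sketch, the paths could instead be collected from a folder with the standard library. The helper name collect_pdf_paths and the folder layout are assumptions made for illustration; they are not part of the Konfuzio SDK.

```python
from pathlib import Path
from typing import List


def collect_pdf_paths(folder: str) -> List[str]:
    """Return the sorted paths of all PDF files directly inside `folder`.

    Hypothetical helper for illustration only. Subfolders are not searched,
    and the file extension is matched case-insensitively.
    """
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == '.pdf'
    )
```

The returned list can be used wherever a list of file paths is expected, for example file_paths = collect_pdf_paths('path/to/folder').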
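We import the classes we need. The name TEST_PROJECT_ID, used below, must hold the ID of your Project:
from konfuzio_sdk.data import Project, Document
# TEST_PROJECT_ID is not defined by this import; set it to the integer ID of your own Project.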
We now initialize a Konfuzio Project object and a list of paths with the PDFs of interest:
project = Project(id_=TEST_PROJECT_ID)
file_paths = [FILE_PATH_1, FILE_PATH_2, FILE_PATH_3]
Then, create a new Document from each PDF in the list of paths file_paths:
for document_path in file_paths:
    document = Document.from_file(document_path, project=project, sync=False)
    print(f'Document {document.id_} successfully created.')
Document 5885494 successfully created.
Document 5885496 successfully created.
Document 5885497 successfully created.
The Document.from_file method uploads a new Document to the Konfuzio server. The documentation provides an overview of the returned values and optional parameters for this method.
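In a larger batch, a single unreadable file or a failed request would stop a plain loop partway through. The sketch below shows one way to keep going and report failures at the end. It is written against a generic upload callable so it does not depend on the SDK; the name upload_all is hypothetical, and catching the broad Exception class is an assumption made because this tutorial does not state which errors an upload can raise.

```python
from typing import Callable, Iterable, List, Tuple


def upload_all(
    paths: Iterable[str],
    upload: Callable[[str], object],
) -> Tuple[List[object], List[Tuple[str, Exception]]]:
    """Call `upload` on every path and separate successes from failures.

    Hypothetical helper: `upload` is any function that takes a file path
    and returns the created object, raising an exception on failure.
    """
    uploaded: List[object] = []
    failed: List[Tuple[str, Exception]] = []
    for path in paths:
        try:
            uploaded.append(upload(path))
        except Exception as error:  # broad on purpose: one bad file should not abort the batch
            failed.append((path, error))
    return uploaded, failed
```

With the SDK, the callable could wrap the call used in this tutorial, for example upload_all(file_paths, lambda path: Document.from_file(path, project=project, sync=False)). The failed list can then be inspected or retried.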
Optional: Alter Document Status
A Document on the Konfuzio Server can be assigned a Status that defines whether it belongs to the Training or the Testing dataset. Setting a Status is necessary to determine which Documents will be used for Training (status 1) and which will be used for Testing (status 2). More information can be found in the Konfuzio Manual, and an example can be found here.
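Before assigning statuses, it helps to decide which files go into which dataset. The following is a minimal sketch of a reproducible split that maps each file path to status 1 (Training) or status 2 (Testing), using the status values given above. The function name, the default test fraction of 0.2, and the seed are assumptions chosen for illustration; the sketch only plans the split and does not call the Konfuzio server.

```python
import random
from typing import Dict, Sequence

TRAINING_STATUS = 1  # status value for Training, as stated in this tutorial
TESTING_STATUS = 2   # status value for Testing, as stated in this tutorial


def assign_dataset_status(
    paths: Sequence[str],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Dict[str, int]:
    """Map each path to a Training or Testing status via a seeded shuffle.

    Hypothetical helper: the same inputs and seed always give the same split.
    """
    if not 0 <= test_fraction <= 1:
        raise ValueError('test_fraction must be between 0 and 1')
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled paths become Testing, the rest Training.
    return {
        path: TESTING_STATUS if index < n_test else TRAINING_STATUS
        for index, path in enumerate(shuffled)
    }
```

The resulting mapping can serve as a plan when setting the Status of each uploaded Document, following the example referenced above.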
Conclusion
In this tutorial, we have walked through the essential steps for programmatically uploading PDFs to the Konfuzio Server. Below is the full code to accomplish this task:
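from konfuzio_sdk.data import Project, Document

# Paths to the PDF files to upload.
FILE_PATH_1 = 'path/to/pdf_file1.pdf'
FILE_PATH_2 = 'path/to/pdf_file2.pdf'
FILE_PATH_3 = 'path/to/pdf_file3.pdf'

# TEST_PROJECT_ID is not defined by the import above; set it to the integer ID of your own Project.
project = Project(id_=TEST_PROJECT_ID)
file_paths = [FILE_PATH_1, FILE_PATH_2, FILE_PATH_3]

# Upload each PDF as a new Document in the Project.
for document_path in file_paths:
    document = Document.from_file(document_path, project=project, sync=False)
    print(f'Document {document.id_} successfully created.')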