{ "cells": [ { "cell_type": "markdown", "id": "5f98ce0b", "metadata": {}, "source": [ "## Prepare Training and Testing Data\n", "\n", "---\n", "\n", "**Prerequisites:**\n", "- Basic knowledge of what training and testing mean in the context of AI models\n", "- Familiarity with uploading [Documents](https://help.konfuzio.com/modules/documents/index.html) using the Konfuzio UI\n", "- Access to a [Konfuzio Project](https://help.konfuzio.com/modules/projects/index.html) or permission to create one\n", "\n", "**Difficulty:** Easy\n", "\n", "**Goal:** Be able to programmatically upload Documents to a Konfuzio Project.\n", "\n", "---\n", "\n", "### Environment\n", "You need to install the Konfuzio SDK before diving into the tutorial. \\\n", "To get up and running quickly, you can use our Colab Quick Start notebook.\n", "\n", "Alternatively, you can follow the [installation section](../get_started.html#install-sdk) to install and initialize the Konfuzio SDK locally or in an environment of your choice.\n", "\n", "### Introduction\n", "Before training an AI model, Documents for training and testing need to be uploaded to a Konfuzio Project. This can be done in two ways. The first is the server interface; detailed instructions for uploading Documents are available [here](https://help.konfuzio.com/modules/documents/index.html). The second, which reduces the manual work of uploading many Documents at once, is the programmatic approach described in this tutorial." ] }, { "cell_type": "markdown", "id": "9b0751f9", "metadata": {}, "source": [ "### Upload Documents" ] }, { "cell_type": "markdown", "id": "e9876078", "metadata": {}, "source": [ "We start by defining the paths to the PDF Documents we want to upload." 
] }, { "cell_type": "code", "execution_count": null, "id": "13035070", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "FILE_PATH_1 = 'path/to/pdf_file1.pdf'\n", "FILE_PATH_2 = 'path/to/pdf_file2.pdf'\n", "FILE_PATH_3 = 'path/to/pdf_file3.pdf'" ] }, { "cell_type": "code", "execution_count": null, "id": "a478c54f", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# This cell gets removed when the notebook is compiled as markdown\n", "FILE_PATH = '../../../../tests/test_data/pdf.pdf'\n", "\n", "# Use the same file for the sake of local testing\n", "FILE_PATH_1 = FILE_PATH_2 = FILE_PATH_3 = FILE_PATH" ] }, { "cell_type": "markdown", "id": "f036cbd5", "metadata": {}, "source": [ "We import the ID of the default test Project, as well as the classes we need from the SDK:" ] }, { "cell_type": "code", "execution_count": null, "id": "cf646112", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# This is necessary to make sure we can import from 'tests'\n", "import sys\n", "sys.path.insert(0, '../../../../')\n", "from tests.variables import TEST_PROJECT_ID" ] }, { "cell_type": "code", "execution_count": null, "id": "335efcb0", "metadata": {}, "outputs": [], "source": [ "from konfuzio_sdk.data import Project, Document" ] }, { "cell_type": "markdown", "id": "fc11d696", "metadata": {}, "source": [ "We now initialize a Konfuzio Project object and a list of paths to the PDFs of interest:" ] }, { "cell_type": "code", "execution_count": null, "id": "5ce992c5", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "project = Project(id_=TEST_PROJECT_ID)\n", "file_paths = [FILE_PATH_1, FILE_PATH_2, FILE_PATH_3]" ] }, { "cell_type": "markdown", "id": "80d3d5a6", "metadata": {}, "source": [ "Then, we create a new Document from each PDF in the list `file_paths`:" ] }, { "cell_type": "code", "execution_count": null, "id": "fdc72ffc", "metadata": { "lines_to_next_cell": 0 }, "outputs": [], 
"source": [ "for document_path in file_paths:\n", "    document = Document.from_file(document_path, project=project, sync=False)\n", "    print(f'Document {document.id_} successfully created.')" ] }, { "cell_type": "code", "execution_count": null, "id": "20c3d9bc", "metadata": { "lines_to_next_cell": 0, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "project = Project(id_=TEST_PROJECT_ID, update=True)\n", "for document in project._documents[-len(file_paths):]:\n", "    document.delete(delete_online=True)" ] }, { "cell_type": "markdown", "id": "ec35c16e", "metadata": {}, "source": [ "The `Document.from_file` method uploads a new Document to the Konfuzio Server. The [documentation](https://dev.konfuzio.com/sdk/sourcecode.html#document) provides an overview of the return value and optional parameters of this method." ] }, { "cell_type": "markdown", "id": "de98b5a3", "metadata": { "link": "get_started.html#modify-document" }, "source": [ "### Optional: Alter Document Status\n", "A Document on the Konfuzio Server can be assigned a Status that defines whether it belongs to the Training or the Testing dataset. Setting a Status is necessary to determine which Documents will be used for Training (status 2) and which will be used for Testing (status 3). More information can be found in the [Konfuzio Manual](https://help.konfuzio.com/modules/documents/index.html#id1), and an example can be found [here](https://dev.konfuzio.com/sdk/get_started.html#modify-document)." ] }, { "cell_type": "markdown", "id": "bfbffa29", "metadata": {}, "source": [ "### Conclusion\n", "In this tutorial, we have walked through the essential steps for programmatically uploading PDFs to the Konfuzio Server. 
Below is the full code to accomplish this task:" ] }, { "cell_type": "code", "execution_count": null, "id": "dfb87606", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project, Document\n", "\n", "# Placeholder value: replace with the ID of your own Konfuzio Project\n", "TEST_PROJECT_ID = 123\n", "\n", "FILE_PATH_1 = 'path/to/pdf_file1.pdf'\n", "FILE_PATH_2 = 'path/to/pdf_file2.pdf'\n", "FILE_PATH_3 = 'path/to/pdf_file3.pdf'\n", "\n", "project = Project(id_=TEST_PROJECT_ID)\n", "file_paths = [FILE_PATH_1, FILE_PATH_2, FILE_PATH_3]\n", "\n", "# Upload each PDF as a new Document in the Project\n", "for document_path in file_paths:\n", "    document = Document.from_file(document_path, project=project, sync=False)\n", "    print(f'Document {document.id_} successfully created.')" ] } ], "metadata": { "kernelspec": { "display_name": "konfuzio", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }