{ "cells": [ { "cell_type": "markdown", "id": "a59ecdd2", "metadata": {}, "source": [ "### Environment\n", "You need to install the Konfuzio SDK before diving into the tutorial. \\\n", "To get up and running quickly, you can use our Colab Quick Start notebook.\n", "\n", "Alternatively, you can follow the [installation section](../get_started.html#install-sdk) to install and initialize the Konfuzio SDK locally or in an environment of your choice.\n", "\n", "## Example Usage\n", "\n", "Make sure to set up your Project (so that you can retrieve the Project ID) using our [Konfuzio Guide](https://help.konfuzio.com/tutorials/quickstart/index.html).\n", "\n", "### Project\n", "\n", "Retrieve all information available for your Project:" ] }, { "cell_type": "code", "execution_count": null, "id": "f6ea5de7", "metadata": { "lines_to_next_cell": 0, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import logging\n", "from konfuzio_sdk.api import get_project_list\n", "from tests.variables import TEST_SNAPSHOT_ID\n", "\n", "logging.getLogger(\"konfuzio_sdk\").setLevel(logging.ERROR)\n", "projects = get_project_list()\n", "# we want to get the last instance of a project restored from a snapshot because creating a new one each time takes longer \n", "YOUR_PROJECT_ID = next(project['id'] for project in reversed(projects) if TEST_SNAPSHOT_ID in project['name'])" ] }, { "cell_type": "code", "execution_count": null, "id": "547e9f76", "metadata": {}, "outputs": [], "source": [ "from konfuzio_sdk.data import Project, Document\n", "\n", "my_project = Project(id_=YOUR_PROJECT_ID)" ] }, { "cell_type": "markdown", "id": "79f79cc1", "metadata": {}, "source": [ "The information will be stored in the data folder that you specified during package initialization.\n", "A subfolder will be created for each Document in the Project.\n", "\n", "Whenever the Project changes on the Konfuzio Server, you can update the local Project like this:" ] 
}, { "cell_type": "code", "execution_count": null, "id": "e191c323", "metadata": {}, "outputs": [], "source": [ "my_project.get(update=True)" ] }, { "cell_type": "markdown", "id": "183452cf", "metadata": {}, "source": [ "To make sure that your Project is loaded with all the latest data:" ] }, { "cell_type": "code", "execution_count": null, "id": "8fd23cdf", "metadata": {}, "outputs": [], "source": [ "my_project = Project(id_=YOUR_PROJECT_ID, update=True)" ] }, { "cell_type": "markdown", "id": "ff76854b", "metadata": {}, "source": [ "### Documents\n", "\n", "Every Document has a status indicating its current stage of processing. The Document status codes are:\n", "\n", " - Queuing for OCR: 0\n", " - Queuing for extraction: 1\n", " - Done: 2\n", " - Could not be processed: 111\n", " - OCR in progress: 10\n", " - Extraction in progress: 20\n", " - Queuing for categorization: 3\n", " - Categorization in progress: 30\n", " - Queuing for splitting: 4\n", " - Splitting in progress: 40\n", " - Waiting for splitting confirmation: 41\n", "\n", "To access the Documents in the Project you can use:" ] }, { "cell_type": "code", "execution_count": null, "id": "fed0e471", "metadata": {}, "outputs": [], "source": [ "documents = my_project.documents" ] }, { "cell_type": "markdown", "id": "71429097", "metadata": {}, "source": [ "By default, this returns the Documents with the Training status (dataset_status = 2). The dataset status codes are:\n", "\n", "- None: 0\n", "- Preparation: 1\n", "- Training: 2\n", "- Test: 3\n", "- Excluded: 4\n", "\n", "The Test Documents can be accessed directly with:" ] }, { "cell_type": "code", "execution_count": null, "id": "24f0acf1", "metadata": {}, "outputs": [], "source": [ "test_documents = my_project.test_documents" ] }, { "cell_type": "markdown", "id": "a551a3b5", "metadata": {}, "source": [ "For more details, you can check out the [Project Documentation](https://dev.konfuzio.com/sdk/sourcecode.html#project)." 
] }, { "cell_type": "markdown", "id": "c062fa42", "metadata": {}, "source": [ "By default, you get 4 files for each Document that contain information of the text, Pages, Annotation Sets and \n", "Annotations. You can see these files inside the Document folder.\n", "\n", "`.txt` file contains the text of the Document. If OCR was used, it will correspond to the result from the OCR.\n", "\n", "```\n", " x02 328927/10103/00104\n", "Abrechnung der Brutto/Netto-Bezüge für Dezember 2018 22.05.2018 Bat: 1\n", "\n", "Personal-Nr. Geburtsdatum ski Faktor Ki,Frbtr.Konfession ‚Freibetragjährl.! |Freibetrag mt! |DBA iGleitzone 'St.-Tg. VJuUr. üb. |Url. Anspr. Url.Tg.gen. |Resturlaub\n", "00104 150356 1 | ‚ev 30 400 3000 3400\n", "\n", "SV-Nummer |Krankenkasse KK%®|PGRS Bars jum.SV-Tg. Anw. Tage |UrlaubTage Krankh. Tg. Fehlz. Tage\n", "\n", "50150356B581 AOK Bayern Die Gesundheitskas 157 101 1111 1 30\n", "\n", " Eintritt ‚Austritt Anw.Std. Urlaub Std. |Krankh. Std. |Fehlz. Std.\n", "\n", " 170299 L L l L l l\n", " - + Steuer-ID IMrB? Zeitlohn Sta.|Überstd. |Bez. Sta.\n", " Teststraße123\n", " 12345 Testort 12345678911 \\ ı ı \\\n", " B/N\n", " Pers.-Nr. 00104 x02\n", " Abt.-Nr. 
A12 10103 HinweisezurAbrechnung\n", "```\n", "\n", "**pages.json5** - Contains information about each Page of the Document (for example, their ids and sizes).\n", "\n", "```\n", "[\n", " {\n", " \"id\": 1923,\n", " \"image\": \"/page/show/1923/\",\n", " \"number\": 1,\n", " \"original_size\": [\n", " 595.2,\n", " 841.68\n", " ],\n", " \"size\": [\n", " 1414,\n", " 2000\n", " ]\n", " }\n", "]\n", "```\n", "\n", "**annotation_sets.json5** - Contains information about each Annotation Set in the Document and the Annotations that constitute\n", "it.\n", "\n", "```\n", "{\n", " \"id\": 78730,\n", " \"label_set\": {\n", " \"api_name\": \"Lohnabrechnung\",\n", " \"description\": \"\",\n", " \"has_multiple_annotation_sets\": false,\n", " \"id\": 63,\n", " \"name\": \"Lohnabrechnung\"\n", " },\n", " \"labels\": [\n", " {\n", " \"annotations\": [\n", " {\n", " \"confidence\": 0.93,\n", " \"created_by\": \"user@mail.com\",\n", " \"custom_offset_string\": false,\n", " \"document\": 44823,\n", " \"id\": 4420351,\n", " \"is_correct\": true,\n", " \"normalized\": 2189.07,\n", " \"offset_string\": \"2.189,07\",\n", " \"offset_string_original\": \"2.189,07\",\n", " \"origin\": \"api.v2\",\n", " \"revised\": false,\n", " \"revised_by\": null,\n", " \"selection_bbox\": {\n", " \"page_index\": 0,\n", " \"x0\": 516.48,\n", " \"x1\": 562.8,\n", " \"y0\": 76.829,\n", " \"y1\": 87.829\n", " },\n", " \"span\": [\n", " {\n", " \"end_offset\": 3785,\n", " \"offset_string\": \"2.189,07\",\n", " \"offset_string_original\": \"2.189,07\",\n", " \"page_index\": 0,\n", " \"start_offset\": 3777,\n", " \"x0\": 516.48,\n", " \"x1\": 562.8,\n", " \"y0\": 76.829,\n", " \"y1\": 87.829\n", " }\n", " ],\n", " \"translated_string\": null\n", " }\n", "```\n", "\n", "**annotations.json5** - Contains information about each Annotation in the Document (for example, their Labels and Bounding \n", "Boxes).\n", "\n", "```\n", "[\n", " {\n", " \"accuracy\": null,\n", " \"bbox\": {\n", " \"bottom\": 44.369,\n", " \"line_index\": 
1,\n", " \"page_index\": 0,\n", " \"top\": 35.369,\n", " \"x0\": 468.48,\n", " \"x1\": 527.04,\n", " \"y0\": 797.311,\n", " \"y1\": 806.311\n", " },\n", " \"bboxes\": [\n", " {\n", " \"bottom\": 44.369,\n", " \"end_offset\": 169,\n", " \"line_number\": 2,\n", " \"offset_string\": \"22.05.2018\",\n", " \"offset_string_original\": \"22.05.2018\",\n", " \"page_index\": 0,\n", " \"start_offset\": 159,\n", " \"top\": 35.369,\n", " \"x0\": 468.48,\n", " \"x1\": 527.04,\n", " \"y0\": 797.311,\n", " \"y1\": 806.311\n", " }\n", " ],\n", " \"created_by\": 59,\n", " \"custom_offset_string\": false,\n", " \"end_offset\": 169,\n", " \"get_created_by\": \"user@mail.com\",\n", " \"get_revised_by\": \"n/a\",\n", " \"id\": 4419937,\n", " \"is_correct\": true,\n", " \"label\": 867,\n", " \"label_data_type\": \"Date\",\n", " \"label_text\": \"Austellungsdatum\",\n", " \"label_threshold\": 0.1,\n", " \"normalized\": \"2018-05-22\",\n", " \"offset_string\": \"22.05.2018\",\n", " \"offset_string_original\": \"22.05.2018\",\n", " \"revised\": false,\n", " \"revised_by\": null,\n", " \"section\": 78730,\n", " \"section_label_id\": 63,\n", " \"section_label_text\": \"Lohnabrechnung\",\n", " \"selection_bbox\": {\n", " \"bottom\": 44.369,\n", " \"line_index\": 1,\n", " \"page_index\": 0,\n", " \"top\": 35.369,\n", " \"x0\": 468.48,\n", " \"x1\": 527.04,\n", " \"y0\": 797.311,\n", " \"y1\": 806.311\n", " },\n", " \"start_offset\": 159,\n", " \"translated_string\": null\n", " },\n", "...\n", "]\n", "```\n", "\n", "When `Document.get_bbox()` is called, an additional file, **bbox.zip**, is downloaded to the Document folder. It contains the Bounding Box information of the characters of the Document. Because this file can be quite large, it is compressed in the ZIP format. The decompressed file is a JSON file where the keys correspond to the indices of the characters in the Document text. 
The value associated with each key contains the Bounding Box information of the character. For example, for characters 1000 and 1002 we would have:\n", " \n", "```\n", "{\n", " \"1000\": {\n", " \"adv\": 2.58,\n", " \"bottom\": 128.13,\n", " \"doctop\": 118.13,\n", " \"fontname\": \"GlyphLessFont\",\n", " \"height\": 10.0,\n", " \"line_number\": 14,\n", " \"object_type\": \"char\",\n", " \"page_number\": 1,\n", " \"size\": 10.0,\n", " \"text\": \"n\",\n", " \"top\": 118.13,\n", " \"upright\": 1,\n", " \"width\": 2.58,\n", " \"x0\": 481.74,\n", " \"x1\": 484.32,\n", " \"y0\": 713.55,\n", " \"y1\": 723.55\n", " },\n", " \"1002\": {\n", " \"adv\": 2.64,\n", " \"bottom\": 128.13,\n", " \"doctop\": 118.13,\n", " \"fontname\": \"GlyphLessFont\",\n", " \"height\": 10.0,\n", " \"line_number\": 14,\n", " \"object_type\": \"char\",\n", " \"page_number\": 1,\n", " \"size\": 10.0,\n", " \"text\": \"S\",\n", " \"top\": 118.13,\n", " \"upright\": 1,\n", " \"width\": 2.64,\n", " \"x0\": 486.72,\n", " \"x1\": 489.36,\n", " \"y0\": 713.55,\n", " \"y1\": 723.55\n", " },\n", "// ...\n", "}\n", "```\n", "\n", "After downloading these files, their paths will become available in the Project instance.\n", "\n", "You can get the path to the folder containing the Documents' folders with:" ] }, { "cell_type": "code", "execution_count": null, "id": "d7bacb85", "metadata": {}, "outputs": [], "source": [ "my_project.documents_folder" ] }, { "cell_type": "markdown", "id": "d9af8108", "metadata": { "lines_to_next_cell": 0 }, "source": [ "You can get the path to the file containing the Document text with:" ] }, { "cell_type": "code", "execution_count": null, "id": "c793f1bc", "metadata": { "lines_to_next_cell": 0, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "original_document_text = Project(id_=46).get_document_by_id(44823).text\n", "document = [document for document in my_project.documents if document.text == original_document_text][0]" ] }, { "cell_type": "code", "execution_count": 
null, "id": "b4234afb", "metadata": {}, "outputs": [], "source": [ "document.txt_file_path" ] }, { "cell_type": "markdown", "id": "f9de8bb1", "metadata": { "lines_to_next_cell": 0 }, "source": [ "#### Upload Document\n", "\n", "Before you can upload a new file to your Project using the Konfuzio SDK, you must have completed the following steps:\n", "\n", "1. Register for a Konfuzio account\n", "2. Create a Project on Konfuzio\n", "3. Install the Konfuzio SDK\n", "\n", "For detailed instructions on these preliminary steps, refer to the [Get Started guide](https://dev.konfuzio.com/sdk/get_started.html#get-started).\n", "\n", "After completing the above steps, you can proceed with uploading a new file to your Project using the Konfuzio SDK. The \n", "file must be of one of the types listed in the [Supported File Types](https://help.konfuzio.com/specification/supported_file_types/index.html). \n", "Here, we're focusing on the `Document.from_file` method to create a [Konfuzio Document](https://dev.konfuzio.com/sdk/sourcecode.html#document).\n", "\n", "A Konfuzio Document is an object representing the file you upload. It will contain the OCR (Optical Character Recognition) \n", "information of the file once processed by Konfuzio's server.\n", "\n", "###### Synchronous and Asynchronous Upload\n", "\n", "You have two options for uploading your file: a synchronous method and an asynchronous method. The method is determined \n", "by the `sync` parameter of the `from_file` method.\n", "\n", "1. **Synchronous upload (sync=True)**: The file is uploaded to the Konfuzio servers, and the method waits for the \n", "file to be processed. Once done, it returns a Document object with the OCR information. 
This is useful if you want \n", "to start working with the Document immediately after the OCR processing is completed.\n", "\n", " Here's an example of how to use the `from_file` method with `sync` set to `True`:" ] }, { "cell_type": "code", "execution_count": null, "id": "04e9b5d6", "metadata": { "lines_to_next_cell": 0, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import time\n", "FILE_PATH = 'tests/test_data/pdf.pdf'\n", "ASSIGNEE_ID = None" ] }, { "cell_type": "code", "execution_count": null, "id": "41c7efd1", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "document = Document.from_file(FILE_PATH, project=my_project, sync=True)" ] }, { "cell_type": "markdown", "id": "af75eeca", "metadata": {}, "source": [ "2. **Asynchronous upload (sync=False)**: With this setting, the method immediately returns an empty Document object \n", "after initiating the upload. The OCR processing takes place in the background. This method is advantageous when \n", "uploading a large file or a large number of files, as it doesn't require waiting for each file's processing to complete.\n", "\n", " Here is how to use the asynchronous method:" ] }, { "cell_type": "code", "execution_count": null, "id": "8bc80b00", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "document = Document.from_file(FILE_PATH, project=my_project, sync=False)" ] }, { "cell_type": "markdown", "id": "ebe18f7c", "metadata": {}, "source": [ "After asynchronous upload, you can check the status of the Document processing using the `Document.update()` method on \n", "the returned Document object. If the Document is ready, this method will update the Document object with the OCR information.\n", "\n", "It's important to note that if the Document is not ready, you may need to call `Document.update()` again at a later time. 
\n", "This could be done manually or by setting up a looping mechanism depending on your application's workflow.\n", "\n", "To check if the Document is ready and update it with the OCR information, you can implement a custom polling strategy \n", "like this:" ] }, { "cell_type": "code", "execution_count": null, "id": "3d243a27", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "for i in range(2):\n", " document.update()\n", " if document.ocr_ready is True:\n", " print(document.text)\n", " break\n", " time.sleep(i * 10 + 3)" ] }, { "cell_type": "markdown", "id": "cadde092", "metadata": {}, "source": [ "For a more sophisticated polling method for asynchronously uploaded Documents using the callback function, you can \n", "check out our :ref:`tutorial on how to use ngrok to receive callbacks from the Konfuzio Server`.\n", "\n", "###### Timeout Parameter\n", "\n", "When making a server request, there's a default timeout value of 2 minutes. This means that if the server doesn't respond \n", "within 2 minutes, the operation will stop waiting for a response and return an error. If you're uploading a larger file, \n", "it might take more time to process, and the default timeout value might not be sufficient. In such a case, you can \n", "increase the timeout by setting the `timeout` parameter to a higher value." 
] }, { "cell_type": "code", "execution_count": null, "id": "44fe7e11", "metadata": { "lines_to_next_cell": 2, "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "document = Document.from_file(FILE_PATH, project=my_project, timeout=300, sync=True)" ] }, { "cell_type": "markdown", "id": "f1a0fe62", "metadata": {}, "source": [ "#### Modify Document\n", "\n", "If you would like to use the SDK to modify a Document's metadata, such as the dataset status or the assignee, you can do\n", "it like this:" ] }, { "cell_type": "code", "execution_count": null, "id": "dd935434", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "document.assignee = ASSIGNEE_ID\n", "document.dataset_status = 2\n", "\n", "document.save_meta_data()" ] }, { "cell_type": "markdown", "id": "d21307ce", "metadata": {}, "source": [ "#### Update Document\n", "If the Document has changed on the Konfuzio Server, you can update your local version of the Document with:" ] }, { "cell_type": "code", "execution_count": null, "id": "c1e76c67", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "document.update()" ] }, { "cell_type": "markdown", "id": "2b81dd77", "metadata": {}, "source": [ "If a Document is part of the Training or Test set, you can also update it by updating the entire Project via\n", "`Project.get(update=True)`. However, for Projects with many Documents it can be faster to update only the relevant Documents.\n", "\n", "#### Download PDFs\n", "To get the PDFs of the Documents, you can use `get_file()`." 
] }, { "cell_type": "code", "execution_count": null, "id": "5b58bb74", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "for document in my_project.documents:\n", " document.get_file()" ] }, { "cell_type": "markdown", "id": "17eec587", "metadata": {}, "source": [ "This will download the OCR version of the Document, which contains the text, the Bounding Box\n", "information of the characters and the image of the Document.\n", "\n", "In the Document folder, you will see a new file with the original name followed by \"_ocr\".\n", "\n", "If you want the original version of the Document (without OCR), you can use `ocr_version=False`." ] }, { "cell_type": "code", "execution_count": null, "id": "f2079184", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "for document in my_project.documents:\n", " document.get_file(ocr_version=False)" ] }, { "cell_type": "markdown", "id": "a37834ec", "metadata": {}, "source": [ "In the Document folder, you will see a new file with the original name.\n", "\n", "#### Download pages as images\n", "To get the Pages of the Document as PNG images, you can use `get_images()`." ] }, { "cell_type": "code", "execution_count": null, "id": "670b4a4b", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "for document in my_project.documents:\n", " document.get_images()" ] }, { "cell_type": "markdown", "id": "6a1f50a3", "metadata": {}, "source": [ "You will get one PNG image named \"page_number_of_page.png\" for each Page in the Document.\n", "\n", "#### Download bounding boxes of the characters\n", "To get the Bounding Box information of the characters, you can use `get_bbox()`." 
] }, { "cell_type": "code", "execution_count": null, "id": "2930436d", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "for document in my_project.documents:\n", " document.get_bbox()" ] }, { "cell_type": "markdown", "id": "007121fa", "metadata": {}, "source": [ "You will get a file named \"bbox.zip\" in the Document folder. This file contains the \"bbox.json5\" file. You can find the\n", "path to the zip file in the Document instance with:" ] }, { "cell_type": "code", "execution_count": null, "id": "3bcd7f60", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "document.bbox_file_path" ] }, { "cell_type": "markdown", "id": "f034fa5a", "metadata": {}, "source": [ "#### Delete Document\n", "\n", "##### Delete Document Locally\n", "To locally delete a Document, you can use:" ] }, { "cell_type": "code", "execution_count": null, "id": "d3e26bf3", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "document.delete()" ] }, { "cell_type": "markdown", "id": "7afb5302", "metadata": {}, "source": [ "The Document will be deleted from your local data folder, but it will remain in the Konfuzio Server.\n", "If you want to get it again, you can update the Project.\n", "\n", "##### Delete Document Online\n", "\n", "If you would like to delete a Document on the remote server, you can use the `Document.delete` method with the `delete_online` parameter set to `True`. You can only delete Documents with a dataset status of None (0). **Be careful!** Once the Document is deleted online, we will have no way of recovering it. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "b599f52e", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "document.delete(delete_online=True)" ] }, { "cell_type": "markdown", "id": "ce058ca4", "metadata": {}, "source": [ "If `delete_online` is set to False (the default), the Document will only be deleted on your local machine, and will be \n", "reloaded next time you load the Project." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }