{ "cells": [ { "cell_type": "markdown", "id": "55e752d1", "metadata": {}, "source": [ "## Asynchronously uploading and processing multiple files using webhooks\n", "\n", "---\n", "\n", "**Prerequisites:**\n", "- Data Layer concepts of Konfuzio: Project, Document\n", "- Understanding of APIs\n", "- Basic knowledge of web servers and how they handle incoming requests\n", "- Basic understanding of asynchronous programming concepts, including threads and event handling\n", "- Familiarity with ngrok and its setup process to expose a local server to the internet\n", "- Basic knowledge of HTTP protocols, particularly POST requests and JSON payloads\n", "- Comfortable navigating and executing commands in a terminal or command prompt\n", "\n", "**Difficulty:** Hard\n", "\n", "**Goal:**\n", "This tutorial aims to guide users in efficiently uploading a large number of files to Konfuzio by utilizing the `Document.from_file` (see [docs](https://dev.konfuzio.com/sdk/sourcecode.html#document)) method in asynchronous mode. The primary objectives achieved through this tutorial are:\n", "\n", "- Efficient File Upload\n", "- Real-time Notifications\n", "- Seamless Integration with ngrok\n", "- Automated Update of Files\n", "---\n", "\n", "### Environment\n", "You need to install the Konfuzio SDK before diving into the tutorial. \\\n", "To get up and running quickly, you can use our Colab Quick Start notebook.\n", "\n", "As an alternative, you can follow the [installation section](../get_started.html#install-sdk) to install and initialize the Konfuzio SDK locally or on an environment of your choice.\n", "\n", "### Introduction\n", "\n", "Uploading a large number of files to Konfuzio can be made highly convenient by employing the `Document.from_file` method in asynchronous mode. This approach allows for the simultaneous upload of multiple files without the need to wait for backend processing. 
However, a drawback of this method is the lack of real-time updates on processing status. This tutorial outlines a solution using a webhook callback URL to receive notifications once processing is complete, allowing for timely access to results." ] }, { "cell_type": "markdown", "id": "6966e2b1", "metadata": {}, "source": [ "### Preliminary Steps" ] }, { "cell_type": "markdown", "id": "58a7c597", "metadata": {}, "source": [ "**Install Flask**\n", "\n", "Install Flask, which we will use to create a simple web server that will receive the callback from the Konfuzio\n", "Server. You can install it with pip:\n", "\n", "```console\n", "pip install flask\n", "```\n", "\n", "**Set up ngrok**\n", "\n", "Then you will need to set up ngrok. If you already have a public web server able to receive POST calls, you can \n", "skip this step and use the URL of your web server's callback endpoint as the callback URL. To set up ngrok, first \n", "create an account on the [ngrok website](https://ngrok.com/). It is free, and you can use your GitHub or Google \n", "account.\n", "\n", "Once logged into ngrok, follow the simple instructions available at https://dashboard.ngrok.com/get-started/setup\n", "\n", "On Linux, all you need to do is:\n", "- Download ngrok\n", "- Follow the instructions to add the authentication token\n", "- Run this in a terminal:\n", "\n", "```console\n", "./ngrok http 5000\n", "```\n", "\n", "This should give you the URL you can use as a callback URL. It should look something like \n", "\"https://abcd-12-34-56-789.ngrok-free.app\".\n", "\n", "Now that we have ngrok set up, we can see how to use it to pull the results of asynchronously uploaded files." 
] }, { "cell_type": "markdown", "id": "01a9be91", "metadata": {}, "source": [ "### Retrieving asynchronously uploaded files using a callback URL\n", "\n", "Import the necessary modules:" ] }, { "cell_type": "code", "execution_count": null, "id": "f0b2cedb", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "from flask import Flask, request\n", "from konfuzio_sdk.data import Project, Document\n", "import threading\n", "from werkzeug.serving import run_simple" ] }, { "cell_type": "markdown", "id": "82e00985", "metadata": {}, "source": [ "Initialize the Project: " ] }, { "cell_type": "code", "execution_count": null, "id": "7a3ca942", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "project = Project(id_=YOUR_PROJECT_ID)" ] }, { "cell_type": "markdown", "id": "f6af3de2", "metadata": {}, "source": [ "Create a Flask application:" ] }, { "cell_type": "code", "execution_count": null, "id": "f8ecc262", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "app = Flask(__name__)" ] }, { "cell_type": "markdown", "id": "fe58ab66", "metadata": {}, "source": [ "Set the callback URL. You will find this callback URL in the ngrok console where you ran `./ngrok http 5000`. It should look like \"https://abcd-12-34-56-789.ngrok-free.app\"." ] }, { "cell_type": "code", "execution_count": null, "id": "0d62cdde", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "callback_url = YOUR_CALLBACK_URL" ] }, { "cell_type": "markdown", "id": "3bbb72fb", "metadata": {}, "source": [ "Initialize data structures to share information between the threads. We will use the main thread to host our Flask application and to receive the callback responses. We will use a separate thread to send the files to the Konfuzio Server. The `callback_data_dict` stores the callback responses. 
The `data_lock` synchronizes access to the `callback_data_dict`, so that we can safely access it from both threads." ] }, { "cell_type": "code", "execution_count": null, "id": "70de2d1d", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "callback_data_dict = {}\n", "data_lock = threading.Lock()" ] }, { "cell_type": "markdown", "id": "a4e2df11", "metadata": {}, "source": [ "Now we can create the callback function that will receive the callback responses from the Konfuzio server. We store the callback response in the `callback_data_dict` and set the `callback_received` event to notify the thread waiting on that file that the callback response has been received and that the file can be updated with the \n", "new OCR information." ] }, { "cell_type": "code", "execution_count": null, "id": "df9455fc", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "@app.route('/', methods=['POST'])\n", "def callback():\n", "    data = request.json\n", "    file_name = data.get('data_file_name')\n", "    with data_lock:\n", "        if file_name is not None and file_name in callback_data_dict:\n", "            callback_data_dict[file_name]['callback_data'] = data\n", "            callback_data_dict[file_name]['callback_received'].set()\n", "    return '', 200" ] }, { "cell_type": "markdown", "id": "4ce2d9a4", "metadata": {}, "source": [ "Next, create the function that will send the files to the Konfuzio Server. We create a [Document](https://dev.konfuzio.com/sdk/sourcecode.html#document) object for each file and set the `sync` parameter to `False` to indicate that we want to upload the files asynchronously. We also set the `callback_url` parameter to the callback URL we created earlier.\n", "\n", "We then start a thread for each Document to wait for the callback response to be received. Once the callback response for a Document has been received, we can update it with the OCR information." 
] }, { "cell_type": "code", "execution_count": null, "id": "457c379f", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "def update_file(document, file_name):\n", "    print(f'Waiting for callback for {document}')\n", "    callback_data_dict[file_name]['callback_received'].wait()\n", "\n", "    print(f'Received callback for {document}')\n", "    document.update()\n", "    assert document.ocr_ready\n", "\n", "    print(f'Updated {document} information with OCR results')\n", "\n", "def send_files(file_names):\n", "    for file_name in file_names:\n", "        with data_lock:\n", "            callback_data_dict[file_name] = {'callback_received': threading.Event(), 'callback_data': None, 'document': None}\n", "        print(f'Sending {file_name} to Konfuzio servers...')\n", "        document = Document.from_file(file_name, project=project, sync=False, callback_url=callback_url)\n", "        with data_lock:\n", "            callback_data_dict[file_name]['document'] = document\n", "\n", "    for file_name in callback_data_dict:\n", "        threading.Thread(target=update_file, args=(callback_data_dict[file_name]['document'], file_name,)).start()" ] }, { "cell_type": "markdown", "id": "f391a77f", "metadata": {}, "source": [ "Finally, we can start the Flask application and send the files. Add the path to all the files you want to upload. " ] }, { "cell_type": "code", "execution_count": null, "id": "78bf9440", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "if __name__ == '__main__':\n", "    thread = threading.Thread(target=lambda: run_simple(\"0.0.0.0\", 5000, app))\n", "    thread.start()\n", "    file_names = ['LIST.pdf', 'OF.jpg', 'FILES.tiff']\n", "    threading.Thread(target=send_files, args=(file_names,)).start()" ] }, { "cell_type": "markdown", "id": "c8c4b88e", "metadata": {}, "source": [ "### Conclusion\n", "\n", "In this tutorial, we explored a powerful method for efficiently uploading a large number of files to Konfuzio using asynchronous mode and a webhook callback URL. 
By leveraging the `Document.from_file` method and ngrok for exposing a local server, we've enabled simultaneous file uploads without the need to wait for backend processing. Additionally, the callback function notifies us as soon as processing of a file is complete, allowing for timely access to results.\n", "\n", "By following this tutorial, you've seen how to integrate Konfuzio with ngrok and a Flask callback endpoint for Document processing tasks. This approach provides a foundation for building automated solutions for Document management and analysis." ] } ], "metadata": { "kernelspec": { "display_name": "konfuzio", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }