Asynchronously uploading and processing multiple files using webhooks
Prerequisites:
Data Layer concepts of Konfuzio: Project, Document
Understanding of API
Basic knowledge of web servers and how they handle incoming requests
Basic understanding of asynchronous programming concepts, including threads and event handling
Familiarity with ngrok and its setup process to expose a local server to the internet
Basic knowledge of HTTP protocols, particularly POST requests and JSON payloads
Comfortable navigating and executing commands in a terminal or command prompt
Difficulty: Hard
Goal:
This tutorial shows how to upload a large number of files to Konfuzio efficiently by using the Document.from_file method (see docs) in asynchronous mode. The primary objectives of this tutorial are:
Efficient File Upload
Real-time Notifications
Seamless Integration with ngrok
Automated Update of Files
Environment
You need to install the Konfuzio SDK before diving into the tutorial.
To get up and running quickly, you can use our Colab Quick Start notebook.
Alternatively, you can follow the installation section to install and initialize the Konfuzio SDK locally or in an environment of your choice.
Introduction
Uploading a large number of files to Konfuzio is convenient with the Document.from_file method in asynchronous mode. This approach lets you upload multiple files without waiting for backend processing to finish. The drawback is that you get no real-time updates on the processing status. This tutorial outlines a solution that uses a webhook callback URL to receive a notification once processing is complete, so that you can access the results promptly.
Preliminary Steps
Install Flask
Install Flask, which we will use to create a simple web server that will receive the callback from the Konfuzio Server. You can do it using pip:
pip install flask
Set up ngrok
Then you will need to set up ngrok. If you already have a public web server able to receive POST calls, you can skip this step and use the URL of your web server’s callback endpoint as the callback URL. To set up ngrok, first create an account on the ngrok website. It is free, and you can use your GitHub or Google account.
Once logged into ngrok, follow the simple instructions available at https://dashboard.ngrok.com/get-started/setup
On Linux, all you need to do is:
Download ngrok
Follow the instructions to add the authentication token
Run this in a terminal:
./ngrok http 5000
This should give you the URL you can use as a callback URL. It should look something like “https://abcd-12-34-56-789.ngrok-free.app”.
Now that we have ngrok set up, we can see how to use it to pull the results of asynchronously uploaded files.
Retrieving asynchronously uploaded files using a callback URL
Import the necessary modules:
from flask import Flask, request
from konfuzio_sdk.data import Project, Document
import threading
from werkzeug.serving import run_simple
Initialize the Project:
project = Project(id_=YOUR_PROJECT_ID)
Create a Flask application:
app = Flask(__name__)
Set the callback URL. You will find this callback URL in the ngrok console where you ran ./ngrok http 5000. It should look like “https://abcd-12-34-56-789.ngrok-free.app”.
callback_url = YOUR_CALLBACK_URL
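Before uploading anything, it can help to sanity-check the value pasted from the ngrok console. The following is a minimal sketch using only the standard library. The helper name is_plausible_callback_url is hypothetical and not part of the Konfuzio SDK, and the check only inspects the shape of the URL, not whether the tunnel is actually reachable.

```python
from urllib.parse import urlparse


def is_plausible_callback_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host name."""
    parsed = urlparse(url)
    return parsed.scheme in ('http', 'https') and bool(parsed.netloc)


# Example value in the format shown above, not a real tunnel.
print(is_plausible_callback_url('https://abcd-12-34-56-789.ngrok-free.app'))
# A host name pasted without its scheme is rejected.
print(is_plausible_callback_url('abcd-12-34-56-789.ngrok-free.app'))
```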
Initialize data structures to share information between the threads. One thread hosts our Flask application and receives the callback responses, and a separate thread sends the files to the Konfuzio Server. We will use the callback_data_dict to store the callback responses. The data_lock will be used to synchronize access to the callback_data_dict, so that we can safely access it from multiple threads.
callback_data_dict = {}
data_lock = threading.Lock()
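To see how these two objects cooperate, here is a minimal, self-contained sketch that uses only the standard library. The names registry, register, deliver and wait_for, the file name example.pdf, and the fake payload are illustrative assumptions. No Konfuzio call is made.

```python
import threading

registry = {}                      # plays the role of callback_data_dict
registry_lock = threading.Lock()   # plays the role of data_lock


def register(file_name):
    """Create an entry with an unset Event before the upload starts."""
    with registry_lock:
        registry[file_name] = {'callback_received': threading.Event(), 'callback_data': None}


def deliver(file_name, payload):
    """Simulate the callback handler: store the payload, then signal the waiter."""
    with registry_lock:
        entry = registry.get(file_name)
        if entry is None:
            return False  # unknown file, nothing to signal
        entry['callback_data'] = payload
        entry['callback_received'].set()
    return True


def wait_for(file_name, results):
    """Block until the Event is set, then read the stored payload."""
    with registry_lock:
        event = registry[file_name]['callback_received']
    event.wait()  # wait outside the lock so deliver() can acquire it
    with registry_lock:
        results[file_name] = registry[file_name]['callback_data']


register('example.pdf')
results = {}
waiter = threading.Thread(target=wait_for, args=('example.pdf', results))
waiter.start()
deliver('example.pdf', {'data_file_name': 'example.pdf'})
waiter.join()
print(results)
```

The lock is held only for short dictionary operations and never while waiting on the Event. Otherwise the thread that delivers the payload could not acquire the lock.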
Now we can create the callback function that will receive the callback responses from the Konfuzio Server. We store the callback response in the callback_data_dict and set the callback_received event. This notifies the thread waiting on that file that the callback response has been received and that the Document can be updated with the new OCR information.
@app.route('/', methods=['POST'])
def callback():
    data = request.json
    file_name = data.get('data_file_name')
    with data_lock:
        if file_name is not None and file_name in callback_data_dict:
            callback_data_dict[file_name]['callback_data'] = data
            callback_data_dict[file_name]['callback_received'].set()
    return '', 200
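To get a feel for what arrives at the callback endpoint, the request can be simulated locally. The sketch below is a stand-in built on the standard library's http.server instead of Flask, so it runs without any third-party package. The payload is reduced to the single data_file_name field used by the handler above. The real callback may carry more fields, and the file name example.pdf is an assumption.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # payloads seen by the stand-in receiver


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and decode the JSON body, similar to request.json in Flask.
        length = int(self.headers.get('Content-Length', 0))
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(('127.0.0.1', 0), CallbackHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Simulated callback request with a JSON body.
body = json.dumps({'data_file_name': 'example.pdf'}).encode('utf-8')
conn = http.client.HTTPConnection('127.0.0.1', server.server_port, timeout=5)
conn.request('POST', '/', body=body, headers={'Content-Type': 'application/json'})
response = conn.getresponse()
status = response.status
response.read()
conn.close()

server.shutdown()
server.server_close()
print(status, received)
```

Sending the same kind of request to http://127.0.0.1:5000/ while the Flask application is running is one way to exercise the real handler before uploading any files.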
Next, create the function that will send the files to the Konfuzio Server. We create a Document object for each file and set the sync parameter to False to indicate that we want to upload the files asynchronously. We also set the callback_url parameter to the callback URL we created earlier.
We then start a thread for each Document to wait for the callback response to be received. Once the callback response for a Document has been received, we can update it with the OCR information.
def update_file(document, file_name):
    print(f'Waiting for callback for {document}')
    callback_data_dict[file_name]['callback_received'].wait()
    print(f'Received callback for {document}')
    document.update()
    assert document.ocr_ready
    print(f'Updated {document} information with OCR results')

def send_files(file_names):
    for file_name in file_names:
        with data_lock:
            callback_data_dict[file_name] = {'callback_received': threading.Event(), 'callback_data': None, 'document': None}
        print(f'Sending {file_name} to Konfuzio servers...')
        document = Document.from_file(file_name, project=project, sync=False, callback_url=callback_url)
        with data_lock:
            callback_data_dict[file_name]['document'] = document

    for file_name in callback_data_dict:
        threading.Thread(target=update_file, args=(callback_data_dict[file_name]['document'], file_name,)).start()
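The update_file function above waits without a time limit, so a callback that never arrives (for example, because the ngrok tunnel was stopped) would leave its thread blocked. One possible refinement is a bounded wait, sketched below with the standard library only. The helper name wait_for_callback and the timeout values are arbitrary assumptions for illustration.

```python
import threading


def wait_for_callback(event, timeout_seconds):
    """Return True if the event was set within the timeout, else False."""
    return event.wait(timeout=timeout_seconds)


arrived = threading.Event()
never_arrives = threading.Event()

# Simulate a callback that arrives shortly after the wait begins.
threading.Timer(0.1, arrived.set).start()

print(wait_for_callback(arrived, timeout_seconds=2.0))        # True
print(wait_for_callback(never_arrives, timeout_seconds=0.2))  # False
```

In update_file, a False result could be used to log the file name and skip the document.update() call instead of blocking.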
Finally, we can start the Flask application and send the files. Add the path to all the files you want to upload.
if __name__ == '__main__':
    thread = threading.Thread(target=lambda: run_simple("0.0.0.0", 5000, app))
    thread.start()
    file_names = ['LIST.pdf', 'OF.jpg', 'FILES.tiff']
    threading.Thread(target=send_files, args=(file_names,)).start()
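As written, the script is likely to keep running after all Documents have been updated, because the server thread does not end on its own. One possible arrangement, which is an assumption and not part of the original tutorial, is to mark the server thread as a daemon and join the worker threads. The sketch below shows the pattern with stand-in functions: serve_until_stopped replaces the run_simple call and worker replaces update_file.

```python
import threading
import time

stop_requested = threading.Event()
processed = []


def serve_until_stopped():
    """Stand-in for the web server loop; the real code calls run_simple here."""
    while not stop_requested.is_set():
        time.sleep(0.01)


def worker(name):
    """Stand-in for update_file; pretends the callback arrived quickly."""
    time.sleep(0.05)
    processed.append(name)


# A daemon thread does not keep the interpreter alive at exit.
server_thread = threading.Thread(target=serve_until_stopped, daemon=True)
server_thread.start()

workers = [threading.Thread(target=worker, args=(name,)) for name in ['a.pdf', 'b.pdf']]
for t in workers:
    t.start()
for t in workers:
    t.join()  # block until every file has been handled

stop_requested.set()
print(sorted(processed))
```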
Conclusion
In this tutorial, we uploaded a large number of files to Konfuzio using asynchronous mode and a webhook callback URL. By combining the Document.from_file method with ngrok to expose a local server, we uploaded files without waiting for backend processing. The callback function notifies us as soon as processing completes, so each Document can be updated with its results right away.
This approach can serve as a foundation for building automated solutions for Document processing and analysis.