Explanations¶
Explanation is discussion that clarifies and illuminates a particular topic. Explanation is understanding-oriented.
Architectural Overview¶
The diagram illustrates the components of a Konfuzio Server deployment. Optional components are represented by dashed lines. The numbers in brackets represent the minimal and maximal container count per component.
OCR Processing¶
After you upload a Document, it is automatically queued and processed by our OCR, depending on how your Project is set up. The breakdown below walks through the steps that happen during OCR processing.
Project settings¶
We first look at the Project's settings to see which base settings have been configured for your Documents.
Chosen OCR engine: easy, precise
Chosen processing type: OCR, Embedding, Embedding and OCR
Chosen auto-rotation option (only available for the precise OCR engine): None, Rounded, Exact
File pre-processing¶
During file upload, after the Project settings have been evaluated, we look at the file:
We check if the filetype is supported.
We check whether the file is valid or corrupted.
If the file is corrupted, we attempt to repair it.
We check if the filetype provides embeddings.
We check if the project enforces OCR.
We then conduct OCR on the file.
We check if the image is angled.
We create thumbnails per page.
We create images per page.
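The checks for embeddings and enforced OCR above can be pictured as a small decision function. This is a minimal sketch, not the Server's actual implementation: the names `ProcessingType` and `choose_text_source` are hypothetical, and the rule used for "Embedding and OCR" (prefer embedded text when the file has any) is an assumption, since the suitability check is only described as happening internally.

```python
from enum import Enum


class ProcessingType(Enum):
    """The three processing types offered in the Project settings."""
    OCR = "OCR"
    EMBEDDING = "Embedding"
    EMBEDDING_AND_OCR = "Embedding and OCR"


def choose_text_source(processing_type: ProcessingType, has_embedded_text: bool) -> str:
    """Return "ocr" or "embedding" for a file (hypothetical helper)."""
    if processing_type is ProcessingType.OCR:
        # The Project enforces OCR, so any embedded text is ignored.
        return "ocr"
    if processing_type is ProcessingType.EMBEDDING:
        return "embedding"
    # "Embedding and OCR": ASSUMPTION - prefer embedded text when present,
    # otherwise fall back to OCR. The real suitability check may differ.
    return "embedding" if has_embedded_text else "ocr"
```

For example, `choose_text_source(ProcessingType.EMBEDDING_AND_OCR, has_embedded_text=False)` falls back to OCR under this assumption.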
OCR Text extraction¶
While the Project settings and the file are being evaluated, we also run the OCR extraction:
We use the chosen engine on the pre-processed file.
If “Embedding and OCR” is chosen, we check internally which processing type is the most suitable and use either Embedding or OCR.
Depending on the chosen processing type, some pre-processing may be done:
Convert non-PDF Documents to a PDF, which is then used for the following steps
Convert PDF to text (in case of embeddings)
If PDF corruption is detected, we attempt to repair the PDF as far as we are able to
If the PDF or TIFF has multiple pages (and is valid), we split the Document into pages and process each page separately
We check whether auto-rotation was chosen when the precise OCR engine is used:
If rounded angle correction was chosen, we rotate the image to the nearest 45/90 degrees.
If exact angle rotation was chosen, we rotate the image by its exact detected rotation angle.
We attempt to extract the text (from OCR, from the embedded text, or from both).
OCR may fail because the text on the Document is technically unreadable, or because the file is corrupted or empty and cannot be repaired.
OCR may fail because the engine does not support the language of the text.
Finally, we return you the extracted text.
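The difference between the Rounded and Exact auto-rotation options can be illustrated with a short calculation. This is a sketch under stated assumptions: `correction_angle` is a hypothetical helper, the detected angle is assumed to be given in degrees, and the rounding step (90 by default, 45 as an alternative) is an assumption based on the "nearest 45/90 degrees" wording above.

```python
def correction_angle(detected: float, mode: str, step: int = 90) -> float:
    """Return the rotation (in degrees) to apply for an auto-rotation mode.

    detected: rotation angle detected on the page image, in degrees.
    mode:     "none", "rounded" or "exact" (mirrors the Project setting).
    step:     ASSUMPTION - granularity used for "rounded" (45 or 90).
    """
    if mode == "none":
        return 0.0
    if mode == "exact":
        # Use the detected angle as-is.
        return float(detected)
    if mode == "rounded":
        # Snap to the nearest multiple of `step`, wrapped into [0, 360).
        return float(round(detected / step) * step % 360)
    raise ValueError(f"unknown auto-rotation mode: {mode!r}")
```

With these assumptions, a page detected at 87.3 degrees is corrected by 90 degrees in rounded mode and by 87.3 degrees in exact mode.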
Background Processes¶
Background processes on the Server are distributed across Celery workers as a set of tasks, which together define the Server's internal workflow. The individual tasks of this workflow are described below in the order in which they are triggered. Tasks run in parallel queues, which are grouped in Celery chords. Some tasks within a queue run in parallel, while others depend on earlier tasks, and the next queue does not start until all tasks in the current queue have finished. More on Celery workflows can be found here: https://docs.celeryq.dev/en/stable/userguide/canvas.html
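The "no next queue starts until all tasks have finished" behaviour can be illustrated without Celery. The sketch below is a standard-library simulation of that barrier, not the Server's code and not the Celery API: `run_stage` and `run_workflow` are hypothetical names, and threads stand in for Celery workers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Task = Callable[[str], str]


def run_stage(tasks: Sequence[Task], payload: str) -> List[str]:
    """Run all tasks of one stage in parallel and wait for every result."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task, payload) for task in tasks]
        # result() blocks, so this returns only once all tasks are done.
        return [future.result() for future in futures]


def run_workflow(stages: Sequence[Tuple[str, Sequence[Task]]], payload: str):
    """Run stages one after another; each stage acts as a barrier."""
    log = []
    for name, tasks in stages:
        log.append((name, run_stage(tasks, payload)))
    return log
```

In Celery itself, the same pattern is expressed with a chord: a group of tasks that run in parallel, followed by a callback that runs only after the whole group has completed.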
Queues¶
| queue-name | description |
|---|---|
| | ocr tasks |
| | image generation tasks |
| | splitting tasks |
| | categorization tasks |
| | extraction tasks |
| | detectron tasks |
| | end queue after OCR and extraction has occurred |
| | queue for AI training |
| | queue for RAM intensive AI training |
| | queue for AI evaluation |
Note that all soft time limits are set relatively low and may be unsuitable for processing larger Document collections with hundreds of files. If you encounter errors about exceeded soft time limits, you can refer to the numbers in Time limits for background tasks and increase them.
Reset the Queue¶
On self-hosted Konfuzio installations, all queues can be reset by running redis-cli FLUSHALL.
Please be aware that you usually do not want to do this, as it will cause Documents and AIs to be stuck in their current status.
Celery Tasks¶
Document tasks¶
Series of events & tasks triggered when uploading a Document
| Queue | task name | description | default time limit |
|---|---|---|---|
| ocr, local_ocr | page_ocr | Apply OCR to the Document's page(s). | 10 minutes |
| processing | page_image | Create a PNG image for a page. If a PNG was submitted or already exists, it is returned without regeneration. | 10 minutes |
| detectron | page_detectron | Optional: Run detectron on a page. | 10 minutes |
| detectron | document_finalize_detectron | Optional: Collect per-page detectron results. | 10 minutes |
| ocr | set_document_text_and_bboxes | Collect the results of the per-page OCR and set text & bboxes. | 3 minutes |
| split | document_propose_split | Optional: Split the Document. | 60 minutes |
| categorize | document_categorize | Optional: Categorize the Document. | 60 minutes |
| extract | document_extract | Extract the Document using the AI models linked to the Project. | 60 minutes |
| finalize | build_sandwich | Optional: Generate the pdfsandwich for a submitted PDF. | 30 minutes |
| finalize | generate_entities | Generate entities for a Document, which are shown in the labeling tool. | 60 minutes |
The overall time to complete all tasks related to a Document (DOCUMENT_WORKFLOW_TIME_LIMIT) is restricted to 2 hours.
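To see how the per-task limits relate to the overall limit, the defaults from the table above can be added up. This is a worked calculation only: the dictionary restates the documented defaults, the constant name `DOCUMENT_WORKFLOW_TIME_LIMIT_MIN` is a hypothetical name for the 2-hour limit expressed in minutes, and the sum assumes each task runs once and strictly one after another, which overstates the real workflow because some tasks run in parallel and others run once per page.

```python
# Default per-task time limits in minutes, copied from the table above.
DEFAULT_LIMITS_MIN = {
    "page_ocr": 10,
    "page_image": 10,
    "page_detectron": 10,
    "document_finalize_detectron": 10,
    "set_document_text_and_bboxes": 3,
    "document_propose_split": 60,
    "document_categorize": 60,
    "document_extract": 60,
    "build_sandwich": 30,
    "generate_entities": 60,
}

# The overall limit of 2 hours, expressed in minutes (hypothetical name).
DOCUMENT_WORKFLOW_TIME_LIMIT_MIN = 2 * 60


def sequential_worst_case(limits: dict) -> int:
    """Sum of all limits, assuming each task runs once and sequentially."""
    return sum(limits.values())


worst_case = sequential_worst_case(DEFAULT_LIMITS_MIN)
exceeds_overall = worst_case > DOCUMENT_WORKFLOW_TIME_LIMIT_MIN
```

Because this sum (313 minutes) is larger than the 120-minute overall limit, a Document can hit the overall limit even when no single task exceeds its own limit.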
Extraction, Category & Splitting AI Training¶
Extraction AI¶
Series of events triggered when training an extraction AI
| Queue | task name | description | default time limit |
|---|---|---|---|
| training, training_heavy | train_extraction_ai | Start the training of the AI model. | 20 hours |
| extract | document_extract | Extract the Document using the AI models linked to the Project. | 60 minutes |
| training, training_heavy | evaluate_ai_model | Evaluate the trained AI model's performance. | 60 minutes |
Category AI¶
Series of events triggered when training a Categorization AI
| Queue | task name | description | default time limit |
|---|---|---|---|
| training, training_heavy | train_category_ai | Start the training of the categorization model. | 20 hours |
| categorize | document_categorize | Run the categorization against all Documents in its Category. | 60 minutes |
| training, training_heavy | evaluate_ai_model | Evaluate the categorization AI model's performance. | 60 minutes |
Splitting AI¶
Series of events triggered when training a Splitting AI
| Queue | task name | description | default time limit |
|---|---|---|---|
| training, training_heavy | train_splitting_ai | Start the training of the splitting model. | 20 hours |
| split | document_propose_split | Run the splitting against all Documents in its Category. | 60 minutes |
| training, training_heavy | evaluate_ai_model | Evaluate the splitting AI model's performance. | 60 minutes |
Security¶
We prioritize the security of our software and the data it manages. Whether you are using our SaaS solution or deploying our software on-premise, we have implemented a range of security measures to ensure the integrity and confidentiality of your data.
Below are some of the key security features and best practices we have integrated:
Non-Root Containers¶
Running containers as a non-root user is a best practice in container security. By default, our Docker container runs as a non-root user. This minimizes the potential damage that can be caused by vulnerabilities or malicious attacks, as the container processes will have limited permissions on the host system.
Read-Only Filesystem¶
Our Docker container is configured with a read-only filesystem. This means that once the container is up and running, no new files can be written to the filesystem, and existing files cannot be modified. This significantly reduces the risk of malicious modifications to the software or its configuration. If there is a need to modify configurations or add files, it should be done before the container starts or by using Docker volumes.
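A quick way to confirm both properties from inside a running container is a small self-check. This is a minimal sketch using only the Python standard library. The helper names `runs_as_root` and `is_writable` are hypothetical and are not part of Konfuzio Server.

```python
import os
import tempfile


def runs_as_root() -> bool:
    """True if the current process has root privileges (POSIX only)."""
    # os.geteuid is not available on every platform, hence the guard.
    return hasattr(os, "geteuid") and os.geteuid() == 0


def is_writable(directory: str) -> bool:
    """True if a temporary file can be created inside `directory`."""
    try:
        # The file is removed automatically when the block exits.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers read-only filesystems, missing directories and
        # missing permissions.
        return False
```

In a container set up as described, `runs_as_root()` would be expected to return `False`, and `is_writable` would be expected to return `True` only for paths backed by a mounted volume or another explicitly writable mount.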
Image Scanning with Grype¶
Grype is a vulnerability scanner for container images and filesystems. We have integrated Grype to regularly scan our Docker images for known vulnerabilities. This helps us keep our software up to date with the latest security patches. Please contact us if you are interested in our Grype configuration.
Separated Environments¶
To ensure the stability, security, and quality of our software, we maintain distinct environments for different stages of our software development life cycle:
Development Environment: This is where our software is initially built and tested by developers. It is isolated from production data and systems to prevent unintended disruptions or exposures of confidential data.
Testing Environment: After initial development, changes are moved to our testing environment. This environment is dedicated to rigorous testing procedures, including automated tests, integration tests, and security assessments, ensuring that the software meets our quality and security standards.
Staging Environment: Before deploying updates to our production environment, changes are first deployed to our staging environment. This allows us to test new features and patches in a controlled setting that closely mirrors our production environment.
Production Environment: This is the live environment where our software serves real users. We ensure that only thoroughly tested and vetted code reaches this stage.
Reporting Security Concerns¶
If you discover a potential security issue or vulnerability in our software, please contact us immediately at konfuzio.com/support. We take all reports seriously and will investigate promptly.