Explanations¶
Explanation is discussion that clarifies and illuminates a particular topic. Explanation is understanding-oriented.
Architectural Overview¶
The diagram illustrates the components of a Konfuzio Server deployment. Optional components are represented by dashed lines. The numbers in brackets represent the minimal and maximal container count per component.
OCR Processing¶
After you upload a document, and depending on how your project was set up, the document is automatically queued and processed by our OCR. The breakdown below describes the steps that happen during OCR processing.
Project settings¶
We first look at the Project's settings to see which base settings have been set up for your documents.
Chosen OCR engine: easy or precise
Chosen processing type: OCR, Embedding, or Embedding and OCR
Chosen auto-rotation option (only available for the precise OCR engine): None, Rounded, or Exact
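The settings above can be sketched as a small data structure. This is only an illustration: the class and field names (`ProjectOcrSettings`, `OcrEngine`, `ProcessingType`, `AutoRotation`) are hypothetical and not part of the Konfuzio Server API. Only the option values and the rule that auto-rotation requires the precise engine come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class OcrEngine(Enum):
    EASY = "easy"
    PRECISE = "precise"


class ProcessingType(Enum):
    OCR = "OCR"
    EMBEDDING = "Embedding"
    EMBEDDING_AND_OCR = "Embedding and OCR"


class AutoRotation(Enum):
    NONE = "None"
    ROUNDED = "Rounded"
    EXACT = "Exact"


@dataclass(frozen=True)
class ProjectOcrSettings:
    """Hypothetical container for the OCR-related Project settings."""

    engine: OcrEngine
    processing_type: ProcessingType
    auto_rotation: AutoRotation = AutoRotation.NONE

    def __post_init__(self):
        # Auto-rotation is only available for the precise OCR engine.
        if (
            self.auto_rotation is not AutoRotation.NONE
            and self.engine is not OcrEngine.PRECISE
        ):
            raise ValueError(
                "Auto-rotation is only available for the precise OCR engine"
            )


# A valid combination: precise engine with rounded auto-rotation.
settings = ProjectOcrSettings(
    engine=OcrEngine.PRECISE,
    processing_type=ProcessingType.EMBEDDING_AND_OCR,
    auto_rotation=AutoRotation.ROUNDED,
)
```

Creating the same settings with `OcrEngine.EASY` and any auto-rotation other than `AutoRotation.NONE` raises a `ValueError` in this sketch.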
File pre-processing¶
During file upload, after the Project settings have been evaluated, we look at the file:
We check if the filetype is supported.
We check if the file is valid and/or corrupted.
If the file is corrupted, some repairing is attempted.
We check if the filetype provides embeddings.
We check if the project enforces OCR.
We then conduct OCR on the file.
We check if the image is angled.
We create thumbnails per page.
We create images per page.
OCR Text extraction¶
While evaluating both the Project settings and the file, we also run the OCR text extraction:
We use the chosen engine on the pre-processed file.
If “Embedding and OCR” is chosen, we check internally which processing type is the most suitable and use either Embedding or OCR.
Depending on chosen processing type, some pre-processing may be done:
Convert non-PDF Documents to a PDF, which is then used for the following steps
Convert PDF to text (in case of embeddings)
If PDF corruption is detected, we attempt to repair the PDF as far as we are able to.
If the PDF or TIFF is multi-page (and valid), we split the document into pages and process each page separately.
We check whether auto-rotation was chosen (only when the precise OCR engine is used).
If rounded angle correction was chosen, we rotate the image to the nearest 45/90 degrees.
If exact angle rotation was chosen, we rotate the image by its exact detected angle.
We attempt to extract the text (from OCR, from the embedded text, or from both).
OCR may fail because the text on the document is technically unreadable, or because the file is corrupted or empty and cannot be repaired.
OCR may fail because the engine does not support the language of the text.
Finally, we return the extracted text.
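The difference between the auto-rotation options can be sketched as follows. This is a minimal illustration and not the Server's implementation: the function name `rotation_to_apply` and the `step` parameter are assumptions. The `step` parameter exists because the text above mentions both 45 and 90 degrees without saying when each applies.

```python
def rotation_to_apply(detected_angle: float, mode: str, step: float = 90.0) -> float:
    """Return the angle (in degrees) used for correction under each option.

    Hypothetical helper: `mode` is one of "None", "Rounded" or "Exact",
    matching the auto-rotation options of the Project settings.
    """
    if mode == "None":
        # No auto-rotation: the image is left as it is.
        return 0.0
    if mode == "Exact":
        # Exact: use the detected angle unchanged.
        return float(detected_angle)
    if mode == "Rounded":
        # Rounded: snap the detected angle to the nearest multiple of `step`.
        return step * round(detected_angle / step)
    raise ValueError(f"Unknown auto-rotation mode: {mode!r}")


# A page detected at 87 degrees:
rounded = rotation_to_apply(87.0, "Rounded")  # snaps to a multiple of 90
exact = rotation_to_apply(87.0, "Exact")      # keeps 87
```

With a 90 degree step, small skews such as 3 degrees are left uncorrected by the rounded option in this sketch, while the exact option corrects them.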
Background processes¶
Processes within our Server are split into several tasks, which are distributed among Celery workers. Together, these tasks define the Server's internal workflow. Below, the individual tasks of the Server's workflow are described in the order in which they are triggered. Tasks run in parallel queues, which are grouped in Celery chords. While some tasks in a queue run in parallel, some tasks still depend on others, and the next queue does not start until all tasks in the current queue have finished. More on Celery workflows can be found here: https://docs.celeryq.dev/en/stable/userguide/canvas.html
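The "parallel within a queue, barrier between queues" behaviour can be illustrated without Celery. The sketch below uses only the Python standard library and is a simulation, not how the Server is implemented. The function `run_workflow` and the queue names in the example are hypothetical, while the task names are taken from the Document tasks described on this page.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_workflow(queues, max_workers=4):
    """Run each queue's tasks in parallel; start the next queue only
    after every task of the current queue has finished."""
    log = []
    lock = threading.Lock()

    def run(queue_name, task_name, func):
        result = func()
        with lock:
            log.append((queue_name, task_name, result))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for queue_name, tasks in queues:
            futures = [
                pool.submit(run, queue_name, name, func) for name, func in tasks
            ]
            # Barrier: wait for all tasks of this queue (re-raises task errors).
            for future in futures:
                future.result()
    return log


# Hypothetical queue names; each task is a stand-in returning a dummy value.
example_queues = [
    ("ocr", [("page_ocr", lambda: "text"), ("page_image", lambda: "png")]),
    ("categorize", [("categorize", lambda: "category")]),
    ("extract", [("document_extract", lambda: "annotations")]),
]
workflow_log = run_workflow(example_queues)
```

In this sketch the two tasks of the first queue may finish in either order, but both are always logged before the single task of the second queue starts.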
Queues¶
id | queue-name | description |
---|---|---|
1 | | OCR and post-OCR tasks |
2 | | non-dependent tasks |
3 | | categorization tasks |
4 | | extraction after OCR |
5 | | end queue after OCR and extraction have occurred |
6 | | queue for AI training |
7 | | queue for RAM-intensive AI training |
8 | | queue for AI evaluation |
Celery Tasks¶
Document tasks¶
Series of events & tasks triggered when uploading a Document
Queue | task id | task name | description | default time limit |
---|---|---|---|---|
1, 2 | 1 | page_ocr | Apply OCR to the Document's page(s). | 10 minutes |
1, 2 | 2 | page_image | Create a PNG image for a page. If a PNG was submitted or already exists, it is returned without regeneration. | 10 minutes |
1 | 3 | set_document_text_and_bboxes | Collect the result of the page OCR (OCR from task_id #1) and set text & bboxes. | 1 minute |
3 | 4 | categorize | Categorize the Document. | 3 minutes |
4 | 5 | document_extract | Extract the Document using the AI models linked to the Project. | 60 minutes |
5 | 6 | build_sandwich | Generate the pdfsandwich for a submitted PDF. | 30 minutes |
5 | 7 | generate_entities | Generate entities for a Document, which are shown in the labeling tool. | 60 minutes |
5 | 8 | set_labeling_available | Set the Document available for labeling. | 10 minutes |
5 | 9 | get_hocr | Get hOCR representation for bboxes (bboxes from task_id #3). | 5 minutes |
The overall time to complete all tasks related to a Document (DOCUMENT_WORKFLOW_TIME_LIMIT) is restricted to 2 hours.
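To see how the per-task limits relate to the overall limit, the values from the table can be added up. The snippet below is only arithmetic on the documented defaults: the dictionary name `DOCUMENT_TASK_TIME_LIMITS` and the helper `sequential_worst_case` are hypothetical, and the sum assumes that every task runs back to back, which overstates the real schedule because tasks within a queue run in parallel.

```python
from datetime import timedelta

# Default time limits per task, copied from the table above.
DOCUMENT_TASK_TIME_LIMITS = {
    "page_ocr": timedelta(minutes=10),
    "page_image": timedelta(minutes=10),
    "set_document_text_and_bboxes": timedelta(minutes=1),
    "categorize": timedelta(minutes=3),
    "document_extract": timedelta(minutes=60),
    "build_sandwich": timedelta(minutes=30),
    "generate_entities": timedelta(minutes=60),
    "set_labeling_available": timedelta(minutes=10),
    "get_hocr": timedelta(minutes=5),
}

# Overall limit for all tasks related to one Document.
DOCUMENT_WORKFLOW_TIME_LIMIT = timedelta(hours=2)


def sequential_worst_case(limits):
    """Sum of all limits, as if every task ran one after another."""
    return sum(limits.values(), timedelta())


total = sequential_worst_case(DOCUMENT_TASK_TIME_LIMITS)
exceeds_workflow_limit = total > DOCUMENT_WORKFLOW_TIME_LIMIT
```

The summed per-task limits (189 minutes) are larger than the 2 hour workflow limit, so the workflow limit acts as a separate cap: a Document can hit it even if no single task exceeds its own limit.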
Extraction & Category AI Training¶
Extraction AI¶
Series of events triggered when training an extraction AI
Queue | task id | task name | description | default time limit |
---|---|---|---|---|
2 | 1 | page_image | Create a PNG image for a page. If a PNG was submitted or already exists, it is returned without regeneration. | 10 minutes |
6, 7 | 1 | train_extraction_ai | Start the training of the AI model. | 20 hours |
4 | 2 | document_extract | Extract the Document using the AI models linked to the Project. | 60 minutes |
6, 7, 8 | 3 | evaluate_ai_model | Evaluate the trained AI model's performance. | 60 minutes |
Category AI¶
Series of events triggered when training a Categorization AI
Queue | task id | task name | description | default time limit |
---|---|---|---|---|
8 | 2 | train_category_ai | Start the training of the categorization model. | 10 hours |
8, 3 | 3 | categorize | Run the categorization against all Documents in its category. | 3 minutes |
6, 7, 8, 3 | 4 | evaluate_ai_model | Evaluate the categorization AI model's performance. | 60 hours |