Data Layer Concepts¶
The relations between all major Data Layer concepts of the SDK are the following: a Project consists of multiple Documents. Each one of the Documents consists of the Pages and belongs to a certain Category. Text in a Document can be marked by Annotations, which can be multi-line, and where each continuous piece of text contained into an Annotation is a Span. Each Annotation is located within a certain Bbox and is defined by a Label that is a part of one of the Label Sets. An Annotation Set is a list of Annotations that share a Label Set.
For more detailed information on each concept, follow the link on the concept’s name which leads to the automatically generated documentation.
Project is essentially a dataset that contains Documents
belonging to different Categories or not having any Category assigned. To initialize it, call
The Project can also be accessed via the Smartview, with URL typically looking like https://YOUR_HOST/admin/server/document/?project=YOUR_PROJECT_ID.
If you have made some local changes to the Project and want to return to the initial version available at the server, or
if you want to fetch the updates from the server, use the argument
Here are the some of properties and methods of the Project you might need when working with the SDK:
project.documents– training Documents within the Project;
project.test_documents– test Documents within the Project;
project.get_category_by_id(YOUR_CATEGORY_ID).documents()– Documents filtered by a Category of your choice;
project.get_document_by_id(YOUR_DOCUMENT_ID)– access a particular Document from the Project if you know its ID.
Document is one of the files that constitute a Project. It consists of Pages and can belong to a certain Category.
A Document can be accessed by
project.get_document_by_id(YOUR_DOCUMENT_ID) when its ID is known to you; otherwise, it
is possible to iterate through the output of
_documents) to see which
Documents are available and what IDs they have.
The Documents can also be accessed via the Smartview, with URL typically looking like https://YOUR_HOST/projects/PROJECT_ID/docs/DOCUMENT_ID/bbox-annotations/.
Here are some of the properties and methods of the Document you might need when working with the SDK:
document.id_– get an ID of the Document;
document.text– get a full text of the Document;
document.pages()– a list of Pages in the Document;
document.update()– download a newer version of the Document from the Server in case you have made some changes in the Smartview;
document.category()– get a Category the Document belongs to;
document.get_images()– download PNG images of the Pages in the Document; can be used if you wish to use the visual data for training your own models, for example;
Category is a group of Documents united by common feature or type, i.e. invoice or receipt.
To see all Categories in the Project, you can use
To find a Category the Document belongs to, you can use
test_documents under the Category, use
You can also observe all Categories available in the Project via the Smartview: they are listed on the Project’s page in the menu on the right.
Page is a constituent part of the Document. Here are some of the properties and methods of the Page you might need when working with the SDK:
page.text– get text of the Page;
page.spans()– get a list of Spans on the Page;
page.number– get Page’s number, starting from 1.
Span is a part of the Document’s text without the line breaks. Each Span has
end_offset denoting its starting and finishing characters in
To access Span’s text, you can call
span.offset_string. We are going to use it later when collecting the Spans from the Documents.
Annotation is a combination of Spans that has a certain Label (i.e. Issue_Date, Auszahlungsbetrag) assigned to it. They typically denote a certain type of entity that is found in the text. Annotations can be predicted by AI or human-added.
Like Spans, Annotations also have
end_offset denoting the starting and the ending characters. To access the text under the Annotation, call
To see the Annotation in the Smartview, you can call
annotation.get_link() and open the returned URL.
Annotation Set is a group of Annotations united by Labels
belonging to the same Label Set. To see Annotations in the set, call
Label defines what the Annotation is about (i.e. Issue_Date,
Auszahlungsbetrag). Labels are grouped into Label Sets. To see Annotations with a current Label,
Label Set is a group of Labels united. A Label Set can belong to different Categories and multiple Annotation Sets.
Bbox is an area of the Page denoted by four rectangle-like
coordinates. You can access Bboxes of the Document by calling