{ "cells": [ { "cell_type": "markdown", "id": "ad59a0eb", "metadata": {}, "source": [ "## Tokenization\n", "\n", "---\n", "\n", "**Prerequisites:**\n", "\n", "- Access to a Project on the Konfuzio Server.\n", "- Data Layer concepts of Konfuzio: Document, Project, Bbox, Span, Label\n", "\n", "**Difficulty:** Easy\n", "\n", "**Goal:** Be familiar with the concept of tokenization and master how different tokenization approaches can be used with Konfuzio.\n", "\n", "---\n", "\n", "### Environment\n", "You need to install the Konfuzio SDK before diving into the tutorial. \\\n", "To get up and running quickly, you can use our Colab Quick Start notebook. \\\n", "\"Open\n", "\n", "As an alternative you can follow the [installation section](../get_started.html#install-sdk) to install and initialize the Konfuzio SDK locally or on an environment of your choice.\n", "\n", "### Introduction\n", "In this tutorial, we will explore the concept of tokenization and the various tokenization strategies available in the Konfuzio SDK. Tokenization is a foundational tool in natural language processing (NLP) that involves breaking text into smaller units called tokens. We will focus on the WhitespaceTokenizer, Label-Specific RegexTokenizer, ParagraphTokenizer, and SentenceTokenizer as different tools for different tokenization tasks. Additionally, we will discuss how to choose the right tokenizer and how to verify that a tokenizer has found all Labels." ] }, { "cell_type": "markdown", "id": "f6c257c9", "metadata": {}, "source": [ "### Whitespace Tokenization\n", "The `WhitespaceTokenizer`, [part of the Konfuzio SDK](https://dev.konfuzio.com/sdk/sourcecode.html#konfuzio_sdk.tokenizer.regex.WhitespaceTokenizer), is a simple yet effective tool for basic tokenization tasks. It segments text into tokens using whitespaces, tabs, and newlines as natural delimiters.\n", "\n", "#### Use case: retrieving the word-level Bounding Boxes for a Document\n", "In this section, we will walk through how to use the `WhitespaceTokenizer` to extract word-level Bounding Boxes for a Document.\n", "\n", "We will use the Konfuzio SDK to tokenize the Document and identify word-level Spans, which can then be visualized or used to extract Bounding Box information." ] }, { "cell_type": "code", "execution_count": null, "id": "6d685a65", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "from tests.variables import TEST_PROJECT_ID, TEST_DOCUMENT_ID, TEST_PAYSLIPS_CATEGORY_ID, TEST_CATEGORIZATION_DOCUMENT_ID, TEST_SNAPSHOT_ID\n", "import logging\n", "logging.getLogger(\"konfuzio_sdk\").setLevel(logging.ERROR)" ] }, { "cell_type": "markdown", "id": "57646414", "metadata": { "lines_to_next_cell": 0 }, "source": [ "First, we import necessary modules:" ] }, { "cell_type": "code", "execution_count": null, "id": "5518e170", "metadata": {}, "outputs": [], "source": [ "from copy import deepcopy\n", "from konfuzio_sdk.data import Project\n", "from konfuzio_sdk.tokenizer.regex import WhitespaceTokenizer" ] }, { "cell_type": "markdown", "id": "e29094bb", "metadata": {}, "source": [ "Next, initialize a Project and a Document instance. The variables `TEST_PROJECT_ID` and `TEST_DOCUMENT_ID` are placeholders that need to be replaced with actual values when running these steps. Make sure to use a Project and a Document to which you have access." 
] }, { "cell_type": "code", "execution_count": null, "id": "26d2a5c8", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "project = Project(id_=TEST_PROJECT_ID, update=True)\n", "document = project.get_document_by_id(TEST_DOCUMENT_ID)" ] }, { "cell_type": "markdown", "id": "d9cdd356", "metadata": { "lines_to_next_cell": 0 }, "source": [ "We create a copy of the Document object to make sure it contains no Annotations. This is needed because during tokenization, new 1-Span-long Annotations are created." ] }, { "cell_type": "code", "execution_count": null, "id": "cc99c2a3", "metadata": {}, "outputs": [], "source": [ "document = deepcopy(document)" ] }, { "cell_type": "markdown", "id": "04e69dc3", "metadata": {}, "source": [ "Then, we tokenize the Document using the Whitespace Tokenizer. It creates new Spans in the Document." ] }, { "cell_type": "code", "execution_count": null, "id": "ac9a37ad", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "tokenizer = WhitespaceTokenizer()\n", "tokenized = tokenizer.tokenize(document)" ] }, { "cell_type": "markdown", "id": "727c7259", "metadata": {}, "source": [ "Now we can visually check that the Bounding Boxes are correctly assigned." ] }, { "cell_type": "code", "execution_count": null, "id": "dfad20c2", "metadata": {}, "outputs": [], "source": [ "tokenized.get_page_by_index(0).get_annotations_image(display_all=True)" ] }, { "cell_type": "markdown", "id": "33305d5d", "metadata": {}, "source": [ "Observe how each individual word is enclosed in a Bounding Box. Also note that there are no Labels in the Annotations associated with the Bounding Boxes, thereby the placeholder 'NO_LABEL' is shown above each Bounding Box.\n", "\n", "Each Bounding Box is associated with a specific word and is defined by four coordinates:\n", "- x0 and y0 specify the coordinates of the bottom left corner;\n", "- x1 and y1 specify the coordinates of the top right corner\n", "\n", "This is used to determine the size and position of the Box on the Page.\n", "\n", "All Bounding Boxes calculated after tokenization can be obtained as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "c7daa349", "metadata": {}, "outputs": [], "source": [ "span_bboxes = [span.bbox() for span in tokenized.spans()]" ] }, { "cell_type": "markdown", "id": "c70ba0e7", "metadata": {}, "source": [ "Let us inspect the first 10 Bounding Boxes' coordinates to verify that each comprises 4 coordinate points." 
] }, { "cell_type": "code", "execution_count": null, "id": "a5cd9a11", "metadata": {}, "outputs": [], "source": [ "span_bboxes[:10]" ] }, { "cell_type": "markdown", "id": "a51ede8b", "metadata": {}, "source": [ "To summarize, here is the full code:" ] }, { "cell_type": "code", "execution_count": null, "id": "f3800037", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "from copy import deepcopy\n", "from konfuzio_sdk.data import Project\n", "from konfuzio_sdk.tokenizer.regex import WhitespaceTokenizer\n", "\n", "project = Project(id_=TEST_PROJECT_ID)\n", "document = project.get_document_by_id(TEST_DOCUMENT_ID)\n", "\n", "document = deepcopy(document)\n", "\n", "tokenizer = WhitespaceTokenizer()\n", "tokenized = tokenizer.tokenize(document)\n", "\n", "tokenized.get_page_by_index(0).get_annotations_image(display_all=True)\n", "\n", "span_bboxes = [span.bbox() for span in tokenized.spans()]\n", "span_bboxes[:10]" ] }, { "cell_type": "markdown", "id": "d50d5b28", "metadata": {}, "source": [ "Note: The variables TEST_PROJECT_ID and TEST_DOCUMENT_ID are placeholders and need to be replaced. Remember to use a Project and Document id to which you have access. " ] }, { "cell_type": "markdown", "id": "20bdee28", "metadata": {}, "source": [ "### Regex Tokenization for Specific Labels\n", "In some cases, especially when dealing with intricate Annotation strings, a custom [RegexTokenizer](https://dev.konfuzio.com/sdk/sourcecode.html#regex-tokenizer) can offer a powerful solution. Unlike the basic `WhitespaceTokenizer`, which split text based on spaces, tabs and newlines, `RegexTokenizer` utilizes regular expressions to define complex patterns for identifying and extracting tokens. This tutorial will guide you through the process of creating and training your own RegexTokenizer, providing you with a versatile tool to handle even the most challenging tokenization tasks. Let's get started!\n", "\n", "In this example, we will see how to find regular expressions that match with occurrences of the \"Lohnart\" (which approximately means _type of salary_ in German) Label in the training data, namely Documents that have been annotated by hand.\n", "\n", "First, we import necessary modules:" ] }, { "cell_type": "code", "execution_count": null, "id": "6d0aa4f9", "metadata": {}, "outputs": [], "source": [ "from konfuzio_sdk.data import Project\n", "from konfuzio_sdk.tokenizer.regex import RegexTokenizer\n", "from konfuzio_sdk.tokenizer.base import ListTokenizer" ] }, { "cell_type": "markdown", "id": "1570ae60", "metadata": {}, "source": [ "Then, we initialize the Project and obtain the Category. We need to obtain an instance of the Category which will be used later to achieve what we need: find the best regular expression matching our Label of interest." ] }, { "cell_type": "code", "execution_count": null, "id": "9ba085a7", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "my_project = Project(id_=TEST_PROJECT_ID)\n", "category = my_project.get_category_by_id(id_=TEST_PAYSLIPS_CATEGORY_ID)" ] }, { "cell_type": "markdown", "id": "14673edf", "metadata": {}, "source": [ "We use the [ListTokenizer](https://dev.konfuzio.com/sdk/sourcecode.html#list-tokenizer), a class that provides a way to organize and apply multiple tokenizers to a Document, allowing for complex tokenization pipelines in natural language processing tasks. In this case, we simply use to hold our regular expression tokenizers." 
] }, { "cell_type": "code", "execution_count": null, "id": "77652600", "metadata": {}, "outputs": [], "source": [ "tokenizer_list = ListTokenizer(tokenizers=[])" ] }, { "cell_type": "markdown", "id": "1fe319e6", "metadata": {}, "source": [ "Retrieve the \"Lohnart\" Label using its name." ] }, { "cell_type": "code", "execution_count": null, "id": "5e5f8fd8", "metadata": {}, "outputs": [], "source": [ "label = my_project.get_label_by_name(\"Lohnart\")" ] }, { "cell_type": "markdown", "id": "c1b26e0a", "metadata": {}, "source": [ "Find regular expressions and create RegexTokenizers. We now use `Label.find_regex` to algorithmically search for the best fitting regular expressions matching the Annotations associated with this Label. Each RegexTokenizer is collected in the container object `tokenizer_list`." ] }, { "cell_type": "code", "execution_count": null, "id": "d7c70f35", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "regexes = label.find_regex(category=category)" ] }, { "cell_type": "markdown", "id": "ec7ba605", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let's see how many and which regexes have been found." ] }, { "cell_type": "code", "execution_count": null, "id": "7712fbed", "metadata": {}, "outputs": [], "source": [ "print(len(regexes))\n", "\n", "for regex in regexes:\n", " print(regex)\n", " regex_tokenizer = RegexTokenizer(regex=regex)\n", " tokenizer_list.tokenizers.append(regex_tokenizer)" ] }, { "cell_type": "markdown", "id": "85f3f8b8", "metadata": {}, "source": [ "Finally, we can use the TokenizerList instance to create new `NO_LABEL` Annotations for each string in the Document matching the regex patterns found." ] }, { "cell_type": "code", "execution_count": null, "id": "ce3607b3", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "document = my_project.get_document_by_id(TEST_DOCUMENT_ID)\n", "document = tokenizer_list.tokenize(document)" ] }, { "cell_type": "markdown", "id": "d433229e", "metadata": {}, "source": [ "To summarize, here is the complete code of our regex tokenization example." ] }, { "cell_type": "code", "execution_count": null, "id": "85703d33", "metadata": { "tags": [ "skip-execution", "nbval-skip" ] }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project\n", "from konfuzio_sdk.tokenizer.regex import RegexTokenizer\n", "from konfuzio_sdk.tokenizer.base import ListTokenizer\n", "\n", "my_project = Project(id_=TEST_PROJECT_ID)\n", "category = my_project.get_category_by_id(id_=TEST_PAYSLIPS_CATEGORY_ID)\n", "\n", "tokenizer_list = ListTokenizer(tokenizers=[])\n", "label = my_project.get_label_by_name(\"Lohnart\")\n", "regexes = label.find_regex(category=category)\n", "\n", "print(len(regexes))\n", "\n", "for regex in regexes:\n", " print(regex)\n", " regex_tokenizer = RegexTokenizer(regex=regex)\n", " tokenizer_list.tokenizers.append(regex_tokenizer)\n", "\n", "document = my_project.get_document_by_id(TEST_DOCUMENT_ID)\n", "document = tokenizer_list.tokenize(document)" ] }, { "cell_type": "markdown", "id": "eaeaf22f", "metadata": {}, "source": [ "### Paragraph Tokenization" ] }, { "cell_type": "markdown", "id": "bd6afd46", "metadata": {}, "source": [ "The `ParagraphTokenizer` [class](https://dev.konfuzio.com/sdk/sourcecode.html#paragraph-tokenizer) is a specialized tool designed to segment a Document into paragraphs. 
The `ParagraphTokenizer` [class](https://dev.konfuzio.com/sdk/sourcecode.html#paragraph-tokenizer) is a specialized tool designed to segment a Document into paragraphs. It offers two modes of operation: `detectron` and `line_distance`.\n", "\n", "To determine the mode of operation, the `mode` constructor parameter is used; it can take two values: `detectron` (default) or `line_distance`. In `detectron` mode, the Tokenizer uses a fine-tuned Detectron2 model to assist in Document segmentation. While this mode tends to be more accurate, it is slower, as it requires an API call to the model hosted on Konfuzio servers. On the other hand, the `line_distance` mode uses a rule-based approach that is faster but less accurate, especially with Documents that have two columns or other complex layouts." ] }, { "cell_type": "markdown", "id": "450d463b", "metadata": {}, "source": [ "#### line_distance Approach\n", "It provides an efficient way to segment Documents based on line heights, making it particularly useful for simple, single-column formats. Although it may have limitations with complex layouts, its swift processing and relatively accurate results make it a practical choice for tasks where speed is a priority and the Document structure isn't overly complicated.\n", "\n", "##### Parameters\n", "The behavior of the `line_distance` approach can be adjusted with the following parameters:\n", "- `line_height_ratio`: (Float) Specifies the ratio of the median line height used as a threshold to create a new paragraph when using the Tokenizer in `line_distance` mode. The default value is 0.8; it is applied as a coefficient to the median vertical character size of the Page. If you find that the Tokenizer is not creating new paragraphs when it should, you can try **lowering** this value. Alternatively, if the Tokenizer is creating too many paragraphs, you can try **increasing** this value.\n", "\n", "- `height`: (Float) This optional parameter allows you to define a specific line height threshold for creating new paragraphs. If set to None, the Tokenizer uses an intelligently calculated height threshold.\n", "\n", "A short sketch showing how these parameters can be passed is included at the end of this subsection.\n", "\n", "Using the `ParagraphTokenizer` in `line_distance` mode boils down to creating a `ParagraphTokenizer` instance and using it to tokenize a Document.\n", "\n", "Let's import the needed modules and initialize the Project and the Document." ] }, { "cell_type": "code", "execution_count": null, "id": "0468b2de", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project\n", "from konfuzio_sdk.tokenizer.paragraph_and_sentence import ParagraphTokenizer\n", "\n", "project = Project(id_=TEST_PROJECT_ID)\n", "document = project.get_document_by_id(TEST_DOCUMENT_ID)" ] }, { "cell_type": "markdown", "id": "a811bdea", "metadata": {}, "source": [ "Initialize the `ParagraphTokenizer` and tokenize the Document." ] }, { "cell_type": "code", "execution_count": null, "id": "17d8a257", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "tokenizer = ParagraphTokenizer(mode='line_distance')\n", "document = tokenizer(document)" ] }, { "cell_type": "markdown", "id": "4c34fea3", "metadata": {}, "source": [ "Then, we fetch the image representing the first page of the Document to visualize the Annotations generated by the tokenizer. Set `display_all=True` to show NO_LABEL Annotations."
] }, { "cell_type": "code", "execution_count": null, "id": "b9d8ea9a", "metadata": {}, "outputs": [], "source": [ "document.get_page_by_index(0).get_annotations_image(display_all=True)" ] }, { "cell_type": "markdown", "id": "8edcab9f", "metadata": {}, "source": [ "Due to the complexity of the Document we processed, the `line_distance` approach does not perform very well. We can see that a simpler Document that has a more linear structure gives better results:" ] }, { "cell_type": "code", "execution_count": null, "id": "f1c7f29f", "metadata": { "lines_to_next_cell": 0, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "from konfuzio_sdk.api import get_project_list\n", "from konfuzio_sdk.data import Project\n", "projects = get_project_list()\n", "TEST_PROJECT_ID = None\n", "while not TEST_PROJECT_ID:\n", " for project in reversed(projects):\n", " if 'ZGF0YV80Ni02NS56aXA=' in project['name']:\n", " TEST_PROJECT_ID = project['id']\n", " break\n", "original_document_text = Project(id_=46).get_document_by_id(44823).text\n", "project = Project(id_=TEST_PROJECT_ID)\n", "TEST_DOCUMENT_ID = [document for document in project.documents if document.text == original_document_text][0].id_\n", "TEST_PAYSLIPS_CATEGORY_ID = project.get_category_by_name('Lohnabrechnung').id_" ] }, { "cell_type": "code", "execution_count": null, "id": "e64faeec", "metadata": { "lines_to_next_cell": 0, "tags": [ "remove-output" ] }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project, Document\n", "from konfuzio_sdk.tokenizer.paragraph_and_sentence import ParagraphTokenizer\n", "from copy import deepcopy\n", "\n", "my_project = Project(id_=TEST_PROJECT_ID)\n", "\n", "sample_doc = Document.from_file(\"sample.pdf\", project=my_project, sync=True)\n", "deepcopied_doc = deepcopy(sample_doc)\n", "tokenizer = ParagraphTokenizer(mode='line_distance')\n", "\n", "tokenized_doc = tokenizer(deepcopied_doc)" ] }, { "cell_type": "code", "execution_count": null, "id": "3a2033bc", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "tokenized_doc.pages()[0].image_width = 1100\n", "tokenized_doc.pages()[0].image_height = 1400" ] }, { "cell_type": "code", "execution_count": null, "id": "d3220cfb", "metadata": {}, "outputs": [], "source": [ "tokenized_doc.get_page_by_index(0).get_annotations_image(display_all=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "b26b07da", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "my_project = Project(id_=TEST_PROJECT_ID, update=True)\n", "for document in my_project._documents:\n", " if document.name == 'sample.pdf':\n", " document.delete(delete_online=True)" ] }, { "cell_type": "markdown", "id": "ced73aac", "metadata": {}, "source": [ "Note that this example uses a PDF file named 'sample.pdf', you can download this file here in case you wish to try running this code." 
] }, { "cell_type": "markdown", "id": "78f3c1e8", "metadata": {}, "source": [ "#### Detectron (CV) Approach\n", "With the Computer Vision (CV) approach, we can create Labels, identify figures, tables, lists, texts, and titles, thereby giving us a comprehensive understanding of the Document's structure.\n", "\n", "Using the Computer Vision approach might require more processing power and might be slower compared to the `line_distance` approach, but the significant leap in the comprehensiveness of the output makes it a powerful tool.\n", "\n", "##### Parameters\n", "- `create_detectron_labels`: (Boolean, default: False) if set to True, Labels will be created and assigned to the Document. Labels may include `figure`, `table`, `list`, `text` and `title`. If this option is set to False, the Tokenizer will create `NO_LABEL` Annotations." ] }, { "cell_type": "markdown", "id": "29c5ec75", "metadata": {}, "source": [ "Using a Tokenizer in `detectron` mode boils down to passing `detectron` ot the `mode` argument of the `ParagraphTokenizer`. \n", "\n", "Make necessary imports, initialize the Project and the Document." ] }, { "cell_type": "code", "execution_count": null, "id": "5ccfc96a", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project\n", "from konfuzio_sdk.tokenizer.paragraph_and_sentence import ParagraphTokenizer\n", "\n", "project = Project(id_=TEST_PROJECT_ID)\n", "document = project.get_document_by_id(TEST_DOCUMENT_ID)" ] }, { "cell_type": "markdown", "id": "b33576a9", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Initialize the tokenizer and tokenize the Document." ] }, { "cell_type": "code", "execution_count": null, "id": "d4b3bede", "metadata": {}, "outputs": [], "source": [ "tokenizer = ParagraphTokenizer(mode='detectron', create_detectron_labels=True)\n", "\n", "tokenized = tokenizer(document)" ] }, { "cell_type": "code", "execution_count": null, "id": "ee757051", "metadata": {}, "outputs": [], "source": [ "tokenized.get_page_by_index(0).get_annotations_image()" ] }, { "cell_type": "markdown", "id": "32b8fb35", "metadata": {}, "source": [ "Comparing this result to tokenizing the same Document with the `line_distance` approach it is evident that the `detectron` mode can extract a meaningful structure from a relatively complex Document." ] }, { "cell_type": "markdown", "id": "15ddfca6", "metadata": { "lines_to_next_cell": 0 }, "source": [ "### Sentence Tokenization\n", "\n", "The `SentenceTokenizer` is a specialized [tokenizer](https://dev.konfuzio.com/sdk/sourcecode.html#sentence-tokenizer) designed to split text into sentences. Similarly to the ParagraphTokenizer, it has two modes of operation: `detectron` and `line_distance`, and accepts the same additional parameters: `line_height_ratio`, `height` and `create_detectron_labels`.\n", "\n", "To use it, import the necessary modules, initialize the Project, the Document, and the Tokenizer and tokenize the Document." 
] }, { "cell_type": "code", "execution_count": null, "id": "90525c57", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import time\n", "document = Document.from_file(path=\"../../../../tests/test_data/textposition.pdf\", project=project, sync=True)\n", "time.sleep(15)\n", "YOUR_DOCUMENT_ID = document.id_" ] }, { "cell_type": "code", "execution_count": null, "id": "91fa44a0", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project\n", "from konfuzio_sdk.tokenizer.paragraph_and_sentence import SentenceTokenizer\n", "from copy import deepcopy\n", "\n", "project = Project(id_=TEST_PROJECT_ID, update=True)\n", "\n", "doc = project.get_document_by_id(YOUR_DOCUMENT_ID)\n", "\n", "tokenizer = SentenceTokenizer(mode='detectron')\n", "\n", "tokenized = tokenizer(doc)" ] }, { "cell_type": "markdown", "id": "361849ef", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Visualize the output:" ] }, { "cell_type": "code", "execution_count": null, "id": "d7c32ef7", "metadata": {}, "outputs": [], "source": [ "tokenized.get_page_by_index(0).get_annotations_image(display_all=True)" ] }, { "cell_type": "markdown", "id": "a970fd49", "metadata": {}, "source": [ "### Choosing the right tokenizer\n", "\n", "When it comes to natural language processing (NLP), choosing the correct tokenizer can make a significant impact on your system's performance and accuracy. The Konfuzio SDK offers several tokenization options, each suited to different tasks:\n", "\n", "- **WhitespaceTokenizer**: Perfect for basic word-level processing. This tokenizer breaks text into chunks separated by whitespaces, tabs and newlines. It is ideal for straightforward tasks such as basic keyword extraction.\n", "\n", "- **Label-Specific RegexTokenizer**: Known as character detection mode on the Konfuzio server, this tokenizer offers more specialized functionality. It uses Annotations of a Label within a training set to pinpoint and tokenize precise chunks of text. It is especially effective for tasks like entity recognition, where accuracy is paramount. By recognizing specific word or character patterns, it allows for more precise and nuanced data processing.\n", "\n", "- **ParagraphTokenizer**: Identifies and separates larger text chunks - paragraphs. This is beneficial when your text’s interpretation relies heavily on the context at the paragraph level.\n", "\n", "- **SentenceTokenizer**: Segments text into sentences. This is useful when the meaning of your text depends on the context provided at the sentence level.\n", "\n", "Choosing the right Tokenizer is a matter of understanding your NLP task, the structure of your data, and the degree of detail your processing requires. By aligning these elements with the functionalities provided by the different tokenizers in the Konfuzio SDK, you can select the best tool for your task." ] }, { "cell_type": "markdown", "id": "409b581c", "metadata": {}, "source": [ "### Verify that a tokenizer finds all Labels\n", "\n", "To help you choose the right tokenizer for your task, it can be useful to try out different tokenizers and see which Spans are found by which tokenizer. The `Label` class provides a method called `spans_not_found_by_tokenizer` that can he helpful in this regard.\n", "\n", "Here is an example of how to use the `Label.spans_not_found_by_tokenizer` method. This will allow you to determine if a RegexTokenizer is suitable at finding the Spans of a Label, or what Spans might have been annotated wrong. 
Here is an example of how to use the `Label.spans_not_found_by_tokenizer` method. This will allow you to determine whether a tokenizer is suited to finding the Spans of a Label, or which Spans might have been annotated incorrectly. Say you have a number of Annotations assigned to the `Austellungsdatum` Label and want to know which Spans would not be found when using the WhitespaceTokenizer. You can follow this example to find all the relevant Spans." ] }, { "cell_type": "code", "execution_count": null, "id": "9e070c61", "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "from konfuzio_sdk.data import Project\n", "from konfuzio_sdk.tokenizer.regex import WhitespaceTokenizer\n", "\n", "my_project = Project(id_=TEST_PROJECT_ID)\n", "category = my_project.categories[0]\n", "\n", "tokenizer = WhitespaceTokenizer()\n", "\n", "label = my_project.get_label_by_name('Austellungsdatum')\n", "\n", "spans_not_found = label.spans_not_found_by_tokenizer(tokenizer, categories=[category])\n", "\n", "for span in spans_not_found:\n", "    print(f\"{span}: {span.offset_string}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "f197552e", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "from konfuzio_sdk.api import delete_project\n", "\n", "for document in my_project.documents + my_project.test_documents:\n", "    document.dataset_status = 0\n", "    document.save_meta_data()\n", "    document.delete(delete_online=True)\n", "response = delete_project(project_id=my_project.id_)\n", "assert response.status_code == 204" ] }, { "cell_type": "markdown", "id": "e1c7a239", "metadata": {}, "source": [ "### Conclusion\n", "In this tutorial, we have walked through the essentials of tokenization. We have shown how the Konfuzio SDK can be configured to use different strategies to chunk your input text into tokens before it is further processed to extract data from it.\n", "\n", "We have seen the `WhitespaceTokenizer`, which splits text into chunks delimited by whitespace. We have seen the more complex `RegexTokenizer`, which can be configured with arbitrary regular expressions to define what counts as a token. Furthermore, we have shown how suitable regular expressions can be found automatically and then used by the `RegexTokenizer`. We have also seen the `Paragraph`- and `SentenceTokenizer`, which split text into paragraphs and sentences respectively, and their different modes of use: `line_distance` for simpler Documents, and `detectron` for Documents with a more complex structure." ] }, { "cell_type": "markdown", "id": "91df83ad", "metadata": {}, "source": [ "### What's Next?\n", "\n", "- Find out how to train a custom Extraction AI model" ] } ], "metadata": { "kernelspec": { "display_name": "konfuzio", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }