This topic covers the improvements in Document Understanding in depth. To read about other products, see the main topic here.
In this release, following your feedback, we focused on improving the accessibility of the Document Understanding and PDF activities in Studio, as well as the experience of the Field Level Rule feature (details on configuring the rules here, and on using them in the Validation Station here).
Right-to-left Language support
We have added support for right-to-left languages like Arabic, Hebrew, and Persian. This feature provides improved accuracy and efficiency in data extraction, streamlining document processing workflows for users who work with right-to-left languages.
Updates on Business Rules
Mathematical Formula Field Rules
With this release, we have added a new rule type that lets you define mathematical formulas for both simple fields and column fields of type Number, referencing other number fields or number values. A formula can combine one or more of the following:
- Field: one of the following:
  - a simple field of type Number
  - a column field of type Number
  - a fixed value (provided by the user)
- Mathematical Operator: +, *, -
- Grouping Operator: (, )
These can be combined to model use cases such as:
- Total > 100
- Total = Subtotal + Delivery - Discount
- Line Amount = Unit Price * Qty (all three are column fields, and the rule applies to each row of the table)
- Total Discount = sum(Discount Value)
- Total Price = sum(Unit Price * Qty)
- Total Price = sum(Line Amount) + Tax - Total Discount
- And many more. If we are missing anything, do shout out, and keep watching: more rules are on their way.
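To make the semantics of these rules concrete, here is a minimal Python sketch that checks three of the use cases above against a made-up extraction result. It is only an illustration and not how the product evaluates rules: the `simple` and `rows` data structures, the function names, and all values are assumptions invented for this example.

```python
from decimal import Decimal

# Hypothetical extraction result: simple fields plus one table of column fields.
simple = {
    "Subtotal": Decimal("90.00"),
    "Delivery": Decimal("15.00"),
    "Discount": Decimal("5.00"),
    "Total": Decimal("100.00"),
    "Total Price": Decimal("90.00"),
}
rows = [
    {"Unit Price": Decimal("20.00"), "Qty": Decimal("3"), "Line Amount": Decimal("60.00")},
    {"Unit Price": Decimal("15.00"), "Qty": Decimal("2"), "Line Amount": Decimal("30.00")},
]


def check_total(fields):
    """Total = Subtotal + Delivery - Discount (simple fields only)."""
    return fields["Total"] == fields["Subtotal"] + fields["Delivery"] - fields["Discount"]


def check_line_amounts(table):
    """Line Amount = Unit Price * Qty, evaluated once per table row."""
    return [row["Line Amount"] == row["Unit Price"] * row["Qty"] for row in table]


def check_total_price(fields, table):
    """Total Price = sum(Unit Price * Qty), aggregating over all rows."""
    computed = sum((row["Unit Price"] * row["Qty"] for row in table), Decimal("0"))
    return fields["Total Price"] == computed


print(check_total(simple))
print(check_line_amounts(rows))
print(check_total_price(simple, rows))
```

The sketch uses `Decimal` rather than floats so that currency-style values compare exactly. A rule over column fields yields one result per row, while a `sum(...)` rule collapses the column into a single value that is compared with a simple field.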
Automatically applied rules in the Validation Station
Remember the Field Level Business Rules feature we previewed a while back, which checks the extraction against predefined rules in the Validation Station? Until now, the rules were verified only when the validation session was submitted. With this release, they are applied automatically, so you can see the results right away, which reduces the time spent validating documents.
Enhanced the Form Extractor page-matching algorithm
Until now, for the Form Extractor to correctly extract data from a document, the document pages needed to be in the same order in which the Template was configured. With this release, we have enhanced the algorithm: it now uses the "page matching info" to identify each page and match it to the corresponding page of the document received as input to the activity. In this way, we rely on the exact matching info instead of the page order when identifying and extracting data, which improves extraction results even for scanned documents whose pages do not follow a particular order.
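The general idea of matching pages by content rather than by position can be illustrated with a toy Python sketch. These notes do not describe the actual algorithm or the format of the page matching info, so everything here is an assumption: pages are modeled as sets of words, the hypothetical `match_pages` function is invented for this example, and the greedy word-overlap score is a stand-in technique.

```python
def match_pages(template_pages, document_pages):
    """Map each template page index to the best-matching document page index.

    Both arguments are lists of word sets. Each template page is greedily
    assigned to the unused document page that shares the most words with it.
    """
    assignment = {}
    used = set()
    for t_idx, anchors in enumerate(template_pages):
        best_idx, best_score = None, 0
        for d_idx, words in enumerate(document_pages):
            if d_idx in used:
                continue
            score = len(anchors & words)  # number of shared anchor words
            if score > best_score:
                best_idx, best_score = d_idx, score
        if best_idx is not None:
            assignment[t_idx] = best_idx
            used.add(best_idx)
    return assignment


# Hypothetical two-page template and a scan whose pages arrived swapped.
template_pages = [
    {"invoice", "number", "bill"},
    {"terms", "conditions", "signature"},
]
document_pages = [
    {"terms", "and", "conditions", "signature", "date"},
    {"invoice", "number", "12345", "bill", "to"},
]

print(match_pages(template_pages, document_pages))
```

In this example, template page 0 is matched to document page 1 and template page 1 to document page 0, so extraction would still target the right page even though the scan is out of order.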
Dataset Size Calculator in Semi-structured AI Document Types
This is a new functionality for Dataset Diagnostics, which can be accessed by clicking the Dataset Health indicator in the top bar of the Document Manager.
There is a new tab called Calculator in the Dataset Diagnostics dialog. On this tab, you can see an up-to-date estimate of the dataset size required for a given Document Type. The numbers of fields of all three types are automatically populated based on the schema in the Document Type and on the Out-of-the-Box Document Type selected in the top-left dropdown.
Note that you must select the number of Layouts yourself from the bottom-right dropdown.
Benefits: Users can adjust the Out-of-the-Box Document Type they want to train on, as well as the number of Languages or Layouts, and see how that affects the dataset size required for a high-performing Extractor.