With our newest Document Understanding package for cross-platform projects, we’re bringing you PDF processing capabilities and more!
Document Data
To work with documents efficiently, we introduced the notion of Document Data - an object that can be used as input or output for Document Understanding activities. What it contains depends on the activities you use it with: the document type (populated by the new Classify Document Activity), fields (populated by the Extract Document Data Activity), the Text and Document Object Model (populated by the first Document Understanding Activity in the workflow that processes the input file, and used by all other activities), and others. This object gathers all the information you may need about the processed document into one resource, rather than spreading it across multiple output objects. We encourage you to pass it to all Document Understanding Activities and let them modify and populate it - this improves performance, because the document is digitized once (in the background) and the result is reused from then on.
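To illustrate the idea of one object that each activity enriches, here is a minimal Python sketch. It is not the package's API: the `DocumentData` class, its field names, and the `ensure_digitized`, `classify` and `extract` functions are hypothetical stand-ins for the activities, and the results they write are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocumentData:
    """Hypothetical stand-in for the Document Data object."""
    file_path: str
    text: Optional[str] = None            # filled by the first step that digitizes the file
    document_type: Optional[str] = None   # filled by classification
    fields: dict = field(default_factory=dict)  # filled by extraction
    digitize_count: int = 0               # illustration only: counts digitization runs


def ensure_digitized(doc: DocumentData) -> DocumentData:
    """Digitize only if no earlier step has already done so."""
    if doc.text is None:
        doc.text = f"<text of {doc.file_path}>"  # placeholder for real digitization
        doc.digitize_count += 1
    return doc


def classify(doc: DocumentData) -> DocumentData:
    ensure_digitized(doc)
    doc.document_type = "invoice"  # placeholder result
    return doc


def extract(doc: DocumentData) -> DocumentData:
    ensure_digitized(doc)  # reuses the text produced during classification
    doc.fields["total"] = "42.00"  # placeholder result
    return doc


doc = extract(classify(DocumentData("invoice.pdf")))
print(doc.document_type, doc.fields, doc.digitize_count)
```

Because the same object travels through both steps, the digitization placeholder runs only once, which is the behaviour the paragraph above describes.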
Classify Documents with the pre-trained Classification Model
We’re releasing the Classify Document Activity, which lets you use the ML Classification Model to determine the Document Type of a document: simply provide the document as input, and the resulting Document Data will contain the Document Type and its Classification Confidence, which you can then use to select an appropriate Extractor.
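Selecting an extractor from the classification result could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the extractor names, the `EXTRACTORS` mapping, the `MIN_CONFIDENCE` threshold and the `select_extractor` helper are not part of the package.

```python
from typing import Optional

# Hypothetical mapping from document type to an extractor name.
EXTRACTORS = {
    "invoice": "InvoiceExtractor",
    "receipt": "ReceiptExtractor",
}

MIN_CONFIDENCE = 0.8  # assumed threshold; tune it for your own process


def select_extractor(document_type: str, confidence: float) -> Optional[str]:
    """Return an extractor name, or None when the document needs manual handling."""
    if confidence < MIN_CONFIDENCE:
        return None  # classification too uncertain to act on
    return EXTRACTORS.get(document_type)  # None for types without an extractor


print(select_extractor("invoice", 0.93))
print(select_extractor("invoice", 0.41))
print(select_extractor("contract", 0.99))
```

Returning `None` for both low confidence and unknown types keeps the decision in one place; a workflow could route those documents to a person instead.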
Note that this version of the activity only supports the pre-trained classification model - we will add support for custom classification models soon! Besides this, we’re also working on splitting capabilities, which will populate the list of sub-documents of a Document Data - keep an eye out!
Improved validation experience
Besides the “Create Validation Task and Wait” Activity, with this release we also provide two other activities (similar to the ones available in the IntelligentOCR package), namely:
- Create Validation Task (not suspending the workflow)
- Wait for Validation Task and resume (suspending the workflow)
These activities let you use other persistence activities in between: maybe you want to assign the newly created task? Or add a label to it? With both activities available, you can easily do this after creating the validation task!
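The ordering this enables can be sketched in Python as follows. This is only a simulation under stated assumptions: `ValidationTask`, `create_validation_task` and `wait_for_validation_task` are hypothetical names, and the wait step merely marks the task as completed, whereas a real workflow would suspend until a person finishes it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ValidationTask:
    """Hypothetical stand-in for a validation task; not the package's type."""
    title: str
    assignee: Optional[str] = None
    labels: List[str] = field(default_factory=list)
    status: str = "pending"


def create_validation_task(title: str) -> ValidationTask:
    # Returns immediately; the workflow keeps running.
    return ValidationTask(title)


def wait_for_validation_task(task: ValidationTask) -> ValidationTask:
    # A real workflow would suspend here until a person completes the task.
    # This sketch simply marks it completed so the example can run.
    task.status = "completed"
    return task


task = create_validation_task("Validate invoice.pdf")
task.assignee = "reviewer@example.com"  # step placed between create and wait
task.labels.append("invoices")          # another step placed in between
task = wait_for_validation_task(task)
print(task)
```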
Process PDFs in Studio Web
Our latest release brings the following activities for processing PDF files in your automations:
- Extract PDF Text - read all text from a PDF file
- Extract PDF Page Range - generate output PDFs for the specified page range
- Merge PDFs - join multiple PDF files into a single one
- Extract PDF Images - extract charts, graphics, logos and all sorts of images from a PDF file
- Set PDF Password - set a new password on a PDF file or remove the existing one
- Get PDF Page Count - retrieve the number of pages from a PDF
We have worked very hard to deliver all these capabilities and hope they come in handy - looking forward to your feedback!