Process PDF Files, Classify Documents & more with new Document Understanding Activities in Studio Web

Monica_Secelean · March 28, 2023, 9:00am

With our newest Document Understanding package for cross-platform projects, we’re bringing you PDF processing capabilities and more!

Document Data
To make working with documents more efficient, we introduced the notion of Document Data: an object that can be used as input or output for Document Understanding activities. It contains all the information about the document, populated according to the activities you use it with:

document type (populated by the new Classify Document Activity)
fields (populated by the Extract Document Data Activity)
Text and Document Object Model (populated by the first Document Understanding Activity in the workflow that processes the input file, and used by all other activities)
and others

This object gathers all the information you may require about the processed document into one resource, rather than spreading it across multiple output objects. We encourage you to pass it to all Document Understanding Activities and let them modify and populate it. This improves performance, because the document is digitized once (in the background) and the result is reused from then on.
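As a rough illustration of the "digitize once, reuse everywhere" idea, here is a minimal Python sketch. It is not the actual Document Data type or any UiPath API: `DocumentData`, `digitize`, `classify` and `extract` are hypothetical names, and their bodies are placeholders that only show how one shared object lets later steps skip repeated digitization.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocumentData:
    """Hypothetical stand-in for one object that carries everything known about a document."""
    file_path: str
    text: Optional[str] = None            # filled by the first step that processes the file
    document_type: Optional[str] = None   # filled by classification
    fields: dict = field(default_factory=dict)  # filled by extraction


digitize_calls = 0  # counts how often the (expensive) digitization actually runs


def digitize(doc: DocumentData) -> DocumentData:
    """Placeholder digitization: does work only if the text is not already present."""
    global digitize_calls
    if doc.text is None:
        digitize_calls += 1
        doc.text = f"<text of {doc.file_path}>"
    return doc


def classify(doc: DocumentData) -> DocumentData:
    """Placeholder classification: needs the text, then records a document type."""
    digitize(doc)
    doc.document_type = "invoice"  # illustrative value, not a real prediction
    return doc


def extract(doc: DocumentData) -> DocumentData:
    """Placeholder extraction: reuses the text that an earlier step already produced."""
    digitize(doc)
    doc.fields["text_length"] = len(doc.text)
    return doc


# The same object is passed from step to step, so digitization runs only once.
doc = extract(classify(DocumentData("invoice.pdf")))
```

After both steps, `digitize_calls` is 1: the second step finds the text already on the object and does not digitize again.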

Classify Documents with the pre-trained Classification Model
We’re releasing the Classify Document Activity, which lets you use the ML Classification Model to determine the Document Type of a document: simply provide the document as input, and the resulting Document Data will contain the Document Type and its Classification Confidence, which you can then use to select an appropriate Extractor.
Note that this version of the activity supports only the pre-trained classifier model; we will add support for custom classification models soon! Besides this, we’re also working on splitting capabilities, which will populate the list of sub-documents of a Document Data. Keep an eye out!
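The idea of using the Document Type and Classification Confidence to pick an Extractor could be sketched as follows. This is only an assumption-laden illustration in Python, not how the activity is configured: the `EXTRACTORS` mapping, the extractor names, the 0 to 1 confidence scale and the 0.8 threshold are all hypothetical.

```python
from typing import Optional

# Hypothetical mapping from a document type to the extractor that handles it.
EXTRACTORS = {
    "invoice": "InvoiceExtractor",
    "receipt": "ReceiptExtractor",
}


def select_extractor(document_type: str, confidence: float,
                     threshold: float = 0.8) -> Optional[str]:
    """Return an extractor name, or None when no automatic choice should be made.

    None is returned when the confidence is below the (illustrative) threshold
    or when no extractor is registered for the document type.
    """
    if confidence < threshold:
        return None
    return EXTRACTORS.get(document_type)


chosen = select_extractor("invoice", 0.95)
```

A `None` result is a natural point to route the document to manual review instead of extracting with the wrong model.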

Improved validation experience
Besides the “Create Validation Task and Wait” Activity, with this release we also provide 2 other activities (similar to the ones available in the IntelligentOCR package), namely:

Create Validation Task (not suspending the workflow)
Wait for Validation Task and resume (suspending the workflow)
These activities let you use other persistence activities in between: maybe you want to assign the newly created task? Or add a label to it? With the two separate activities, you can easily do this after creating the validation task!
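The ordering this enables (create, then adjust the task, then wait) can be pictured with a small Python sketch. Everything here is hypothetical: `ValidationTask`, `create_validation_task` and `wait_for_validation_task` are stand-ins rather than the real activities, and the "wait" step merely simulates a person completing the task instead of suspending a workflow.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ValidationTask:
    """Hypothetical in-memory model of a validation task."""
    document: str
    assignee: Optional[str] = None
    labels: list = field(default_factory=list)
    completed: bool = False


def create_validation_task(document: str) -> ValidationTask:
    """Create the task and return immediately (nothing is suspended here)."""
    return ValidationTask(document)


def wait_for_validation_task(task: ValidationTask) -> ValidationTask:
    """Placeholder for the suspending step; here the human validation is simulated."""
    task.completed = True
    return task


task = create_validation_task("invoice.pdf")
task.assignee = "reviewer@example.com"   # step in between: assign the new task
task.labels.append("urgent")             # step in between: add a label
task = wait_for_validation_task(task)    # only now would the workflow suspend
```

With a single combined "create and wait" step there would be no point at which the assignment and label could be applied before the wait begins.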

Process PDFs in Studio Web
Our latest release brings with it the following activities, meant to process your PDF files in automations, allowing you to:

Extract PDF Text - read all text from a PDF file
Extract PDF Page Range - generate output PDFs for the specified page range
Merge PDFs - join multiple PDF files into a single one
Extract PDF Images - extract charts, graphics, logos and all sorts of images from a PDF file
Set PDF Password - set a new password on a PDF file or remove the existing one
Get PDF Page Count - retrieve the number of pages from a PDF

We have worked very hard to deliver all these capabilities & hope they come in handy - looking forward to your feedback!

zell12 · March 28, 2023, 6:43pm

@Monica_Secelean this is cool! i really liked the idea of consolidating document data into one object which contains all pertinent information about the document being processed!

Regarding the persistence activities, isn’t this already how the document validation activities are (Create Document Validation Task, Wait Document Validation Task) or is this for different purpose or type of custom validation forms?

Monica_Secelean · April 6, 2023, 12:36pm

You are right, @zell12, with regard to the Validation Activities. The only note I’d make here is that we now provide 3 (one combining the 2 you mentioned, for simpler use cases).

oscar · April 6, 2023, 4:51pm

Hey @Monica_Secelean, great new additions to Document Understanding.

Just want to report an issue I have with the Digitize Document activity in normal Studio (not Studio Web). A PDF with text properly shown is incorrectly extracted by the Digitize Document activity and it makes a mess of the extracted contents, for example, a line that should say “Purchase Order 123456” is extracted as “PuInrchavs Orodeic12e345”. Basically it gets the text all wrong even though it is properly extracted in the “Read PDF” activity. Should I submit a ticket to UiPath technical support or is there anywhere else I can report this issue?

Thanks.

Monica_Secelean · April 7, 2023, 1:49pm

Hello @oscar
Thanks for reaching out! It’s best to report the issue & please make sure to also provide us details about the used package versions and ideally also the PDF, so that we are able to reproduce the issue.

Looking forward to hearing from you,
Monica
