In the Document Understanding digitization method there is no option to choose how many pages to extract, as there is in the Read PDF activity. Currently it extracts all the information in a PDF using OCR and outputs a string and a DOM, but all the information I want to extract is on the first page.
And when I use the regex extractor, the information I want appears on both the first page and the last page, so its confidence is 50% because the pattern matches two occurrences of the same information.
If there is a way to limit the digitization process to a single page, like Read PDF, it would be really helpful.
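To make the double-match problem concrete, here is a small Python sketch (not UiPath code). The page texts, the `Invoice No` field and the pattern are all made-up placeholders standing in for OCR output; it only shows that the same regex finds two hits over the whole document and one hit when restricted to page 1. It does not model how the regex extractor computes confidence.

```python
import re

# Hypothetical per-page text, standing in for the OCR output of a 3-page PDF.
pages = [
    "Invoice No: INV-1001\nCustomer: Acme",
    "Line items ...",
    "Summary\nInvoice No: INV-1001",
]

# Illustrative pattern; the real field and format will differ.
pattern = re.compile(r"Invoice No:\s*(INV-\d+)")


def find_matches(text_pages):
    """Return (page_number, value) for every regex match, pages numbered from 1."""
    return [
        (page_no, m.group(1))
        for page_no, text in enumerate(text_pages, start=1)
        for m in pattern.finditer(text)
    ]


all_matches = find_matches(pages)              # whole document
first_page_matches = find_matches(pages[:1])   # page 1 only

print(all_matches)         # [(1, 'INV-1001'), (3, 'INV-1001')]
print(first_page_matches)  # [(1, 'INV-1001')]
```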
While it would be handy to have an option to only select a certain page for extraction, Digitize Document isn’t just targeted at PDF files. From a practical standpoint, there may be factors that impact selective behaviour across different file types.
Digitize Document has a specific purpose which it serves as expected.
That being said, your problem is a very easy one to solve. Since you know you always want Page 1 of the PDF, you can use the standard UiPath activity for splitting out pages of a PDF.
You can specify the input filepath and create a separate file which you can use for digitization. (I’d imagine even with an integrated option to use only a single page or page range, the activity would internally need to create a separate pdf anyway. So it’s not much of an overhead, if you think about it.)
I hope this makes sense and that it helps you solve your immediate concern.
Thanks for the suggestion, but even though it solves my problem temporarily, I have 2000 files (maybe more in the future) to process. I cannot create a copy of each of these 2000 files with only the first page, as it would be more time-consuming and would add extra steps to the automation. Still, thanks for the idea; it can be implemented when there is a small number of PDF files to process.
It would be a temporary file, of course, created just to save you the time of digitizing the entire document, which would take longer to process.
You would only need to add two steps in your process:
- Split the PDF and read only the first page
- Delete the file once processing is complete.
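
The two steps above can be sketched outside UiPath as a plain temp-file pattern. This is a minimal Python illustration, not the actual workflow: `process_first_page`, `split_first_page` and `digitize` are hypothetical names, and the two callables stand in for whatever split and digitization activities you use.

```python
import os
import tempfile
from typing import Callable


def process_first_page(
    pdf_path: str,
    split_first_page: Callable[[str, str], None],  # hypothetical: writes page 1 of src to dst
    digitize: Callable[[str], str],                # hypothetical: returns text for a PDF path
) -> str:
    """Digitize only page 1 of pdf_path via a temporary single-page file."""
    # A uniquely named temp file avoids collisions when many files are processed.
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)  # the splitter reopens the path itself
    try:
        split_first_page(pdf_path, tmp_path)  # step 1: write page 1 to the temp file
        return digitize(tmp_path)             # digitize the small file only
    finally:
        os.remove(tmp_path)                   # step 2: always delete the temp file
```

Because the temp file is removed in `finally`, nothing accumulates on disk even if digitization fails, so the same function can be called in a loop over 2000 files without leaving 2000 copies behind.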
It shouldn’t matter how many files need to be processed; there is no manual effort in splitting the PDFs.
Anyway, it was just a suggestion to solve your current problem. If the feature to select a page range makes sense to the UiPath Dev team, you’ll see that addition soon enough.