I need to read PDF text from specific regions of my documents, which usually have more than one page. Regex will not work for me because there are no constant anchors, so I am looking for an alternative. I know I could set a clipping region and read with OCR, but I would prefer the native reading method because the documents are in Greek and OCR is not reliable. OCR is also much slower, and this robot has to run daily on many documents, each with many fields. Is there any way to use "Read PDF Text" for portions of the document?
Hello @Poulos_Spyros,
Are these documents scanned, or are they 'native' PDFs generated from applications?
If that is the case, then you can try out the Document Understanding Framework.
You have a nice tutorial here:
If you don’t want to use that, you can use Microsoft Cognitive Services directly (UiPath uses this product for some of its activities):
You can use Document Understanding to read the files. It uses OCR only when it is needed, so it is not a problem for native PDFs. If the pages come in a fixed order, you can select the one you need; if the pages can come in any order, you can classify them using the keyword classifier.
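To illustrate the idea behind keyword-based page classification, here is a minimal Python sketch. It is not the UiPath keyword classifier, only an approximation of the concept: the function name `classify_page` and the rule dictionary are hypothetical, and the page text is assumed to have been extracted already.

```python
from typing import Dict, List, Optional


def classify_page(page_text: str,
                  keywords_by_type: Dict[str, List[str]]) -> Optional[str]:
    """Return the page type whose keywords occur most often, or None if none match.

    Hypothetical helper for illustration; matching is case-insensitive
    but does not normalize accents.
    """
    text = page_text.casefold()
    best_type: Optional[str] = None
    best_hits = 0
    for page_type, keywords in keywords_by_type.items():
        # Count every occurrence of every keyword for this page type.
        hits = sum(text.count(keyword.casefold()) for keyword in keywords)
        if hits > best_hits:
            best_type, best_hits = page_type, hits
    return best_type
```

With rules such as `{"invoice": ["invoice", "vat"]}`, each page's text is passed through `classify_page` and only the pages of the type you need are kept for further processing.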
If you want to get the text from an entire page in a collection of pages, it will be easy, but if you need to get the text from a section of a specific page, you need to apply some logic to trim the part you need.
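One possible form of that trimming logic, sketched in Python under the assumption that your extraction step gives you each word together with its bounding box on the page. The `Word` type, the `text_in_region` function, and the coordinates are all hypothetical and are not part of any UiPath activity.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """A word and its bounding box in page coordinates (assumed input shape)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def text_in_region(words: List[Word], left: float, top: float,
                   right: float, bottom: float) -> str:
    """Join the words whose boxes lie fully inside the given rectangle."""
    inside = [
        w for w in words
        if w.x0 >= left and w.x1 <= right and w.y0 >= top and w.y1 <= bottom
    ]
    # Approximate reading order: top to bottom, then left to right.
    inside.sort(key=lambda w: (round(w.y0), w.x0))
    return " ".join(w.text for w in inside)
```

Because the region is defined by coordinates rather than by surrounding text, this approach does not need constant anchors, and it works on natively extracted text, so Greek characters are not passed through OCR.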
Please leave a question if you need more information.