I am using the Read PDF with OCR activity in my workflow with the UiPath Document OCR engine, but I am unable to get the output in a structured form. (The extracted data should appear at the same position as it does on screen when the file is open.)
Can anyone please help me with this?
In my view, that’s not possible with the Read PDF with OCR activity, because it takes the whole PDF and returns all the content as plain text without preserving its position on screen.
So, there are a few options you can try:
If the requirement is not to extract all the data from the PDF but only specific data from specific positions, I would recommend using UiPath's Document Understanding framework.
Alternatively, if you want to convert everything in the PDF and also retain its position, you can use UiPath activities to open and edit those PDFs in Acrobat Pro DC (Open Application, multiple clicks, etc.), which would convert everything and retain its position as well, so you wouldn’t notice a difference.
Technically, digitization is just the conversion of the PDF to an image and then to text. When the PDF is converted to an image and the text is retrieved from it, the text will not keep the same position it had on the page. You can observe this by writing the output with Write Text File right after your OCR digitization.
Pass your text string on and retrieve the values through regular expressions, or through document extractors such as the Regex extractor or Form extractor, depending on your license plan. Since you have not mentioned the objective, I am just providing suggestions. Hope this helps.
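If the text file ends up being processed outside UiPath, the regex route mentioned above could look roughly like the Python sketch below. The sample text, the field labels, and the `extract_fields` helper are all hypothetical and only illustrate the idea; the patterns would need to be adapted to whatever the OCR output actually contains.

```python
import re

# Hypothetical OCR output; in practice this would be the text saved by
# Write Text File after the Read PDF with OCR activity.
ocr_text = """Invoice No: INV-1042
Date: 12/03/2021
Total Amount: 1,250.00
"""

# Hypothetical field patterns; adjust the labels to match your documents.
FIELD_PATTERNS = {
    "invoice_no": r"Invoice\s*No\s*:\s*(\S+)",
    "date": r"Date\s*:\s*(\d{2}/\d{2}/\d{4})",
    "total": r"Total\s*Amount\s*:\s*([\d,]+\.\d{2})",
}


def extract_fields(text, patterns):
    """Return the first match for each pattern, or None when a field is missing."""
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(ocr_text, FIELD_PATTERNS)
print(fields)
```

Because the patterns key on labels rather than on screen position, this works on the flat text the OCR activity returns, but it will miss fields whose labels are misread by the OCR engine.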
Thanks for your suggestion, but I am not using Document Understanding. I need the data from whole PDF files extracted to a text file, which I will then use from other programming languages.