I have a requirement to extract particular fields from multiple pages of a PDF file. I believe this counts as unstructured input, since the PDF is scanned.
I tried the following:
Read PDF with OCR
Extraction is difficult because the PDF is laid out in columns, so each label and its value end up separated by several lines of unrelated data.
Get Text with OCR (using Anchor)
I can extract fields from the 1st page of the PDF (though not always, as the font size varies between scanned PDFs), but I cannot extract from the other pages because they are not visible on screen.
@sushildarveshi For templates that were digital (not scanned), I used the Read PDF Text activity and extracted the required data with Regex string manipulation. Non-digital templates are out of scope until the client procures OCR licenses.
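To illustrate the Regex step on text read from a digital PDF, here is a minimal sketch in Python rather than a UiPath workflow. The labels ("PO Number:", "Order Date:", "Total:") and the value formats are assumptions made up for the example, not taken from any real template. The patterns avoid Python-only syntax, so they would probably carry over to a .NET regex with little change.

```python
import re

# Hypothetical labels and value formats; adjust them to the actual template.
FIELD_PATTERNS = {
    "po_number": re.compile(r"PO Number:\s*(\S+)"),
    "order_date": re.compile(r"Order Date:\s*(\d{2}/\d{2}/\d{4})"),
    "total": re.compile(r"Total:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return the first match for each field, or None when the label is absent."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    # Stand-in for the string returned by reading the PDF text.
    sample = "PO Number: 4500123\nOrder Date: 05/03/2021\nTotal: 1,250.00\n"
    print(extract_fields(sample))
```

Returning None for a missing label makes it easy to spot templates where a pattern needs adjusting.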
@anasm - Thank you for your response.
My requirement: one PDF file contains multiple Purchase Orders, each with a different shipping address. Each shipping address spans multiple lines in the PDF, and with Read PDF Text the address gets merged with unrelated PO information. Scraping the address with OCR works for the address on the first page, but how do I extract the addresses on subsequent pages of the PDF file? I use Read PDF Text and Regex extensively to extract many other PO details, but I am stuck on the shipping addresses on later pages.
I have an ABBYY FineReader trial license for OCR. My company will buy a FineReader / FlexiCapture license. I am not sure about two things:
Will a FlexiCapture license be able to perform the OCR activity?
Can FlexiCapture capture shipping addresses across different pages of the PDF file?
@sushildarveshi, it depends on how you apply the logic. I have a class type value that appears across different pages, and I use the Regex Builder this way to extract all of its occurrences, as seen in the image.
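As a rough sketch of the "all occurrences" idea applied to the shipping address case, the Python snippet below runs one pattern over the text of the whole document, so page boundaries stop mattering. The sample text and the labels "Ship To:", "Item:" and "PO Number:" are assumptions for illustration only; the real PDF text will look different and the pattern would need adjusting.

```python
import re

# Hypothetical text for a two-page PDF with one PO per page.
# "\f" stands in for a page break; the labels are assumed, not from a real template.
SAMPLE_TEXT = """PO Number: 1001
Ship To:
Acme Corp
12 Main Street
Springfield
Item: Widget
\fPO Number: 1002
Ship To:
Globex Ltd
99 Side Road
Shelbyville
Item: Gadget
"""

# Lazily capture everything after "Ship To:" up to the next known label
# (or the end of the text), so unrelated PO lines are left out.
ADDRESS_PATTERN = re.compile(
    r"Ship To:\s*\n(.*?)(?=\n(?:Item:|PO Number:)|\Z)",
    re.DOTALL,
)


def extract_shipping_addresses(text):
    """Return one single-line address per "Ship To:" block found in the text."""
    addresses = []
    for match in ADDRESS_PATTERN.finditer(text):
        lines = [line.strip() for line in match.group(1).splitlines()]
        addresses.append(", ".join(line for line in lines if line))
    return addresses


if __name__ == "__main__":
    for address in extract_shipping_addresses(SAMPLE_TEXT):
        print(address)
```

The lookahead stops each capture at the next label without consuming it, which is what keeps one PO's address from running into the next PO's details.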