Scrape particular fields from an unstructured, multi-page scanned PDF

Hi Guys,

I have a requirement to extract particular fields from multiple pages of a PDF file. I believe this counts as unstructured input because the PDF is scanned.

I tried the following:

  1. Read PDF with OCR
  • Difficult to extract, because the labels and values are separated by multiple lines of unrelated data, caused by the columns in the PDF.
  2. Get Text with OCR (using Anchor)
  • I am able to extract fields from the 1st page of the PDF (though not always, as the font size differs between scanned PDFs), but I cannot extract from the other pages because they are not visible on screen.
  3. ABBYY
  • Purchasing an ABBYY license is out of scope.

@ovi, @loginerror

I would appreciate any suggestions or guidance.

Thanks in advance!

I have the same requirement, but don’t know how to approach it. Did you manage to resolve it?

@sushildarveshi For templates that were digital (not scanned), I used the Read PDF Text activity and extracted the required data with Regex string manipulation. Non-digital templates are out of scope until the client procures OCR licenses.

@anasm - Thank you for your response.
My requirement: one PDF file contains multiple Purchase Orders, each PO with a different shipping address. The shipping addresses span multiple lines in the PDF, and with Read PDF Text the shipping address information gets merged with unrelated PO information. Scraping the address with OCR works for the address on the first page. But how do I extract addresses that are on subsequent pages of the PDF file? I use Read PDF Text and RegEx extensively to extract much of the other PO information, but I am stuck on the shipping addresses on later pages of the PDF.
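
To make the problem concrete, here is a minimal sketch in Python (not a UiPath workflow) of pulling one multi-line shipping address per PO out of already-extracted text. The labels `PO Number:`, `Ship To:` and `Bill To:` are assumptions for illustration, as is the idea that each address ends at a blank line or at the next label. It also assumes the address lines come out contiguous in the extracted text, which is exactly what fails when columns get merged.

```python
import re

# Hypothetical labels: adjust to whatever your PO layout actually uses.
PO_START = re.compile(r"(?m)^PO Number:\s*(\S+)")
# Capture everything after "Ship To:" up to a blank line, a "Bill To:" label,
# or the end of the block. DOTALL lets "." run across line breaks.
SHIP_TO = re.compile(r"Ship To:\s*(.*?)(?=\n\s*\n|\nBill To:|\Z)", re.DOTALL)


def extract_shipping_addresses(text):
    """Map each PO number to its shipping address, joined onto one line."""
    starts = list(PO_START.finditer(text))
    results = {}
    for i, match in enumerate(starts):
        # A PO block runs from its own header to the next PO header.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        block = text[match.start():end]
        addr = SHIP_TO.search(block)
        if addr:
            lines = [ln.strip() for ln in addr.group(1).splitlines() if ln.strip()]
            results[match.group(1)] = ", ".join(lines)
    return results


sample = (
    "PO Number: 1001\n"
    "Ship To:\n"
    "Acme Corp\n"
    "12 Main St\n"
    "Springfield\n"
    "\n"
    "Item: Widget\n"
    "PO Number: 1002\n"
    "Ship To: Beta Ltd\n"
    "99 Side Rd\n"
    "Bill To: Beta HQ\n"
)

addresses = extract_shipping_addresses(sample)
```

Because the text is split per PO before the address pattern runs, it does not matter which page a PO falls on, as long as the text of all pages has been read into one string.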

I have an ABBYY FineReader trial license for OCR. My company will buy a FineReader / FlexiCapture license. I am not sure:

  1. Whether the FlexiCapture license will perform OCR.
  2. Whether FlexiCapture can capture shipping addresses across different pages of the PDF file.

Appreciate any help.

@sushildarveshi, it depends on how you apply the logic. I have a "class type" field across different pages, and I am using the Regex builder this way to extract all of its occurrences, as seen in the image.
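
The same "match every occurrence" idea can be sketched outside UiPath. This is a Python illustration, not the Regex builder configuration from the post. The label `Class Type:` and the sample page texts are assumptions, and the input is taken to be one text string per page.

```python
import re

# Hypothetical label; replace with whatever precedes the value in your PDFs.
FIELD = re.compile(r"Class Type:\s*(\S+)")


def find_all_occurrences(pages):
    """Return (page_number, value) for every match on every page."""
    hits = []
    for page_no, text in enumerate(pages, start=1):
        # finditer yields every non-overlapping match, not just the first.
        for match in FIELD.finditer(text):
            hits.append((page_no, match.group(1)))
    return hits


pages = [
    "Header text\nClass Type: A\nOther text",
    "No matching field on this page",
    "Class Type: B\nMore text\nClass Type: C",
]

hits = find_all_occurrences(pages)
```

Keeping the page number next to each value makes it easier to check a result against the original document when a match looks wrong.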