Scrape Particular fields from unstructured Multiple Pages Scanned PDF

anasm · August 20, 2019, 10:25pm

Hi Guys,

I have a requirement to extract particular fields from multiple pages of a PDF file. I believe this counts as unstructured input because the PDF is scanned.

I tried using the following

Read PDF with OCR

Difficult to extract, because the columns in the PDF cause each label and its value to be separated by several lines of unrelated data.

Get Text with OCR (using Anchor)

I am able to extract fields from the 1st page of the PDF (though not always, as the font size differs between scanned PDFs), but I am unable to extract from the other pages because they are not visible on screen.

ABBYY

Purchasing an ABBYY license is out of scope.

@ovi, @loginerror

I would appreciate any suggestions or guidelines.

Thanks in advance!

sushildarveshi · December 25, 2019, 2:34pm

I have the same requirement, but don’t know how to approach it. Did you manage to resolve it?

anasm · December 30, 2019, 12:04pm

@sushildarveshi For templates that were digital (not scanned), I used the Read PDF Text activity and extracted the required data with Regex string manipulation. Non-digital templates are out of scope until the client procures OCR licenses.
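To illustrate the Regex step outside of UiPath, here is a minimal Python sketch. It assumes the text has already been read from a digital PDF (the equivalent of the Read PDF Text output); the sample text, the field labels and the `extract_fields` helper are all hypothetical and would need to be adapted to the actual template.

```python
import re

# Hypothetical text, standing in for the output of Read PDF Text on a digital PDF.
sample_text = """Invoice No: INV-1042
Date: 20/08/2019
Total Amount: 1,250.00
"""

# Hypothetical field labels; replace them with the labels used in your template.
FIELD_PATTERNS = {
    "invoice_no": r"Invoice No:\s*(\S+)",
    "date": r"Date:\s*(\d{2}/\d{2}/\d{4})",
    "total": r"Total Amount:\s*([\d,]+\.\d{2})",
}


def extract_fields(text):
    """Return the first match for each label, or None when the label is missing."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(sample_text)
print(fields)
```

Keeping one pattern per field makes it easy to see which label failed to match when a template changes.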

sushildarveshi · December 30, 2019, 12:54pm

@anasm - Thank you for your response.
My requirement: one PDF file contains multiple Purchase Orders, each with a different shipping address. The shipping addresses span multiple lines in the PDF, and with Read PDF Text the shipping address information gets merged with non-relevant PO information. Scraping the address with OCR works for an address on the first page, but how do I extract addresses that are on subsequent pages of the PDF file? I am using Read PDF Text and Regex extensively to extract many other PO details, but I am stuck on shipping addresses on the later pages of the PDF.

I have an ABBYY FineReader trial license for OCR. My company will buy a FineReader / FlexiCapture license. I am not sure:

Whether a FlexiCapture license will perform the OCR activity.
Whether FlexiCapture can capture shipping addresses across different pages of the PDF file.

Appreciate any help.

anasm · December 31, 2019, 8:13am

@sushildarveshi, it depends on how you apply the logic. I have the same class type across different pages, and I am using the Regex Builder this way to extract all the occurrences, as seen in the image.
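As a rough sketch of the "extract all occurrences" idea in plain Python (not the Regex Builder configuration from the image), the following assumes the full PDF text is already available, that pages are separated by a form feed (`\f`), and that each address sits between a "Ship To:" label and a "Payment Terms:" label. The sample text, both labels and the `extract_addresses` helper are hypothetical.

```python
import re

# Hypothetical text of a two-page PDF; "\f" (form feed) is assumed to separate pages.
pdf_text = (
    "PO Number: 1001\nShip To:\nAcme Corp\n12 Main St\nSpringfield\nPayment Terms: Net 30\n"
    "\f"
    "PO Number: 1002\nShip To:\nGlobex Ltd\n99 Harbor Rd\nPayment Terms: Net 45\n"
)

# Capture the lines between the assumed "Ship To:" label and the next label.
# re.DOTALL lets "." cross line breaks; ".*?" stops at the first closing label.
ADDRESS_BLOCK = re.compile(
    r"PO Number:\s*(?P<po>\d+).*?Ship To:\s*\n(?P<address>.*?)\nPayment Terms:",
    re.DOTALL,
)


def extract_addresses(text):
    """Return one entry per PO found, with its page number and address lines."""
    results = []
    for page_no, page in enumerate(text.split("\f"), start=1):
        for m in ADDRESS_BLOCK.finditer(page):
            lines = [ln.strip() for ln in m.group("address").splitlines() if ln.strip()]
            results.append({"page": page_no, "po": m.group("po"), "address": lines})
    return results


addresses = extract_addresses(pdf_text)
for entry in addresses:
    print(entry)
```

Splitting by page first keeps the page number with each match, but it would miss a PO whose address block continues onto the next page; in that case running `finditer` over the whole text is the simpler option.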
