OCR - Text extraction from variable PDF files

MF.RPA · August 23, 2021, 2:00pm

Hello,

I am having some trouble with a variable OCR application.

Scope: I have created an RPA workflow that extracts data from a standard PDF file, a purchase order (PO). The number of lines in the sales records area of the PDF varies from document to document.

Problem: The workflow works properly if there is only one sales record on the purchase order. If there are additional sales records, I cannot extract their data.

Question: What is the best method to scrape each of the sales records that appear in my PO file?

Is there a way to have the RPA identify how many sales records there are and then scrape 1-3 lines?

Below I included a sample PO document.

Lakshmi · August 23, 2021, 2:03pm

Hello @MF.RPA,
Can you explain in more detail? Are you using any Document Understanding features to capture the PO line items?

MF.RPA · August 23, 2021, 3:26pm

@Lakshmi

I am using OmniPage OCR. Using this activity I am able to specify the fields that need to be extracted from the PDF / PO file.

Here is the workflow

Thank you,

Lakshmi · August 24, 2021, 6:17am

Hello @MF.RPA,

Which extractor are you using in the Data Extraction Scope?
Please use the ML extractor to extract PO-related information. Please check the link below for more info.

MF.RPA · August 24, 2021, 1:35pm

Hi,

Thank you for your suggestion. Currently I am using the Form Extractor. I am going to try the machine learning extractor and will report back with my results.

Lakshmi · August 24, 2021, 2:44pm

The Form Extractor is meant for static forms where the layout doesn’t change. For dynamic layouts, such as a PO with a varying number of line items, you have to use the ML extractor.
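
For readers who only have the raw OCR text and want to see the idea behind "identify how many sales records there are, then scrape each one", here is a minimal Python sketch. It is not UiPath code and not the ML extractor suggested above. It assumes a hypothetical row layout (line number, item code, description, quantity, unit price), since the sample PO is not reproduced in this thread; the `ROW` pattern, the `extract_sales_records` helper, and the sample text are all made up for illustration and would need to be adapted to the real document.

```python
import re

# Hypothetical row layout (an assumption, the real PO is not shown in the thread):
#   <line no> <item code> <description> <quantity> <unit price>
ROW = re.compile(
    r"^\s*(?P<line>\d+)\s+(?P<item>[A-Z0-9-]+)\s+(?P<desc>.+?)\s+"
    r"(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})\s*$"
)


def extract_sales_records(ocr_text):
    """Return one dict per line of OCR text that looks like a sales record."""
    records = []
    for raw in ocr_text.splitlines():
        m = ROW.match(raw)
        if m is None:
            continue  # header, total, or other non-record line
        records.append({
            "line": int(m.group("line")),
            "item": m.group("item"),
            "description": m.group("desc"),
            "quantity": int(m.group("qty")),
            "unit_price": float(m.group("price")),
        })
    return records


# Made-up sample text standing in for the OCR output of a PO.
sample = """\
Purchase Order 12345
Line Item Description Qty Price
1 AB-100 Steel bracket 10 2.50
2 CD-200 Hex bolt M8 250 0.12
3 EF-300 Washer 500 0.03
Total 70.00
"""

records = extract_sales_records(sample)
print(len(records), "sales records found")
for rec in records:
    print(rec)
```

The number of sales records is simply the number of lines that match the row pattern, so the same loop handles one, two, or three records without changes. A pattern like this only holds up while the row format stays stable; for layouts that really vary, the ML extractor remains the route recommended in this thread.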
