OCR - Text extraction from variable PDF files


I am having some trouble with a variable OCR application.

Scope: I have built an RPA workflow that extracts data from a standard PDF file, which is a purchase order (PO) document. The number of lines in the sales records area varies from one document to the next.

Problem: The workflow works properly if there is only one sales record on the purchase order. If there are additional sales records, I cannot extract their data.

Question: What is the best method to scrape each of the sales records that appear in my PO file?

Is there a way to have the RPA identify how many sales records there are and then scrape each of them (for example, 1-3 lines)?

Below I included a sample PO document.

Hello @MF.RPA,
Can you explain in more detail? Are you using any Document Understanding features to capture the PO line items?


I am using OmniPage OCR. With this activity I am able to specify the fields that need to be extracted from the PDF / PO file.

Here is the workflow

Thank you,

Hello @MF.RPA ,

Which extractor are you using in the Data Extraction Scope?
Please use the ML Extractor for extracting PO-related information. Please check the link below for more info.


Thank you for your suggestion. Currently I am using Form Extractor. I am going to try the machine learning extraction. I will report back with my results.

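The Form Extractor is meant for static forms where the layout doesn’t change, so it only finds values at the positions defined in its template. For dynamic layouts, such as a PO with a varying number of line items, you have to use the ML Extractor.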
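
To illustrate why a fixed-position template breaks down when the number of rows varies, here is a minimal Python sketch. It is not UiPath code and it does not show how the ML Extractor works internally; it only contrasts the idea of reading one fixed row with matching every row that looks like a sales record. The sample text, the assumed column layout (item code, description, quantity, unit price) and the names `SAMPLE_PO_TEXT`, `LINE_ITEM` and `extract_line_items` are all made up for this example.

```python
import re

# Hypothetical OCR output of a purchase order; a real PO layout will differ.
SAMPLE_PO_TEXT = """\
Purchase Order 12345
Item      Description        Qty   Unit Price
A-100     Widget small       2     10.50
B-200     Widget large       1     25.00
C-300     Gasket             10    0.75
Total                              53.50
"""

# Assumed row shape: item code, free-text description, integer quantity,
# decimal unit price, separated by spaces or tabs on a single line.
LINE_ITEM = re.compile(
    r"^(?P<item>[A-Z]-\d+)[ \t]+"
    r"(?P<description>.+?)[ \t]+"
    r"(?P<qty>\d+)[ \t]+"
    r"(?P<price>\d+\.\d{2})[ \t]*$",
    re.MULTILINE,
)


def extract_line_items(text):
    """Return one dict per line that matches the assumed row pattern."""
    return [
        {
            "item": m.group("item"),
            "description": m.group("description"),
            "qty": int(m.group("qty")),
            "price": float(m.group("price")),
        }
        for m in LINE_ITEM.finditer(text)
    ]


records = extract_line_items(SAMPLE_PO_TEXT)
print(f"Found {len(records)} sales records")
for record in records:
    print(record)
```

Because the rows are found by pattern rather than by position, the number of sales records does not need to be known in advance: the length of the returned list is the count, and the same loop handles one row or many. In a real workflow the row pattern would have to be adapted to the actual PO layout and to whatever noise the OCR engine introduces.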