Extract data from PDF invoices to CSV: what works best?

Preamble: I am very new to RPA and UiPath.
Problem statement: extract about 10 or 11 fields from PDF invoices and write them to an Excel/CSV file, using the following column header names:
AccountName, CustomerName Billing: email, phone number, address line1, City, State/Province Zip AR + WO, last date due and customer ID.

Have been exploring several options:

  1. Read PDF Text: I get a blank flat file.

  2. Read PDF With OCR: much better; the output is attached as sample.txt.

  3. Followed this example: https://docs.uipath.com/activities/docs/manual-validation-for-digitize-documents
    Arrived at the attached presentValidationActivity.jpeg.

Questions:

  1. If there are around 3,000 invoices and only scanned images are available, what is the best way to go?

  2. In the Validation Station (presentValidationActivity.jpeg), I am able to indicate the values to be saved correctly, and the Save button indicates this info is saved. Where is it stored? How can I access it?

[samplePDF.txt|attachment](upload://j9ps1VSojo4bv17queeXXHCWH1C.txt) (1.9 KB)

Hello @mangala_janardhana. Welcome to the forums!

Validation results are stored in a separate object of type ExtractionResults. You have to provide that object as an output to the Validation Station activity as shown below.

image

Once you hit Save on the Validation Station (as seen in your screenshots), control returns to the workflow in the automation.

The next step is to export the extracted results from the ExtractionResults object and write it out to files as shown in this snippet below (steps 1 - 3)

Step 1:
Export the extracted results as a DataSet.

Each DataSet contains one or more DataTables, and each DataTable holds a different piece of information about the data extracted from your document. Check the Excel output written in step 2 below to see which kinds of information you can use further downstream in your automation.

Step 2:
The next step is to inspect each DataTable in the DataSet and write each DataTable as a worksheet to an Excel document (the name of each sheet in this case is DT.TableName)

Step 3:
Optionally, you may choose to write the first of the DataTables to a CSV file. The output CSV file has the same name as the Excel file, except that it has a .csv extension. This is just a convenient way of linking the exported Excel file to this CSV.
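
For readers who want to see the naming convention from step 3 outside of UiPath Studio, here is a minimal Python sketch. It is not the UiPath API: the dict of tables is only a stand-in for the exported DataSet, and the table names, the field names and the helper names `csv_path_for` and `write_first_table` are all made up for illustration.

```python
import csv
from pathlib import Path


def csv_path_for(excel_path):
    """Return the companion CSV path: same folder and base name, .csv extension."""
    return Path(excel_path).with_suffix(".csv")


def write_first_table(tables, excel_path):
    """Write the first table of a DataSet-like dict to the companion CSV.

    `tables` maps a table name to a list of row dicts (hypothetical stand-in
    for the DataTables in the exported DataSet).
    """
    _name, rows = next(iter(tables.items()))
    out = csv_path_for(excel_path)
    # Use the keys of the first row as the header; an empty table gives an empty header.
    fieldnames = list(rows[0].keys()) if rows else []
    with out.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return out
```

Calling `write_first_table(tables, "invoice_001.xlsx")` would create `invoice_001.csv` next to the Excel file, which is what keeps the two outputs easy to pair up.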

Cheers! :beers:

Thank you. Will try this and report back shortly.

I still have this open question, though: is this the best way to go for around 3,000 invoices?
Reason: there is a lag while the Validation Station comes up. Eventually I won't need this station once the entire robot is built, but my assumption is that it is a lot slower than the Read PDF OCR text activity. Is this correct?

The Validation Station is meant to be a human-process interface and is bound to be slower, as it gathers the information extracted (or failed to extract) along with the document to ease the human supervision process.

Eventually, you may need logic to decide whether to send a document to Action Center for human validation when any of the extracted information is not of acceptable quality.

Alternatively, you can opt to skip the failed files entirely and save them to a \rejected folder, to be handled separately using a faster process if you can devise one.
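
The routing decision described above could look roughly like the following Python sketch. This is only an illustration of the logic, not UiPath code: the 0.8 threshold, the `(value, confidence)` layout and the function names `route` and `move_to_rejected` are assumptions, and in a real workflow the confidence values would come from whatever the extractor reports.

```python
import shutil
from pathlib import Path

CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off; tune it for your own documents


def route(fields, threshold=CONFIDENCE_THRESHOLD):
    """Decide what to do with one document.

    `fields` maps a field name to a (value, confidence) pair, where a value
    of None means the extraction failed for that field (hypothetical layout).
    Returns "accept", "review" (send to human validation) or "reject".
    """
    if not fields or all(value is None for value, _ in fields.values()):
        return "reject"  # nothing usable was extracted
    if any(value is None or conf < threshold for value, conf in fields.values()):
        return "review"  # at least one field needs a human look
    return "accept"


def move_to_rejected(pdf_path, rejected_dir="rejected"):
    """Move a failed file into the rejected folder, creating it if needed."""
    target_dir = Path(rejected_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    return shutil.move(str(pdf_path), str(target_dir / Path(pdf_path).name))
```

With a few thousand invoices, the idea is that only the "review" documents ever reach a human, while "accept" documents go straight to the CSV and "reject" documents are set aside.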

thanks

agree thanks again