Data scraping for PDF

Dear Sir,
ATLAST Purchaae Order-converted (1)2.pdf (74.6 KB)

From this PDF I want the following data in variables:

1-- “07-December-2021” after (Purchase Order Date)
2-- “ATPL/PO.NO.039/1112” after (Customer Ref No)
3-- “Pasrt Software India Pvt. Ltd” after (Customer Name : )
4-- “HW296-Go-Global for Windows Remote Access Software Concurrent User License”
5-- “9973…”
6-- “1.00”
7-- “5500.00”
8-- “85000”
9-- “Against invoice 1234” after (Remarks : )

Thanks,

Hi @badal_patel

Have you tried this with Document Understanding?

Regards
Gokul

Yes, but it is not working.

Can you please let me know if any solution is available?

If you are not using Document Understanding or some other OCR, you will need to use Read PDF Text followed by a number of regex statements. Note that the regex option will only work on this document layout: if you run it against other documents with a different structure, it will not work.

Can you send me the regex code?

You can read the PDF and then use regular expressions to fetch the values, but this will work only for similar PDFs where the static fields remain the same (like Purchase Order Date, Customer Name :… )
I tried it for a few of the fields and the values are coming through properly.

(?<=Purchase Order Date\s+)([\d-\w]+)
(?<=Customer Ref No\s+)([\w/.\d]+)
(?<=Customer Name :\s+)(.*)(?= Shipping)

I’ll share the remaining ones as soon as possible.
I hope it helps.
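For anyone who wants to test the three header patterns outside UiPath, here is a minimal Python sketch. It is not the workflow from this thread: the `sample_text` layout is an assumption (the real Read PDF Text output is not shown here), and the names `patterns` and `extract_fields` are made up for illustration. Python's `re` module only allows fixed-width lookbehind, so the `(?<=...\s+)` form used in .NET is swapped for a plain capture group.

```python
import re

# Hypothetical text standing in for the Read PDF Text output.
# The real layout of the PDF is an assumption, not taken from the file.
sample_text = (
    "Purchase Order Date 07-December-2021\n"
    "Customer Ref No ATPL/PO.NO.039/1112\n"
    "Customer Name : Pasrt Software India Pvt. Ltd Shipping Address\n"
)

# Field name -> pattern. Group 1 holds the value that follows the label.
patterns = {
    "po_date": r"Purchase Order Date\s+([\w-]+)",
    "customer_ref": r"Customer Ref No\s+([\w/.]+)",
    "customer_name": r"Customer Name :\s+(.*?)(?= Shipping)",
}


def extract_fields(text):
    """Return a dict of field values; a field is None when its label is missing."""
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        result[name] = match.group(1).strip() if match else None
    return result


fields = extract_fields(sample_text)
print(fields)
```

Returning `None` for a missing label makes it easier to see which static field did not match when a pattern works for one person but not for another.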

Only Customer Ref No is working.

Please send me the remaining ones ASAP.
Thanks

For me, all are working so far. I'm working on the others; it might take some time.

(?<=Total Amount Before Tax )(.*) --This is for 5,500
(?<=Total amount after Tax )(.*) --This is for 85,500
(?<=Remarks : )(.*)
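Once the totals are captured they are still strings such as "5,500", so they usually need converting before any arithmetic. A rough Python sketch of that step follows. The `sample_text` lines are an assumed layout, and `capture` and `to_amount` are hypothetical helper names, not UiPath activities.

```python
import re
from decimal import Decimal

# Hypothetical lines standing in for the Read PDF Text output (assumed layout).
sample_text = (
    "Total Amount Before Tax 5,500\n"
    "Total amount after Tax 85,500\n"
    "Remarks : Against invoice 1234\n"
)


def capture(label, text):
    """Return the rest of the line after `label`, or None if the label is absent."""
    match = re.search(re.escape(label) + r"(.*)", text)
    return match.group(1).strip() if match else None


def to_amount(raw):
    """Convert a string like '5,500' or '5500.00' to Decimal, dropping thousands separators."""
    return Decimal(raw.replace(",", ""))


before_tax = to_amount(capture("Total Amount Before Tax ", sample_text))
after_tax = to_amount(capture("Total amount after Tax ", sample_text))
remarks = capture("Remarks : ", sample_text)
print(before_tax, after_tax, remarks)
```

The labels are matched case-sensitively, exactly as written in the patterns above ("Amount" in one, "amount" in the other), so a PDF that capitalises them differently would need the labels adjusted.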


All working.

Please send me the ones for
“9973…”

“1.00”

The last one is tricky: this regex will give you all the items, and you have to loop through them and fetch the column you want.

(\d+ )([\s\w-.,]+)(?=Total Amount Before Tax)

I hope it helps.
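The "loop through it and fetch the column" step could look roughly like the Python sketch below. Everything about the row layout is an assumption: the column order (serial number, description, code, quantity, unit price) is guessed from the values requested at the top of the thread, the code is kept truncated as "9973…" because that is how it appears here, and `ROW` and `parse_items` are hypothetical names.

```python
import re

# Hypothetical item block, as it might look after the regex above matched it.
# Assumed column order: serial, description, code, quantity, unit price.
item_block = (
    "1 HW296-Go-Global for Windows Remote Access Software "
    "Concurrent User License 9973… 1.00 5500.00\n"
)

# One row: leading number, free-text description, one token for the code,
# then two decimal numbers at the end of the line.
ROW = re.compile(r"^(\d+)\s+(.+?)\s+(\S+)\s+(\d+\.\d{2})\s+([\d,]+\.\d{2})$")


def parse_items(block):
    """Split the block into lines and return one dict per line that looks like an item row."""
    items = []
    for line in block.splitlines():
        match = ROW.match(line.strip())
        if match:
            serial, description, code, quantity, unit_price = match.groups()
            items.append({
                "serial": serial,
                "description": description,
                "code": code,
                "quantity": quantity,
                "unit_price": unit_price,
            })
    return items


items = parse_items(item_block)
for item in items:
    print(item["code"], item["quantity"])
```

Anchoring the two numeric columns to the end of the line is what lets the description contain spaces and hyphens. If the real rows wrap across several lines, this approach would need rethinking.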

Hi @badal_patel

Here is the workflow

PDFExtact.xaml (10.2 KB)

Hope it will work

Regards
Gokul

Hi @badal_patel

Try this expression for the description:

System.Text.RegularExpressions.Regex.Match(RemoveSpace,"(?<=^\d\s)(\S+)(.*?)(?=\d)").ToString.Trim

Regards
Gokul

Another way of fetching the goods details is Data Scraping:

  1. Scrape the data and update the selector so that it accepts all similar PDFs.
  2. Filter the data using the condition below (Column-5 is “Unit Price”) and create a new DataTable.

image

  3. This DataTable will have only the goods details from all similar invoices, which can easily be fetched by another workflow (pass this DataTable as an argument and write the logic to fetch all the information).

Can you please send me the flow?

Hi @badal_patel

Do you have the PDF dependencies in your project?

If not, download the package from Manage Packages and also enable the activities that are in a disabled state (commented out).

Regards
Gokul

Below is the flow file. Also, this is just for one PDF; you have to build the logic around it for multiple files.
Main.xaml (11.0 KB)

DataTable “NewDT” will have only the goods details, which you can use further to extract the relevant information.

I hope it helps.

Can you please send me a .zip file that also includes the screenshot folder?