PDF data scraping

Hi All,

Hope you are all doing well.

Currently, I’m working on a project in which I have to extract data from multiple PDF files. The problem is that the data sits in a table inside the PDF, and the table can vary from file to file (rows are added or removed).

I want to extract the table data and fill it into an Excel form. I have tried several PDF data extraction methods, but none of them worked for this. :disappointed_relieved:

Kindly requesting your help with this.

I have attached sample PDF files and images of the data table.

PDF documents:
4551660987.pdf (25.6 KB)
4552026042.pdf (25.5 KB)

Hi

  1. Install the required package.
  2. Enable PDF accessibility mode. You have to do this in your PDF reader (go to Edit >
    Accessibility > Change Reading Options and choose "Use reading order in raw print stream").
    (Do not skip this step; it is the most important one. Without it you will
    not be able to select the proper words.)
  3. Then use a Get Text activity and indicate the bold + italic text you need.
  4. In the Get Text activity, choose a proper selector and make it dynamic as needed. Save the output to a variable.
  5. Use an Assign activity on the same variable and clean it up with Replace, Remove, or Trim, whichever suits you.
  6. The variable is saved as a String only.

Hi @ankur1984

Thanks a lot for the quick response. I’ll try it and let you know.

Hi,
I would suggest a regular expression to extract the fields.

Regards,
pavan H

Hi @charith_wickramasing,

Step 1: Read the PDF and get the output as a string.
Step 2: Use regex to fetch the data.
If possible, can you provide this data as a text file? We can help you with the regex.
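
For anyone following along outside UiPath, the two steps can be sketched in Python. This is only an illustration, not the regex that was shared in this thread: it assumes Step 1 has already produced the text (the sample below is copied from the output posted later in the thread), and the names `DATE_RE` and `AMOUNT_RE` are made up for the example. The patterns assume dates look like `dd.mm.yyyy` and amounts use a comma as the thousands separator with two decimals.

```python
import re

# Assumed result of Step 1 (PDF read as a string); copied from the thread.
text = """NESTLE CORN FLAKES Cereal 1 8x275g N2 XK
30.03.2019 3,276.00
NESTLE CORN FLAKES Cereal N3 XK
30.03.2019
Total net value excl. tax 1,853.28
5,129.28"""

# Step 2: regex patterns (hypothetical names, assumed formats).
# Dates such as 30.03.2019.
DATE_RE = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")
# Amounts such as 3,276.00. The lookarounds stop the pattern from
# matching the "30.03" part of a date.
AMOUNT_RE = re.compile(r"(?<![\d.,])\d{1,3}(?:,\d{3})*\.\d{2}(?![\d.])")

dates = DATE_RE.findall(text)
amounts = AMOUNT_RE.findall(text)

print(dates)    # ['30.03.2019', '30.03.2019']
print(amounts)  # ['3,276.00', '1,853.28', '5,129.28']
```

Whether these patterns hold for every file depends on how consistent the PDFs are, so they would need checking against more samples.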

Thanks,

Hi Pavanh,

Would you be able to share some examples?

Sure, I’ll attach the text file. :slight_smile: Thanks a lot for the help.

Hi,
Please share the PDF file you need to extract the data from, if possible, and we can write a regex to get the required fields.

Regards,
Pavan H

Hi Pavanh

I have attached some sample PDF files. I want to extract the data in the data table, as mentioned above.

4551660987.pdf (25.6 KB)
4552026042.pdf (25.5 KB)

Hi @Gouda_6

Here is the text output for the table:

Item
10
20 Material No.
Quantity
12317082
180
12317138
176 Vendor Mat. No
Unit
Case
Case Description
Delivery Date & Time Price/Unit *Net Value
NESTLE CORN FLAKES Cereal 1 8x275g N2 XK
30.03.2019 3,276.00
NESTLE CORN FLAKES Cereal N3 XK
30.03.2019
Total net value excl. tax 1,853.28
5,129.28

Hi @charith_wickramasing,

Try using this regex,


Thanks a lot, I’ll check and let you know.

Hi @Gouda_6,

It’s working fine. Do you know how to extract the values of Description, Quantity, Unit, Delivery Date & Time, and *Net Value separately?

Example:
NESTLE CORN FLAKES Cereal /1 8x275g/ N2 XK/30.03.2019/ 3,276.00

Like this

@charith_wickramasing

In this example you can split the string on “/”. Each element of the resulting array will contain one value; for example, array(0) will contain “NESTLE CORN FLAKES Cereal”.
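
A minimal sketch of that split, written in Python rather than as a UiPath Assign expression. It uses the example line exactly as posted above; note that the slashes were added by hand in that example, so this only applies if the string really contains a delimiter. The variable names are chosen for illustration.

```python
# Example line as posted in the thread, with "/" between the values.
line = "NESTLE CORN FLAKES Cereal /1 8x275g/ N2 XK/30.03.2019/ 3,276.00"

# Split on "/" and trim the stray spaces around each value.
fields = [part.strip() for part in line.split("/")]

print(fields[0])  # NESTLE CORN FLAKES Cereal
print(fields)
# ['NESTLE CORN FLAKES Cereal', '1 8x275g', 'N2 XK', '30.03.2019', '3,276.00']
```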

Hi @Manjuts90

Thanks for the reply.

I mean, is there any particular way to extract the values of Description, Quantity, Unit, Delivery Date & Time, and *Net Value separately from this string?

NESTLE CORN FLAKES Cereal 1 8x275g N2 XK
30.03.2019 3,276.00
NESTLE CORN FLAKES Cereal N3 XK
30.03.2019
Total net value excl. tax 1,853.28
5,129.28

@charith_wickramasing Can you mark which values need to be extracted? Also, if I extract values from this string, will the format be the same across all the other strings you get?
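
Until the format question is answered, here is one possible way to pull the rows apart, sketched in Python. It rests on assumptions taken only from the single sample string above: each item is a description line followed by a line that starts with a `dd.mm.yyyy` date and may or may not carry an amount, and the item list ends at the line starting with "Total". The function name `parse_items` and the dictionary keys are hypothetical.

```python
import re

# A line that starts with a date, optionally followed by an amount.
ROW_TAIL_RE = re.compile(r"^(\d{2}\.\d{2}\.\d{4})(?:\s+([\d,]+\.\d{2}))?$")

def parse_items(raw):
    """Group description lines with the date/amount line that follows them."""
    items = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("Total"):
            break  # assumed end of the item rows
        tail = ROW_TAIL_RE.match(line)
        if tail and items:
            # Attach date and (possibly missing) amount to the last item.
            items[-1]["delivery_date"] = tail.group(1)
            items[-1]["net_value"] = tail.group(2)
        elif not tail:
            items.append({"description": line,
                          "delivery_date": None,
                          "net_value": None})
    return items

raw = """NESTLE CORN FLAKES Cereal 1 8x275g N2 XK
30.03.2019 3,276.00
NESTLE CORN FLAKES Cereal N3 XK
30.03.2019
Total net value excl. tax 1,853.28
5,129.28"""

items = parse_items(raw)
for item in items:
    print(item)
```

In this sample the second item comes back with no net value, because its amount (1,853.28) was placed on the "Total" line by the text extraction. That is exactly why it matters whether the layout is the same across all files.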

Hi @charith_wickramasing,

Try using this regex,

Thanks,