How to read specific data from a PDF file

I have a PDF file, BDA_Bulletin_20210501.pdf (8.1 MB), that contains multiple CPV codes. Each CPV code sits in its own section, so whenever I fetch a CPV code I also need to fetch the address that appears in the same section.

Can anyone help me out?

Hello @chaithanya_kumar_M ,

A high level idea can be:

  1. Use Read PDF Text, which gives you the whole document as a string variable.
  2. Split that string by the CPV number or a similar marker.
  3. The split creates an array of strings, each of which should contain the information for one section.
  4. For each string, add some activities to check whether a specific CPV code exists.
  5. If it exists, create some rules to extract the address data, using regex or other string operations.
  6. You can put everything in a DataTable, then export it to an Excel file at the end (Write Range).
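The steps above can be sketched outside UiPath as follows. This is a rough Python illustration, not the actual workflow: the sample text, the "Section " marker used for the split, and the helper name extract_rows are all assumptions, since the real layout of the bulletin is not shown in this thread. Only the labels "CPV principal" and "adresse principale" come from the PDF as quoted later in the thread.

```python
import re

# Hypothetical sample text: the real bulletin layout is not shown in this thread.
SAMPLE_TEXT = (
    "Section 1\n"
    "Adresse principale: Rue Exemple 1, 1000 Bruxelles\n"
    "CPV principal: 45000000\n"
    "Section 2\n"
    "Adresse principale: Avenue Test 2, 4000 Liege\n"
    "CPV principal: 79000000\n"
)


def extract_rows(pdf_text, section_marker="Section "):
    """Split the text into sections and return (cpv_code, address) pairs.

    section_marker is an assumed delimiter; replace it with whatever
    actually separates the notices in your PDF text.
    """
    rows = []
    for section in pdf_text.split(section_marker):
        # Step 4: check whether this section contains a CPV code at all.
        cpv = re.search(r"CPV principal:\s*(\d{8})", section)
        if cpv is None:
            continue
        # Step 5: pull the address from the same section, if there is one.
        addr = re.search(r"Adresse principale:\s*(.+)", section)
        rows.append((cpv.group(1), addr.group(1).strip() if addr else ""))
    return rows


rows = extract_rows(SAMPLE_TEXT)
for code, address in rows:
    print(code, "|", address)
```

In UiPath the same idea would map to a Split on the string, a For Each over the resulting array, and Add Data Row into the DataTable for every section where a code is found.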

I hope it helps.

Vasile.

@chaithanya_kumar_M - Is this related to this post?

Also, please let us know where you want to fetch the address from. A screenshot would be helpful…

Hi @prasath17

Yes, it is, but there we are fetching only one value. The actual task is to fetch all the CPV codes:

  1. Fetch the CPV code (we need to search the entire PDF file, since multiple CPV codes will be present).
  2. For each CPV code we fetch, fetch the address from the same section.

Please check the screenshot.

When I use 0 or more, I get an error.

Regards,
Chaithanya

Hi @wasea

Thanks for the suggestion. I will try it and let you know.

Regards,
Chaithanya

Hi @chaithanya_kumar_M ,

Also, it might help to take a look at this topic.

Hope it helps!
Best regards,
Marius

Sorry again, it’s not clear…

  1. When I searched for “cpv principal” I got 66 hits in the attached PDF. Would you like to fetch all of them?
  2. When I searched for “adresse principale” I got 29 hits in the attached PDF, and only 5 or 6 of them have the CPV code on the same page, as shown below. Is this what you would like to extract?

It would be better if you could provide samples from the attached PDF and a brief description of the requirement, so that we can help.

Hi @prasath17

Each CPV code is published in up to three languages (EN, NL, FR), but not always in all three. So my idea is to fetch all the CPV codes and addresses and then delete the duplicates.
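A minimal Python sketch of the "fetch everything, then delete duplicates" idea, keeping the first occurrence of each row. The sample rows and the helper name drop_duplicates are hypothetical. One caveat worth checking against the real PDF: if the address text differs between the EN, NL and FR versions, the (code, address) pairs will not be identical, so the sketch also allows deduplicating on the CPV code alone via an optional key.

```python
def drop_duplicates(rows, key=None):
    """Return rows without duplicates, preserving first-seen order.

    key is an optional function that picks what counts as "the same row",
    e.g. key=lambda r: r[0] to deduplicate on the CPV code only.
    """
    seen = set()
    unique = []
    for row in rows:
        marker = key(row) if key else row
        if marker not in seen:
            seen.add(marker)
            unique.append(row)
    return unique


# Hypothetical rows, as if the same notice appeared in two languages.
rows = [
    ("45000000", "Rue Exemple 1, 1000 Bruxelles"),
    ("45000000", "Rue Exemple 1, 1000 Bruxelles"),
    ("45000000", "Voorbeeldstraat 1, 1000 Brussel"),
    ("79000000", "Avenue Test 2, 4000 Liege"),
]

unique_pairs = drop_duplicates(rows)                       # exact pairs
unique_codes = drop_duplicates(rows, key=lambda r: r[0])   # one row per code
```

Deduplicating on exact pairs keeps both spellings of the address above, while deduplicating on the code keeps only the first one seen.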

@chaithanya_kumar_M - Please find the starter help here…

Build DataTable

Output is Dt ==> Datatable variable

Read your PDF and Store the output to StrInput

Matches activity
Input is StrInput

Pattern used = "(?<=CPV principal:.+)\d{8}"

Output is IEnRegex
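For anyone testing the pattern outside UiPath: the lookbehind above contains `.+`, which the .NET regex engine accepts, but Python's built-in re module only allows fixed-width lookbehinds. A rough Python equivalent therefore uses a capturing group instead. The sample text below is made up, and the group-based pattern returns the first 8-digit number after each "CPV principal:" label, which may differ slightly from the .NET pattern if a line holds several 8-digit numbers.

```python
import re

# Hypothetical text standing in for the output of Read PDF Text.
text = (
    "CPV principal: 45000000\n"
    "Some other line 12345678\n"
    "CPV principal: 79000000\n"
    "CPV principal: 45000000\n"
)

# Capturing group instead of a variable-width lookbehind.
CPV_PATTERN = re.compile(r"CPV principal:.+?(\d{8})")

codes = CPV_PATTERN.findall(text)
print(codes)
```

The 8-digit number on the line without the label is skipped, and the repeated code is still present at this stage, which is why the distinct step later in the workflow is needed.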

Assign

Dt = (From m In IEnRegex.Cast(Of Match)
Select Dt.Rows.Add(m.ToString)).CopyToDataTable

Assign

 DtUnique = dt.DefaultView.ToTable(True,"CPV Code")

Here is the output:
Output.xlsx (9.9 KB)

Hope this helps…