How to dynamically extract data from PDF files

Hi all,

I have multiple PDFs and want to extract the data dynamically from all of them.
I have attached a screenshot for reference. According to the screenshot,
I want to extract the fields below:

1.Finding
2.Root cause
3.Risk Assessment
4.Recommendation
5.Management Response and target date
6.Responsible Party

Note: The points highlighted in red in the image are separate points. There are more than 10 such points, and for each one the fields above need to be extracted dynamically.

I'm new to PDF extraction, so please suggest how to do it.

@yashashwini2322

Use the regular expression below:

(?<=\n)[A-Z]+[a-z]+.*(?=:\s+)

Regards

Hi @yashashwini2322

1. Read the PDF file using the Read PDF Text activity and store the result in a variable.
2. Use a regex to extract the data from that variable.

(?<=\n)[A-Z].*(?=:)

I hope it helps!!
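To see what a heading pattern like the ones above picks up, here is a minimal sketch in Python (chosen only because it is easy to run; the same pattern syntax should carry over to .NET regex). The sample text is hypothetical and stands in for the output of Read PDF Text. The sketch uses `[^:\n]*` instead of `.*`, which is my own variation so that the match stops at the first colon on the line.

```python
import re

# Hypothetical sample standing in for the text returned by Read PDF Text.
sample_text = (
    "DETAILED AUDIT FINDINGS\n"
    "Finding: Access reviews were not performed.\n"
    "Root cause: No owner was assigned.\n"
    "Recommendation: Assign an owner.\n"
)

# A line that starts with a capital letter, matched up to (not including)
# the first colon on that line. The lookbehind requires a preceding newline,
# so the very first line of the text is never matched.
heading_pattern = re.compile(r"(?<=\n)[A-Z][^:\n]*(?=:)")

headings = heading_pattern.findall(sample_text)
print(headings)
```

This only lists the heading names; getting the text under each heading needs a further split or capture step.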

@yashashwini2322

Hi,

Use Read PDF Text to read the PDF and store the result in a string variable.
Use regex or string manipulation to extract the data from the PDF text:
System.Text.RegularExpressions.Regex.Match(yourPdfText, "(?<=\n)[A-Z]+[a-z]+.*(?=:)").Value

I want to extract the data that appears under the headings listed above. Those headings are the subheadings.
And

  1. Moderate…
    2.Minor…

are the point headings.

The main heading is
Detailed audit findings

Only the main heading “DETAILED AUDIT FINDINGS” is constant.
The point headings (1…, 2…) are not constant.
The subheadings inside points 1…, 2… are constant.

@yashashwini2322

Please mark the required fields in yellow and send it to me.

@yashashwini2322

Hi

Using this syntax you are able to extract the
2.Root cause
3.Risk Assessment
4.Recommendation
5.Management Response and target date
6.Responsible Party

The parts highlighted in red are the common headings.
The parts highlighted in green are the numbered subheadings (the points).
The parts highlighted in yellow are the fields to extract dynamically.

Note: The PDF has 10 or more green numbered subheadings like this.

@yashashwini2322

One way: if each subheading (the green part) starts with a number, you can do the following.

Say all the data is stored in a variable str.

Then, to get each point separately along with its related data, we can use

System.Text.RegularExpressions.Regex.Split(str, "^\d+\.", System.Text.RegularExpressions.RegexOptions.Multiline)

This splits at the point numbers 1., 2., etc. and gives each point together with its explanation. RegexOptions.Multiline is needed so that ^ matches at the start of every line rather than only at the start of the text.

Then use a For Each loop over the result and split each item again, this time on the fixed headings Finding:, Root Cause:, etc.

CurrentItem.Split({"Finding:", "Root Cause:", "Risk Assessment:", "Recommendation:", "Management Response and target date:", "Responsible Party:"}, StringSplitOptions.None)

Write each heading exactly as it appears in your PDF text. Say the result is stored in arr, a variable of type array of String.

Then arr(0) gives the point title, arr(1) gives the finding, and so on.
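The two-stage split described above can be sketched as follows. This is a Python illustration under assumptions, not the attached workflow: the sample text, the heading spellings, and the variable names are all hypothetical, and `re.MULTILINE` plays the role that `RegexOptions.Multiline` plays in .NET.

```python
import re

# Hypothetical text; in the workflow this would be the whole PDF text.
pdf_text = (
    "1. Moderate - Access reviews\n"
    "Finding: Reviews were skipped.\n"
    "Root cause: No owner.\n"
    "2. Minor - Password policy\n"
    "Finding: Passwords never expire.\n"
    "Root cause: Legacy setting.\n"
)

# Stage 1: split at "<number>." at the start of a line; drop empty pieces.
points = [
    p for p in re.split(r"^\d+\.", pdf_text, flags=re.MULTILINE) if p.strip()
]

# Stage 2: split each point on its fixed sub-headings (assumed spellings).
heading_splitter = re.compile(r"Finding:|Root cause:")
rows = []
for point in points:
    parts = [part.strip() for part in heading_splitter.split(point)]
    # parts[0] = point title, parts[1] = finding, parts[2] = root cause
    rows.append(parts)

for row in rows:
    print(row)
```

Splitting on headings relies on every point containing the headings in the same order; if a heading is missing in one point, the positions in that row shift.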

Hope this helps

Cheers

Sorry …Can you elaborate more…?

@yashashwini2322

Please check this sample



Sequence.xaml (8.7 KB)

cheers

No… it’s not working

@yashashwini2322

Can you please tell what is not working?

I tried the same and I am able to see the exact output.

What difference did you see? Just saying "not working" does not help in understanding the problem and giving a proper solution.

Cheers

The page number for this section is not constant. There are multiple pages in the PDF file and I am not sure which page this part will be on, so the code you shared is not working. I have attached an image as a sample.


@yashashwini2322

You need to read the whole PDF at once and then use the code, instead of reading each page. That way everything will be split.

Some extra text will come through. Let's first see if the first step works, then look at how to remove the rest.
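One possible way to handle the unknown page number and the extra text, sketched in Python under assumptions: the per-page strings below are hypothetical, and the only thing taken from this thread is that the main heading "DETAILED AUDIT FINDINGS" is constant. The idea is to join all pages first and then keep only what follows that heading.

```python
# Hypothetical per-page texts; the findings section starts on an unknown
# page and continues across a page break.
pages = [
    "Cover page\nIntroduction\n",
    "Summary\nDETAILED AUDIT FINDINGS\n"
    "1. Moderate - Access reviews\nFinding: Reviews were skipped.\n",
    "Root cause: No owner.\n",
]

# Join everything so the page on which the section starts no longer matters.
full_text = "".join(pages)

# Keep only the text after the constant main heading, if it is present.
marker = "DETAILED AUDIT FINDINGS"
start = full_text.find(marker)
findings_text = full_text[start + len(marker):] if start != -1 else full_text

print(findings_text)
```

Page headers or footers that repeat on every page would still end up inside the joined text and may need their own cleanup step.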

cheers

Hi @yashashwini2322 ,

As we understand it, the data is confidential or cannot be shared for testing on our end, so we suggest going through the tutorials below.

Understanding these tutorials will help you solve many more problems with regex or string manipulation and understand when to use each.

Try out your regex expressions on your data either in the Matches activity (the data can be tested there) or, if accessible, on the website below.

Hi,

This code also splits wherever a number appears inside a paragraph.

For example:

1.Detailed audit findings

Findings
We are removing the 22.4% of data in the year 2023.

Root cause
ABC xyz

2. Audit

Findings
We are not escalating 34 % of employees.

Root cause
Xyz abc

The code splits at 1. Deta… and also at 22.4,
and at 2. Audit and also at 34.

I want the split to happen only at 1. Detailed and 2. Audit.
A number that appears in the middle of a point should not cause a split.

Expected output:
Split 1 - 1. Detailed audit findings
Split 2 - 2. Audit
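To make the difference concrete, here is a small Python sketch using the example text from this post. It compares a loose pattern (no line anchor, unescaped dot) with a pattern that is anchored to the start of a line and escapes the dot. `re.MULTILINE` is the Python counterpart of .NET's `RegexOptions.Multiline`; the variable names are my own.

```python
import re

text = (
    "1.Detailed audit findings\n"
    "Findings\n"
    "We are removing the 22.4% of data in the year 2023.\n"
    "Root cause\n"
    "ABC xyz\n"
    "2.Audit\n"
    "Findings\n"
    "We are not escalating 34 % of employees.\n"
    "Root cause\n"
    "Xyz abc\n"
)

# Loose: digits followed by ANY character, anywhere in the text.
loose = re.findall(r"\d+.", text)

# Strict: digits followed by a literal dot, only at the start of a line.
strict = re.findall(r"^\d+\.", text, flags=re.MULTILINE)

# Splitting with the strict pattern yields one piece per numbered point.
pieces = [
    p for p in re.split(r"^\d+\.", text, flags=re.MULTILINE) if p.strip()
]

print(loose)
print(strict)
print(len(pieces))
```

The loose pattern also hits 22.4, 2023 and 34, while the strict one only hits the two point numbers, which matches the expected output described above.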

@yashashwini2322

Can you please check? It should not split at 22.4, because the pattern requires the number to be the first character.

Did you change anything in the split? Can you please show the split expression used in the For Each loop?

cheers

^\d+.
This is what I used in the split function, so it is splitting everywhere.
So

  1. Moderate
  2. Minor
  3. Minor
  4. Major

After the number and the “.”, the words above are the common ones.
So I only want to split where these occur.

@yashashwini2322

Can you please try this

^\d+\.\s*(?:Moderate|Major|Minor)

cheers
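A sketch of splitting only at numbered lines that are followed by one of the severity words, written in Python (3.7 or later, since it splits on a zero-width match). Two points to note: square brackets define a set of single characters, so alternation between whole words needs a group such as `(?:Moderate|Major|Minor)`; and the lookahead used here is my own variation so that the number and the severity word stay in each piece. The sample text is hypothetical.

```python
import re

# Hypothetical text, including a line that starts with "3." but is not a
# point heading and a decimal number inside a sentence.
text = (
    "1. Moderate - Access reviews\n"
    "Finding: 22.4% of accounts were not reviewed.\n"
    "2.Minor - Password policy\n"
    "Finding: See step\n"
    "3. of the procedure.\n"
    "3. Major - Backups\n"
    "Finding: Backups were not tested.\n"
)

# Zero-width match at the start of a line that begins with
# "<number>." followed by optional spaces and a severity word.
point_start = re.compile(
    r"^(?=\d+\.\s*(?:Moderate|Major|Minor)\b)", flags=re.MULTILINE
)

points = [p.strip() for p in point_start.split(text) if p.strip()]

for point in points:
    print(repr(point))
```

The line "3. of the procedure." stays inside the second point because no severity word follows the number.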