How to dynamically extract data from PDF

yashashwini2322 · July 17, 2023, 6:01am

Hi all,

I have multiple PDFs and want to extract the data dynamically from all of them.
I have attached a screenshot for reference. According to the screenshot,
I want to extract the fields below:

1.Finding
2.Root cause
3.Risk Assessment
4.Recommendation
5.Management Response and target date
6.Responsible Party

Note: The points highlighted in red in the image are the separate points. There are more than 10 such points, and in each point the fields above need to be extracted dynamically.

I’m new to PDF extraction, so please suggest how to do it.

vrdabberu · July 17, 2023, 6:05am

@yashashwini2322

Use the regex expression below:

(?<=\n)[A-Z]+[a-z]+.*(?=:\s+)

Regards

lrtetala · July 17, 2023, 6:14am

Hi @yashashwini2322

1. Read the PDF file using the Read PDF Text activity and store it in a variable.
2. Use regex to extract the data:

(?<=\n)[A-Z].*(?=:)
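
A minimal sketch of what this pattern picks up, written in Python only because it is easy to run. The sample text is made up; in UiPath the same pattern string would go into the Matches activity or System.Text.RegularExpressions.Regex. The lookbehind needs a newline before the heading, so a heading on the very first line of the text is not matched.

```python
import re

# Hypothetical text, standing in for the output of Read PDF Text.
sample_text = "\nFinding: Passwords are shared.\nRoot cause: No policy exists.\n"

# Same pattern as above: a capital letter at the start of a line,
# then everything up to the last colon on that line.
heading_pattern = re.compile(r"(?<=\n)[A-Z].*(?=:)")

headings = heading_pattern.findall(sample_text)
print(headings)
```

Because `.*` is greedy, a line whose value also contains a colon (a time such as 10:30, for example) would be matched up to that later colon, so the results are worth checking against real data.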

I hope it helps!!

rlgandu · July 17, 2023, 6:16am

@yashashwini2322

Hi,

Use Read PDF Text to read the PDF and store it in a string variable.
Use regex or string manipulation to extract the data from the PDF:
System.Text.RegularExpressions.Regex.Match(YourPdfText, "(?<=\n)[A-Z]+[a-z]+.*(?=:)").Value

yashashwini2322 · July 17, 2023, 6:25am

I want to extract the data that appears under the headings listed above. Those headings are the subheadings.
The point headings are

1.Moderate…
2.Minor…

and the main heading is
Detailed audit findings

Only the main heading “DETAILED AUDIT FINDINGS” is constant.
The point headings (1…, 2…) are not constant,
but the subheadings inside points 1…, 2… are constant.

lrtetala · July 17, 2023, 6:29am

@yashashwini2322

Please mark the required fields in yellow and send it to me.

rlgandu · July 17, 2023, 6:38am

@yashashwini2322

Hi

Using this syntax you are able to extract the
2.Root cause
3.Risk Assessment
4.Recommendation
5.Management Response and target date
6.Responsible Party

yashashwini2322 · July 17, 2023, 6:40am

The red highlighted parts are the common headings.
The green highlighted parts are the subheadings with point numbers.
The yellow highlighted parts are the fields to extract dynamically.

Note: There are 10 or more green numbered subheadings like this in the PDF.

Anil_G · July 17, 2023, 7:12am

@yashashwini2322

One way would be: if each subheading (the green part) starts with a number, you can do the following.

Say all the data is stored in a variable str.

Then, to get each point separately along with its related data, we can use

System.Text.RegularExpressions.Regex.Split(str,"^\d+\.")

The above will split on the points 1., 2., etc. and give each point together with its explanation.

Then you can use a loop to go through each of them and split again, this time on the field headings (Finding: etc.):

CurrentItem.Split({"Finding:", "Root Cause:", and all other headings}, StringSplitOptions.None) and let's say the result is stored in arr, a variable of type array of string.

This gives an array of strings: arr(0) will give the point heading, arr(1) will give the finding, and so on.
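
The two-step split described above can be sketched as follows. This is a Python illustration with made-up sample text, not the contents of the attached workflow. It assumes every field heading is followed by a colon and spelled exactly as listed. In .NET, `^` only matches at the start of each line when RegexOptions.Multiline is passed, which corresponds to `re.MULTILINE` here.

```python
import re

# Assumed spellings: every heading is followed by a colon in the extracted text.
FIELD_HEADINGS = [
    "Finding:", "Root cause:", "Risk Assessment:", "Recommendation:",
    "Management Response and target date:", "Responsible Party:",
]

def split_points(text):
    """Split the text at line-leading numbers such as '1.' or '2.'."""
    parts = re.split(r"^\d+\.", text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

def split_fields(point):
    """Split one point on the field headings; item 0 is the point title."""
    pattern = "|".join(re.escape(h) for h in FIELD_HEADINGS)
    return [piece.strip() for piece in re.split(pattern, point)]

# Hypothetical sample, standing in for the text read from the PDF.
sample = (
    "1. Moderate access issue\n"
    "Finding: Passwords are shared.\n"
    "Root cause: No policy exists.\n"
    "2. Minor logging gap\n"
    "Finding: Logs are kept for 7 days.\n"
    "Root cause: Default settings.\n"
)

for point in split_points(sample):
    print(split_fields(point))
```

The heading match is case-sensitive, so "Root Cause:" and "Root cause:" are different strings; the list has to use whatever spelling the PDF text actually contains.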

Hope this helps

Cheers

yashashwini2322 · July 17, 2023, 10:28am

Sorry …Can you elaborate more…?

Anil_G · July 17, 2023, 10:58am

@yashashwini2322

Please check this sample

Sequence.xaml (8.7 KB)

cheers

yashashwini2322 · July 17, 2023, 12:00pm

No… it’s not working

Anil_G · July 17, 2023, 12:17pm

@yashashwini2322

Can you please tell what is not working?

I tried the same and I am able to see the exact output.

What difference did you see? Just saying “not working” does not help in understanding the issue and giving a proper solution.

Cheers

yashashwini2322 · July 17, 2023, 2:24pm

The page number for this section is not constant. There are multiple pages in the PDF file and I am not sure which page this part will be on, so the code you shared is not working. I have attached an image as a sample.

Anil_G · July 17, 2023, 2:39pm

@yashashwini2322

You need to read the whole PDF at once and then use the code, instead of reading each page. That way everything will be split.

A few extra pieces of text will come through. Let's first see if this step works, and then look at how to remove the rest.
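
One possible way to drop the extra text in front, sketched in Python under the assumption that the constant main heading “DETAILED AUDIT FINDINGS” appears in the extracted text exactly as written. The `pages` list and the function name are hypothetical and only stand in for the text of the whole PDF.

```python
MAIN_HEADING = "DETAILED AUDIT FINDINGS"

def findings_section(pages):
    """Join the text of all pages and return everything after the main heading."""
    full_text = "\n".join(pages)
    idx = full_text.find(MAIN_HEADING)
    if idx == -1:
        return ""  # heading not found anywhere in the document
    return full_text[idx + len(MAIN_HEADING):].lstrip()

# Hypothetical page texts; the section starts on page 2 and runs onto page 3.
pages = [
    "Cover page",
    "Introduction\nDETAILED AUDIT FINDINGS\n1. Moderate\nFinding: x",
    "Root cause: y",
]
section = findings_section(pages)
print(section)
```

Because the pages are joined first, it does not matter which page the heading lands on. If the heading also appears earlier, in a table of contents for example, the first hit may not be the right one.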

cheers

supermanPunch · July 17, 2023, 3:08pm

Hi @yashashwini2322 ,

As we understand that the data is confidential and cannot be shared for testing on our end, we suggest you go through the tutorials below.

Understanding the above tutorials would help you solve many more problem statements with regex or string manipulation and understand the right criteria for using them.

Try out the regex expressions on your data either in the Matches activity (the data can be tested there) or, if accessible, on the website below.

yashashwini2322 · July 20, 2023, 6:53am

Hi,

This code also splits wherever a number is present inside a paragraph.

For example.

1.Detailed audit findings

Findings
We are removing the 22.4% of data in the year 2023.

Root cause
ABC xyz

2.Audit

Findings
We are not escalating 34 % of employees.

Root cause
Xyz abc

The code is splitting at 1.Deta… and also at 22.4,
and at 2.Audit and also at 34.

I want the split to happen only at 1.Detailed and 2.Audit.
Any number that comes in the middle of a point should not cause a split.

Output…
Split 1- 1.detailed audit findings
Split 2- 2.audit

Anil_G · July 20, 2023, 7:08am

@yashashwini2322

Can you please check? It should not split at 22.4, as in the pattern I gave the number has to be the first character of the line.

Did you happen to change anything in the split? Can you please show the split argument used in the for loop?
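
A small Python sketch of the difference the `^` anchor makes, using the example text from the post above with the point numbers 1. and 2. at the start of their lines. With .NET Regex the anchored behaviour needs RegexOptions.Multiline, which corresponds to `re.MULTILINE` here.

```python
import re

text = (
    "1.Detailed audit findings\n"
    "Findings\n"
    "We are removing the 22.4% of data in the year 2023.\n"
    "2.Audit\n"
    "Findings\n"
    "We are not escalating 34 % of employees.\n"
)

# Without the anchor, any digits followed by a dot trigger a split,
# including the "22." inside "22.4%".
unanchored = re.split(r"\d+\.", text)

# With the anchor and multiline mode, only numbers at the start of a line split.
anchored = re.split(r"^\d+\.", text, flags=re.MULTILINE)

print(len(unanchored), len(anchored))
```

A number that happens to start a wrapped line would still match the anchored pattern, so the pattern may need to be tightened further for real PDF text.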

cheers

yashashwini2322 · July 20, 2023, 7:18am

^\d+.
This is what I used in the split function, and it is splitting everywhere.

Moderate
Minor
Minor
Major

The words above always appear next to the number and “.”
So I only want to split where these occur.

Anil_G · July 20, 2023, 7:20am

@yashashwini2322

Can you please try this

^\d+\.\s+(?:Moderate|Major|Minor)
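
A hedged Python sketch of splitting only at lines that start with a number, a dot and one of the severity words. Alternatives are grouped with `(?:...)`; square brackets would instead define a character class that matches a single character. The sample text and the function name are made up, `\s*` is used so that a missing space after the dot is tolerated, and the point heading is kept in each piece rather than consumed by the split.

```python
import re

# Number, dot, optional whitespace, then one of the severity words, at line start.
POINT_HEADING = re.compile(r"^\d+\.\s*(?:Moderate|Major|Minor)\b", re.MULTILINE)

def split_points(text):
    """Return one string per point, each starting at its own heading line."""
    starts = [m.start() for m in POINT_HEADING.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]

# Hypothetical sample; the line starting with 22.4% must not start a new point.
sample = (
    "DETAILED AUDIT FINDINGS\n"
    "1. Moderate\n"
    "Findings\n"
    "22.4% of records were removed in 2023.\n"
    "2. Minor\n"
    "Findings\n"
    "34 % of employees were not escalated.\n"
)

points = split_points(sample)
for p in points:
    print(repr(p))
```

Text before the first matching heading, such as the main heading, is dropped by this approach, and a new severity word in the PDF would need to be added to the pattern.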

cheers
