Extracting PDF text at a certain position

Hi Friends,

I need to extract the header of each PDF page. The header text is not the same on every page, but its position is fixed. I was trying to use regex, but I am not sure how to apply it in this case. Can anyone help me with this?

Thanks in advance!

Could you share a sample PDF?

Can you please provide sample data?

This is the top section of one of the PDF pages. I need to extract the name, which in this case is ‘Kiran Malhotra’.

Fine.
We can use the Read PDF activity and store the output in a variable of type String named str_input.
Can I see the output we get after reading the file, so that we can work out how to use Regex or the Split method?
Kindly print it with a Write Line activity and share a screenshot of the Output panel.

Cheers @Rita_Balmukund_Jaisw

Read the PDF file and store the text in a variable, say Var.
Use an Assign activity:
(To) arr_str: (From) Var.Split({" "}, StringSplitOptions.RemoveEmptyEntries)
Use another Assign activity:
(To) Name: (From) arr_str(0) + " " + arr_str(1)

Read the PDF and save it as one variable,
then use:
System.Text.RegularExpression.Regex.Match(OutputPDF, "(?<=(From))(.*?)(?=(To))").Value.Trim

Cheers @Rita_Balmukund_Jaisw

XXX
XX years
The Professional Me The Other Me
• Hobby 1
• Hobby 2
• Hobby 3
Apart from work
• Topic 1
• Topic 2
• Topic 3
Ice breakers
• Friends call me - Answer 1
• Superpower – Answer 2
• If I am AC head – Answer 3
Rapid fire!
• Fact 1
• Fact 2
• Fact 3
Fun facts :slight_smile:
Describe yourself in pictures
Bangalore Home town

Formal
Headshot
Personal
Picture

Kiran Malhotra General Management Team Building
• Education 1
• Education 2
Education

• : XX yrs
• Prior Experience: XX yrs
Experience
• Project 1 details
• Project 2 details

The data on each PDF page changes, so I cannot use the Split function to extract the text in between. That is why I was thinking of using the position to extract the text.

@Rita_Balmukund_Jaisw : Use this method

Thanks Shriharsha. However, this gives a compile error: RegularExpression is not a member of Text.

@Rita_Balmukund_Jaisw Please share a screenshot of the error.

System.Text.RegularExpressions.Regex.Match(pdfOut, "(?<=From)(.*?)(?=To)").Value.Trim


Here is the screenshot.

@Rita_Balmukund_Jaisw


System.Text.RegularExpressions.Regex.Match(pdfOut, "(?<=From)(.*?)(?=To)").Value.Trim

Use this syntax:
- pdfOut: the output variable that holds your PDF text
- From: the word after which the capture should start
- To: the word before which the capture should stop
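
For anyone who wants to try the pattern outside UiPath, here is a minimal sketch in Python; for this pattern the lookbehind/lookahead syntax is the same as in .NET. The function name text_between and the sample string are made up for illustration and are not part of the thread's workflow.

```python
import re

def text_between(text: str, start_word: str, end_word: str) -> str:
    """Return the trimmed text between the first start_word and the next end_word."""
    # re.escape keeps any regex metacharacters in the words literal.
    pattern = "(?<=" + re.escape(start_word) + ")(.*?)(?=" + re.escape(end_word) + ")"
    match = re.search(pattern, text)
    # Like Regex.Match(...).Value in .NET, fall back to an empty string on no match.
    return match.group(0).strip() if match else ""

sample = "From Kiran Malhotra To General Management"  # hypothetical page text
print(text_between(sample, "From", "To"))  # Kiran Malhotra
```

Note that "." does not cross a line break by default, so both words have to sit on the same line for this to match.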

Sorry, my bad. But in my case the From and To words are different on each page of the PDF, so I cannot give a fixed word.

Are there any constants on every page?
How do you know which data needs to be captured?

@Rita_Balmukund_Jaisw The highlighted text in the screenshot ("General Management") is the constant in the PDF:

System.Text.RegularExpressions.Regex.Match(pdfOut, "(?<=\r\n)(.*?)(?=General Management)").Value.Trim
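
A rough Python sketch of the same idea, in case it helps to test it against sample text first. It is a variant, not the exact expression above: it anchors on the start of a line with ^ and re.MULTILINE instead of a lookbehind on \r\n, so it also works when the extracted text uses plain \n line breaks. The function name header_before is hypothetical, and the sample text is shortened from the output posted earlier in this thread.

```python
import re

def header_before(page_text: str, constant: str) -> str:
    """Return whatever precedes `constant` on the line where it first appears."""
    # ^ with re.MULTILINE anchors at the start of any line; "." does not cross
    # a line break, so the match stays on the line that contains the constant.
    pattern = r"^(.*?)(?=" + re.escape(constant) + ")"
    match = re.search(pattern, page_text, flags=re.MULTILINE)
    return match.group(1).strip() if match else ""

page_text = (
    "Formal\r\nHeadshot\r\n"
    "Kiran Malhotra General Management Team Building\r\n"
    "• Education 1"
)
print(header_before(page_text, "General Management"))  # Kiran Malhotra
print(header_before(page_text, "Team Building"))       # Kiran Malhotra General Management
```

The second call shows what happens when a later word on the same line is used as the constant: everything before it on that line comes back.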

Hello Rita,

You can use this in an Assign activity:

Name_Variable = Split(Split(YourPDFTextVariable.ToString, "General Management")(0), vbCrLf)(1)

Tell me if it works :wink:

Kind regards,
Daniel
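
A small Python sketch of the split idea, for checking the logic against sample text. One difference is deliberate: in the sample output posted earlier the name does not appear to be on the second line of the page, so a fixed index such as (1) may pick a different line. This sketch takes the last line before the constant instead. The function name name_by_split is hypothetical and the sample text is shortened from the thread.

```python
def name_by_split(page_text: str, constant: str) -> str:
    """Cut the text at the constant, then keep the last line before the cut."""
    before, found, _ = page_text.partition(constant)
    if not found:
        return ""  # the constant does not occur on this page
    # Keep only what follows the last line break before the constant.
    return before.rpartition("\n")[2].strip()

page_text = (
    "XXX\r\nXX years\r\n"
    "Kiran Malhotra General Management Team Building\r\n"
    "• Education 1"
)
print(name_by_split(page_text, "General Management"))  # Kiran Malhotra
```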

Okay… I am using "Team Building" as the constant here. It then returns the name and designation together, which I can separate in Excel.

Thanks for the help…:slight_smile:


Use iTextSharp in C# to capture position-based data.

Build a DLL in C# that uses the iTextSharp library to fetch position-based data from the PDF; you can then use that DLL in UiPath.