Regex solution needed for Read PDF Text activity output

jamiejam · April 1, 2021, 8:13pm

I have a process that reads a PDF to extract a data variable. The file is complex and was created by LiveCycle Designer. Opening the file is slow and a performance drain, so I would rather use a “Read PDF Text” activity and regex the text output if I can. The output from Read PDF Text contains a lot of text, but the one value I am looking for is found where the row reads “SECTION I”. Two rows below that, I need the last data value (on the right), which is a login id. Is a regex possible to spot the row that starts with “SECTION I” and then go down 2 rows and pick the value on the right?

SECTION I
Request Type Request Date Current Loginid
Deactivate 03/30/21 iamauserid

A couple of related points: the first 2 rows are always identical. The only data that varies is the date and the login id. Let me know if there is an option to parse like that.

prasath17 · April 1, 2021, 8:22pm

Hi @jamiejam - So you would like to extract the LoginID value, which is dynamic, from the row underneath SECTION I? Or the literal value “iamauserid”?

prasath17 · April 1, 2021, 8:33pm

@jamiejam - Assuming there is no other data after iamauserid, you can try this pattern…

Adrian_Star · April 1, 2021, 9:04pm

Pattern:

output = System.Text.RegularExpressions.Regex.Match(your_string,"(?<=SECTION I\s?.+\s?.+\s?\d{2}\/\d{2}\/\d{2}\s+).+").Value


Or

output = System.Text.RegularExpressions.Regex.Match(your_string,"(?<=SECTION I\s?.+\s?\w*\s+\d.*\s+).+").Value

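For anyone who wants to test the idea outside UiPath, here is a minimal sketch of the same "heading line, header row, then the value after the date" logic in Python. It is an illustration under assumptions, not the pattern from the posts above: Python's `re` module does not accept variable-length lookbehind, so the sketch uses a capture group instead. The function name `extract_login_id` and the sample string are made up for the example.

```python
import re

# Sample text copied from the question; real Read PDF Text output may differ.
SAMPLE = (
    "SECTION I\n"
    "Request Type Request Date Current Loginid\n"
    "Deactivate 03/30/21 iamauserid\n"
)

# [^\r\n]* keeps each part on its own line; \r?\n accepts LF and CRLF endings.
LOGIN_ID = re.compile(
    r"SECTION I[^\r\n]*\r?\n"                  # the "SECTION I" line
    r"[^\r\n]*\r?\n"                           # header row (Request Type ...)
    r"[^\r\n]*\d{2}/\d{2}/\d{2}[ \t]+(\S+)"    # data row: date, then login id
)


def extract_login_id(text):
    """Return the token after the date two rows below SECTION I, or None."""
    match = LOGIN_ID.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_login_id(SAMPLE))
```

A capture-group pattern like this should also be usable from .NET by reading the first group of the match rather than the whole match value, though that variant was not tested in this thread.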

jamiejam · April 5, 2021, 7:05pm

Thanks very much @prasath17 and @Adrian_Star for your expertise on this one. The value I am extracting from the PDF text is preceded and followed by other text. Of the options you noted, (?<=SECTION I: Account Action\s?.+\s?\w*\s+\d.*\s+).+ hits that value most precisely, but it also picks up the next section (the line of text that follows after the carriage return).

How can I get the regex to stop reading at that line, given that all the lines (sections) that follow will be different? In the sample below I want to pull back only iamauserid, but the regex also picks up “SECTION II More Stuff”.

SECTION I
Request Type Request Date Current Loginid
Deactivate 03/30/21 iamauserid
SECTION II More Stuff
Blah blah
SECTION III Even More Stuff
Blah blah blah
SECTION IV: So much stuff
Blah blah blah

prasath17 · April 5, 2021, 7:23pm

@jamiejam - I see, it is pulling the Right value with the pattern provided…

Adrian_Star · April 6, 2021, 6:29am

Hi,

output = System.Text.RegularExpressions.Regex.Match(your_string,"(?<=SECTION I\s?.+\s?\w*\s+\d.*\s+).+?(?=\s+SEC)").Value

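One way to keep the line break out of the result is to put the whitespace inside the lookahead, so the match ends at the login id and the next "SEC..." line only acts as a boundary. The sketch below shows that idea in Python under a few assumptions: the sample uses CRLF line endings (the question mentions a carriage return), the date is always in NN/NN/NN form, and the helper name `login_id_before_next_section` is hypothetical.

```python
import re

# Multi-section sample from the question, written with CRLF line endings.
TEXT = (
    "SECTION I\r\n"
    "Request Type Request Date Current Loginid\r\n"
    "Deactivate 03/30/21 iamauserid\r\n"
    "SECTION II More Stuff\r\n"
    "Blah blah\r\n"
    "SECTION III Even More Stuff\r\n"
    "Blah blah blah\r\n"
)

BOUNDED = re.compile(
    r"SECTION I\b"              # \b stops "SECTION II" from matching here
    r"[\s\S]*?"                 # lazily skip ahead to the first date
    r"\d{2}/\d{2}/\d{2}[ \t]+"
    r"(.+?)"                    # lazy capture of the value after the date
    r"(?=\s+SEC)"               # stop before the whitespace and next section
)


def login_id_before_next_section(text):
    """Return the value between the date and the next SEC... line, or None."""
    match = BOUNDED.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(repr(login_id_before_next_section(TEXT)))
```

Because the capture is lazy and the lookahead consumes nothing, the returned value carries no trailing carriage return or newline.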

jamiejam · May 25, 2021, 6:17pm

Good afternoon @Adrian_Star,

I am continuing to work on this automation and need assistance on another basic lookahead scenario. On a line in the PDF text I will get a distinct text value of
“DOC NO.”

On the next line, always at the end of that line, there is a value that begins with Y-99, where 99 is a 2-digit number.

NAME 2. PAYEE CODE 3. REPORT DATE 4. PAY YEAR 5. DOC NO.
Roli Masamino 112233445 5/13/2021 2021 Y-21-DOC-000000017

I am trying to capture that value at the end of line 2, either by the 4-digit year that precedes it or by the text pattern at its start. In either case it is always at the end of the next line. Your assistance is greatly appreciated. I am close with this but couldn’t quite get it.

prasath17 · May 26, 2021, 3:19am

Would you like to capture the entire text as I highlighted above?

There are multiple ways to fetch it. Here I have anchored on the date value:

(?<=DOC NO.\r?\n?.+\d{1,2}/\d{1,2}/\d{4}\s\S+\s).+

Hope this helps…
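The date-anchored idea can be tried out with a short Python sketch. This is an approximation rather than the exact pattern above: Python's `re` does not allow variable-length lookbehind, so a capture group stands in for it, and the sketch additionally assumes the 4-digit pay year always sits between the date and the document number. The name `doc_no_after_date` is made up for the example.

```python
import re

# Two-line sample copied from the question.
SAMPLE = (
    "NAME 2. PAYEE CODE 3. REPORT DATE 4. PAY YEAR 5. DOC NO.\n"
    "Roli Masamino 112233445 5/13/2021 2021 Y-21-DOC-000000017\n"
)

BY_DATE = re.compile(
    r"DOC NO\.[ \t]*\r?\n"                      # header line ends with "DOC NO."
    r"[^\r\n]*\b\d{1,2}/\d{1,2}/\d{4}[ \t]+"    # report date on the next line
    r"\d{4}[ \t]+"                              # 4-digit pay year
    r"(\S+)"                                    # the document number
)


def doc_no_after_date(text):
    """Return the token that follows the date and year, or None."""
    match = BY_DATE.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(doc_no_after_date(SAMPLE))
```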

Adrian_Star · May 26, 2021, 4:45pm

Hi,
my solution for you:

output = System.Text.RegularExpressions.Regex.Match(your_string,"(?<=DOC NO.(\s|\S)+(?=Y-\d{2}-))\b.+\b$").Value

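A point worth checking when the value sits in the middle of a longer document: in both .NET and Python, `$` only matches at the very end of the input unless multiline mode is switched on. The Python sketch below anchors on the `Y-99-` prefix and turns on `re.MULTILINE` so `$` can match at the end of the line. The trailing "More text" line and the name `doc_no_by_prefix` are made up for illustration.

```python
import re

# Sample from the question plus one hypothetical trailing line, so the
# document number is no longer at the very end of the input.
SAMPLE = (
    "NAME 2. PAYEE CODE 3. REPORT DATE 4. PAY YEAR 5. DOC NO.\n"
    "Roli Masamino 112233445 5/13/2021 2021 Y-21-DOC-000000017\n"
    "More text\n"
)

BY_PREFIX = re.compile(
    r"DOC NO\.[ \t]*\r?\n"           # header line ends with "DOC NO."
    r"[^\r\n]*?"                     # anything earlier on the next line
    r"\b(Y-\d{2}-\S+)[ \t]*\r?$",    # last token on that line, starts with Y-99-
    re.MULTILINE,                    # lets $ match at each line end
)


def doc_no_by_prefix(text):
    """Return the Y-99-... token at the end of the line after DOC NO., or None."""
    match = BY_PREFIX.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(doc_no_by_prefix(SAMPLE))
```

In UiPath the equivalent switch would be passing `RegexOptions.Multiline` to `Regex.Match`, assuming the rest of the pattern is adapted accordingly.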
