Read specific pdf text using regular expressions

amc · April 9, 2020, 8:46pm

I am trying to extract this piece of text from a pdf document using regular expressions. I am having difficulty with the format of this text. Any help is greatly appreciated!!

Screenshot (8)

Anthony_Humphries · April 10, 2020, 11:57am

We can only help with the regular expression if you give us the full text scraped from the document. If this text is extracted as “KDZXL03.6063”, you can find it by assigning a string variable the value System.Text.RegularExpressions.Regex.Match(MyPdfString, "KDZXL03\.6063").Value, where MyPdfString is the entire string you’ve extracted from the PDF. The period is escaped because an unescaped . in a regex matches any character.

amc · April 11, 2020, 3:46pm

Here is a piece of the PDF document. I have been extracting the other pieces of data by assigning them to variables, like this:
EXAMPLE:
EngHorsepowerRating = System.Text.RegularExpressions.Regex.Match(readtextoutput, "(?<=Engine Horsepower Rating:).+").Value

Since the part I am trying to extract is in a different format, I am having trouble, and I cannot quote it directly as you suggested because the string is different in each document. Hopefully this helps! Thanks!

Sriharisai_Vasi · April 11, 2020, 4:35pm

Is that format constant across all the files? I mean 5 letters, 2 digits, one dot, and 4 digits?

amc · April 11, 2020, 4:43pm

Yes same format, just different letters & numbers for each document

Sriharisai_Vasi · April 11, 2020, 4:47pm

Then it should be easy. Use the Read PDF Text activity and the regex generator in UiPath, which will give you a collection of matching strings.
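
The format confirmed above (5 letters, 2 digits, one dot, 4 digits) can be written as a pattern. The following is only a rough sketch in Python for trying the pattern outside UiPath. It assumes the letters are always uppercase, and the sample text is hypothetical because the real PDF output was not posted. The pattern string itself uses only basic syntax, so it should also be usable with Regex.Matches in .NET.

```python
import re

# Assumption: 5 uppercase letters, 2 digits, a literal dot, 4 digits.
FAMILY_PATTERN = re.compile(r"\b[A-Z]{5}\d{2}\.\d{4}\b")

# Hypothetical sample text standing in for the Read PDF Text output.
sample_text = "ENGINE FAMILY NAME:\n\nKDZXL03.6063\nEngine Horsepower Rating: 74"

# findall returns every substring that fits the format.
matches = FAMILY_PATTERN.findall(sample_text)
print(matches)
```

If the extracted text contains stray spaces inside the value, this strict pattern will not match, which is why seeing the actual extracted text matters.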

amc · April 11, 2020, 8:27pm

Can you explain this further?

Sriharisai_Vasi · April 12, 2020, 6:45am

As @Anthony_Humphries said, we would need the entire text scraped from the PDF to proceed further.

amc · April 23, 2020, 3:08pm

How would I use the regex generator to extract this specific part of the text?

Anthony_Humphries · April 23, 2020, 3:20pm

Without knowing the surrounding text, it’s hard to say. If you can provide the pdf text that is output, that will help. If you would like to learn to do it yourself, I recommend these resources

amc · April 23, 2020, 3:32pm

This is an image of how the text is extracted (text around it) and I want to extract the Engine Family Name

Anthony_Humphries · April 23, 2020, 3:40pm

This might do the trick, but I can’t be 100% sure without having the exact text extracted rather than an image. I’m assuming the text is stored in string variable MyVar.

Try storing it in a string variable set to this:
System.Text.RegularExpressions.Regex.Replace(System.Text.RegularExpressions.Regex.Match(MyVar, "(?<=ENGINE FAMILY NAME \/ 12 Characters including any period \(\.\):\n\n).*$").Value, "\s", String.Empty)

Here is how I tested it in regex101.com:

amc · April 23, 2020, 9:53pm

Thank you! Unfortunately that didn’t work. Any suggestions on other things to try?

msan · April 23, 2020, 11:26pm

@amc
It would be good to have a text file with a sample of the actual text.

Import System.Text.RegularExpressions to keep lines shorter. I split this into multiple statements on purpose, but you can condense it into one line too.

Assign (String)
pattern = "^\s*ENGINE\s+FAMILY\s+NAME.+?:\s+(?<family>^.+?$)"

Assign (RegexOptions)
options = RegexOptions.Multiline Or RegexOptions.Singleline

Assign (String)
family = Regex.Match(MyVar, pattern, options).Groups("family").ToString

Assign
family = Regex.Replace(family, "\s+", "")
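
For anyone who wants to try the same steps outside UiPath, here is a rough Python sketch of them. It is an approximation, not the .NET code: Python writes the named group as (?P<family>...) instead of (?<family>...), and re.MULTILINE | re.DOTALL stands in for RegexOptions.Multiline Or RegexOptions.Singleline. The function name and the sample text are hypothetical, since the real extracted text was only shown as an image.

```python
import re

# Same pattern as above, with Python's named-group syntax.
FAMILY_RE = re.compile(
    r"^\s*ENGINE\s+FAMILY\s+NAME.+?:\s+(?P<family>^.+?$)",
    re.MULTILINE | re.DOTALL,
)


def extract_family(pdf_text: str) -> str:
    """Return the line after the ENGINE FAMILY NAME heading, without whitespace."""
    match = FAMILY_RE.search(pdf_text)
    if match is None:
        return ""  # mirrors the empty string .NET gives for a failed match
    # Second step from the post: strip any whitespace inside the value.
    return re.sub(r"\s+", "", match.group("family"))


# Hypothetical sample; the space inside the value is an assumption.
sample = (
    "Some header\n"
    "ENGINE FAMILY NAME / 12 Characters including any period (.):\n"
    "\n"
    "KDZXL 03.6063\n"
    "Engine Horsepower Rating: 74\n"
)
print(extract_family(sample))
```

The heading is matched loosely (anything up to the first colon), so small differences in the wording after NAME should not break the match.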

AndresTarazona · April 23, 2020, 11:44pm

Hi @amc

I recommend using the Form Extractor activity in the Intelligent OCR package. It is easy to use, and it works really well for processing scanned documents.

Here is a video on how to configure it

amc · April 24, 2020, 3:38pm

Thank you for this!! Can you show me how I would condense this into one line? Or how I would use them as separate statements?

msan · April 24, 2020, 3:42pm

@amc

If you import System.Text.RegularExpressions
result = Regex.Replace(Regex.Match(MyVar, "^\s*ENGINE\s+FAMILY\s+NAME.+?:\s+(?<family>^.+?$)", RegexOptions.Multiline Or RegexOptions.Singleline).Groups("family").ToString, "\s+", "")

If not
result = System.Text.RegularExpressions.Regex.Replace(System.Text.RegularExpressions.Regex.Match(MyVar, "^\s*ENGINE\s+FAMILY\s+NAME.+?:\s+(?<family>^.+?$)", System.Text.RegularExpressions.RegexOptions.Multiline Or System.Text.RegularExpressions.RegexOptions.Singleline).Groups("family").ToString, "\s+", "")

Or
Use an Assign activity each time I mention it, and set the variable to the type between parentheses.

amc · April 24, 2020, 3:48pm

I did it as multiple separate lines like you did above and it worked!! Thank you so much!

I also need to extract the two numbers on either side of the decimal place in the family name, so in the case of this document “3.6”. Do you know how I would do this?
(Note: it is a different number for each document but the same format!)

msan · April 24, 2020, 3:53pm

number = System.Text.RegularExpressions.Regex.Match(result, ".\..").Value
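
Note that ".\.." accepts any character on either side of the period. If the family name always has digits there, as in the format described earlier in the thread, a stricter variant is "\d\.\d". A small Python sketch of that variant, with a hypothetical function name, just to show the behaviour:

```python
import re


def digits_around_dot(family: str) -> str:
    """Return the digit before and the digit after the first such period."""
    match = re.search(r"\d\.\d", family)
    # An empty string mirrors what .Value gives in .NET when nothing matches.
    return match.group(0) if match else ""


print(digits_around_dot("KDZXL03.6063"))
```

For the family name in this thread both patterns give the same result; the stricter one only differs when a non-digit sits next to the period.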

amc · April 24, 2020, 4:05pm

Thank you!! Also, I have been trying to extract both the purchase date and the in-service date separately, and I seem to be getting the in-service date for the purchase date too.

Do you know how I would extract the Purchase Date?
