Read specific pdf text using regular expressions

I am trying to extract this piece of text from a pdf document using regular expressions. I am having difficulty with the format of this text. Any help is greatly appreciated!!

1 Like

We will only be able to help with the regular expression if you can give us the text scraped from the document as a whole. If this text is extracted as “KDZXL03.6063” from the document, then you can find it in a regex by assigning a string value to System.Text.RegularExpressions.Regex.Match(MyPdfString, "KDZXL03.6063").Value, where MyPdfString is the entire string you’ve extracted from the PDF.

2 Likes

Here is a piece of the pdf document, I have been extracting the other pieces of data by assigning them variables, and extracting like this:
EXAMPLE:
EngHorsepower Rating = System.Text.RegularExpressions.Regex.Match(readtextoutput,"(?<=Engine Horsepower Rating:).+").Value

Since this part I am trying to extract is a different format I am having trouble, but cannot directly quote it as you suggested because the string is different in each document. Hopefully this helps! Thanks!

1 Like

is that format constant across all the files?I mean 5alphabets,2 digits,one dot and 4 digits?

1 Like

Yes same format, just different letters & numbers for each document

1 Like

then it should be easy. use read pdf activity and use the regex generator in ui path
which will give you collection of matching strings

2 Likes

Can you explain this further?

1 Like

as @Anthony_Humphries said, we would need entire data scrapped from the pdf to proceed further

1 Like

How would I use the regex generator to extract this specific part of the text?

1 Like

Without knowing the surrounding text, it’s hard to say. If you can provide the pdf text that is output, that will help. If you would like to learn to do it yourself, I recommend these resources

rexegg.com

1 Like

This is an image of how the text is extracted (text around it) and I want to extract the Engine Family Name

1 Like

This might do the trick, but I can’t be 100% sure without having the exact text extracted rather than an image. I’m assuming the text is stored in string variable MyVar.

Try storing it in a string variable set to this:
System.Text.RegularExpressions.Regex.Replace(System.Text.RegularExpressions.Regex.Match(MyVar, "(?<=ENGINE FAMILY NAME \/ 12 Characters including any period \(\.\):\n\n).*$").Value, "\s", String.Empty)

Here is how I tested it in regex101.com:

1 Like

Thank you! Unfortunately that didn’t work. Any suggestions on other things to try?

1 Like

@amc
I would be good to have a text file with a sample of actual text.

import System.Text.RegularExpressions to keep lines shorter. I split into multiple statements on purpose but you can condense into one line too.

Assign (String)
pattern = "^\s*ENGINE\s+FAMILY\s+NAME.+?:\s+(?<family>^.+?$)"

Assign (RegexOptions)
options = RegexOptions.Multiline Or RegexOptions.Singleline

Assign (String)
family = Regex.Match(MyVar, pattern, options).Groups("family").ToString

Assign
family = Regex.Replace(family, "\s+", "")

2 Likes

Hi @amc

I recommend you to use the Form Extractor activity contained in Intelligent OCR package. It is easy to use, and it works really well to process scanned documents.

Here is an video about how to configure it

2 Likes

Thank you for this!! Can you show me how I would condense this into one line? Or how I would use them as separate statements?

1 Like

@amc

If you import System.Text.RegularExpressions
result = Regex.Replace(Regex.Match(MyVar, "^\s*ENGINE\s+FAMILY\s+NAME.+?:\s+(?<family>^.+?$)", RegexOptions.Multiline Or RegexOptions.Singleline).Groups("family").ToString, "\s+", "")

If not
result = System.Text.RegularExpressions.Regex.Replace(System.Text.RegularExpressions.Regex.Match(MyVar, "^\s*ENGINE\s+FAMILY\s+NAME.+?:\s+(?<family>^.+?$)", System.Text.RegularExpressions.RegexOptions.Multiline Or System.Text.RegularExpressions.RegexOptions.Singleline).Groups("family").ToString, "\s+", "")

Or
Use an Assign activity each time I mention it and set the variable to the type beween parentheses.

2 Likes

I did it as multiple separate lines like you did above and it worked!! Thank you so much!

I also need to extract the two numbers on either side of the decimal place in the family name, so in the case of this document “3.6”. Do you know how I would do this?
(Note: it is a different number for each document but the same format!)

1 Like

number = System.Text.RegularExpressions.Regex.Match(result, ".\..").Value

2 Likes

Thank you!! Also, I have been trying to extract both the purchase date and the in-service date separately, and I seem to be getting the in-service date for the purchase date too.

Do you know how I would extract the Purchase Date?

1 Like