I am trying to extract this piece of text from a pdf document using regular expressions. I am having difficulty with the format of this text. Any help is greatly appreciated!!
We will only be able to help with the regular expression if you can give us the text scraped from the document as a whole. If this text is extracted as “KDZXL03.6063” from the document, then you can find it in a regex by assigning a string value to
System.Text.RegularExpressions.Regex.Match(MyPdfString, "KDZXL03.6063").Value, where
MyPdfString is the entire string you’ve extracted from the PDF.
Here is a piece of the pdf document, I have been extracting the other pieces of data by assigning them variables, and extracting like this:
EngHorsepower Rating = System.Text.RegularExpressions.Regex.Match(readtextoutput,"(?<=Engine Horsepower Rating:).+").Value
Since this part I am trying to extract is a different format I am having trouble, but cannot directly quote it as you suggested because the string is different in each document. Hopefully this helps! Thanks!
is that format constant across all the files?I mean 5alphabets,2 digits,one dot and 4 digits?
Yes same format, just different letters & numbers for each document
then it should be easy. use read pdf activity and use the regex generator in ui path
which will give you collection of matching strings
Can you explain this further?
as @Anthony_Humphries said, we would need entire data scrapped from the pdf to proceed further
How would I use the regex generator to extract this specific part of the text?
Without knowing the surrounding text, it’s hard to say. If you can provide the pdf text that is output, that will help. If you would like to learn to do it yourself, I recommend these resources
This is an image of how the text is extracted (text around it) and I want to extract the Engine Family Name
This might do the trick, but I can’t be 100% sure without having the exact text extracted rather than an image. I’m assuming the text is stored in string variable
Try storing it in a string variable set to this:
System.Text.RegularExpressions.Regex.Replace(System.Text.RegularExpressions.Regex.Match(MyVar, "(?<=ENGINE FAMILY NAME \/ 12 Characters including any period \(\.\):\n\n).*$").Value, "\s", String.Empty)
Here is how I tested it in regex101.com:
Thank you! Unfortunately that didn’t work. Any suggestions on other things to try?
I would be good to have a text file with a sample of actual text.
System.Text.RegularExpressions to keep lines shorter. I split into multiple statements on purpose but you can condense into one line too.
pattern = "^\s*ENGINE\s+FAMILY\s+NAME.+?:\s+(?<family>^.+?$)"
options = RegexOptions.Multiline Or RegexOptions.Singleline
family = Regex.Match(MyVar, pattern, options).Groups("family").ToString
family = Regex.Replace(family, "\s+", "")
I recommend you to use the Form Extractor activity contained in Intelligent OCR package. It is easy to use, and it works really well to process scanned documents.
Here is an video about how to configure it
Thank you for this!! Can you show me how I would condense this into one line? Or how I would use them as separate statements?
If you import System.Text.RegularExpressions
result = Regex.Replace(Regex.Match(MyVar, "^\s*ENGINE\s+FAMILY\s+NAME.+?:\s+(?<family>^.+?$)", RegexOptions.Multiline Or RegexOptions.Singleline).Groups("family").ToString, "\s+", "")
result = System.Text.RegularExpressions.Regex.Replace(System.Text.RegularExpressions.Regex.Match(MyVar, "^\s*ENGINE\s+FAMILY\s+NAME.+?:\s+(?<family>^.+?$)", System.Text.RegularExpressions.RegexOptions.Multiline Or System.Text.RegularExpressions.RegexOptions.Singleline).Groups("family").ToString, "\s+", "")
Use an Assign activity each time I mention it and set the variable to the type beween parentheses.
I did it as multiple separate lines like you did above and it worked!! Thank you so much!
I also need to extract the two numbers on either side of the decimal place in the family name, so in the case of this document “3.6”. Do you know how I would do this?
(Note: it is a different number for each document but the same format!)
number = System.Text.RegularExpressions.Regex.Match(result, ".\..").Value