Issue With Getting Multiple Lines of a PDF Using Regex/UiPath

Hey there, everybody! I recently started an internship as an RPA developer, and I have run into an issue trying to scrape data from a PDF. I cannot share the details of the actual PDF, so I will give you a made-up scenario instead:

Say I have a PDF of ordered foods. I have read the data into a variable, extracted the relevant part with Substring, split it on Environment.NewLine, and I can access each line using a For Each loop. My workflow looks something like this:

For Each:
CurrentItem IN Split(PDFSubstringWithItems, Environment.NewLine)
Log Message: CurrentItem

With this, I have been able to access each line of data I need. The data looks something like this (let's say it is a PO for a grocery store):

1 FGHI876 590843 BG 10.00 $25.00 $250.00
Organic Green Apples
2 POJQ3498 78654 BG 5.00 $25.00 $125.00
Fresh Ripe Oranges
3 MNGET4321 09473 BG 4.00 $8.00 $32.00
Frozen Angus Beef Burgers Grass Fed

1 is the line number, FGHI876 is the item code, and everything following it on that line is the rest of that item's details. The next line is the item description. My question is this: is there a good regex that would get each item individually, both the item code and the item description? I have one idea, but I am not sure how efficient it is.

1) I know I have to use a loop, and by checking for "BG" in the line, I am able to get each item code and its related details just fine. I figure I can continue to split the string into the values that I need. However, this doesn't get the item description, which is information that I also need.
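For anyone reading along, the loop idea above can be extended to pick up the description by looking at the line that follows each "BG" line. Below is a minimal sketch in Python (not UiPath; the same logic could be built with a For Each and an index). The function name pair_items and the assumption that "BG" is always the fourth whitespace-separated field are mine, based only on the sample data in this post.

```python
# Sample text copied from the post above; real PDF output may differ.
pdf_text = """1 FGHI876 590843 BG 10.00 $25.00 $250.00
Organic Green Apples
2 POJQ3498 78654 BG 5.00 $25.00 $125.00
Fresh Ripe Oranges
3 MNGET4321 09473 BG 4.00 $8.00 $32.00
Frozen Angus Beef Burgers Grass Fed"""


def pair_items(text):
    """Pair each detail line with the description line that follows it."""
    # Drop blank lines and surrounding whitespace so indexes stay aligned.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    items = []
    for i, line in enumerate(lines):
        fields = line.split()
        # Assumption: a detail line has "BG" as its fourth field.
        if len(fields) >= 4 and fields[3] == "BG":
            # Assumption: the very next line is this item's description.
            description = lines[i + 1] if i + 1 < len(lines) else ""
            items.append({"item_code": fields[1], "description": description})
    return items


items = pair_items(pdf_text)
for item in items:
    print(item["item_code"], "-", item["description"])
```

This only holds if every item has exactly one description line directly below it; a missing or wrapped description would need an extra check.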

This is the first time I’ve posted on here, but I have lurked on here a bunch of times to help solve other issues I’ve had. I appreciate any and all feedback from more experienced developers! Thank you so much!

Hi @Mike_Brown1 ,

Try the following regex. Group 1 will give you the item code and group 2 will give you the description:

\d+\s(\w+).*\n([a-z A-Z]+)

Access the groups of each matched item by running a For Each over the collection of matches.
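As a rough illustration of the regex above, here is a small Python sketch that runs it against the sample data from the question and reads group 1 and group 2 from each match. In UiPath the equivalent would be a For Each over the matches collection; the variable names here are only for illustration.

```python
import re

# Sample text copied from the question; real PDF output may differ.
pdf_text = """1 FGHI876 590843 BG 10.00 $25.00 $250.00
Organic Green Apples
2 POJQ3498 78654 BG 5.00 $25.00 $125.00
Fresh Ripe Oranges
3 MNGET4321 09473 BG 4.00 $8.00 $32.00
Frozen Angus Beef Burgers Grass Fed"""

# The regex from the reply: group 1 = item code, group 2 = description.
pattern = re.compile(r"\d+\s(\w+).*\n([a-z A-Z]+)")

# Collect (item_code, description) pairs, one per match.
pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(pdf_text)]

for code, description in pairs:
    print(code, "-", description)
```

Note that the description group only allows letters and spaces, so a description containing digits or punctuation would be cut short at the first such character.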

Regards,

Hi @Mike_Brown1 ,

Is the requirement only for the item code and item description, or do you need to capture the other values as well? If so, you could try the expression below:

^(\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(.*?)\s+(.*?)\s+(.*?)\s(.*)

Also, from the data provided, it looks like the PreserveFormatting option was not enabled when reading the PDF text. Check whether enabling it gives you output that is easier to work with.
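To show what the eight groups of the expression above capture, here is a minimal Python sketch run against the sample data from the question. The ^ anchor needs multiline mode to match at the start of every line (re.MULTILINE here, RegexOptions.Multiline in .NET). The field names are guesses of mine, since the question does not say what every column means.

```python
import re

# Sample text copied from the question; real PDF output may differ.
pdf_text = """1 FGHI876 590843 BG 10.00 $25.00 $250.00
Organic Green Apples
2 POJQ3498 78654 BG 5.00 $25.00 $125.00
Fresh Ripe Oranges
3 MNGET4321 09473 BG 4.00 $8.00 $32.00
Frozen Angus Beef Burgers Grass Fed"""

# The expression from the reply, unchanged.
pattern = re.compile(
    r"^(\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(.*?)\s+(.*?)\s+(.*?)\s(.*)",
    re.MULTILINE,
)

# Hypothetical column names, one per capture group.
names = ["line_no", "item_code", "ref", "unit",
         "qty", "unit_price", "total", "description"]

# Build one dict per matched item.
rows = [dict(zip(names, m.groups())) for m in pattern.finditer(pdf_text)]

for row in rows:
    print(row)
```

One thing to watch: the single \s before the last group consumes exactly one character. If the text uses \r\n line endings, it would consume only the \r and leave the description group empty, so \s+ in that position may be safer.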
