Extract Specific Info from PDF

I am reading purchase orders and need to extract certain information from the PDF. I have attached the PDF example. For each ‘Item’ seen in the table at the bottom of the PDF I need to enter the item number into a website so I can add that item to a cart. Each PO could have a different number of Items that need to be added. I also need the quantity corresponding to each item number.

I was thinking of using Read PDF and then using RegEx but I am unsure what my expression would be or if that’s the best approach. Any advice is appreciated.
PO Example 2_Multi Line.pdf (92.7 KB)

Hi,

Hope the following sample helps you.

Sample20220113-1.zip (83.8 KB)

Regards,

Thank you, Yoichi! This is very helpful! Can you explain what each activity is doing so I can understand and learn from your sample? It is doing what I need it to do when I run it but I’d like to understand each step.

Hi @kasey.betts

You can learn how each activity is used from the UiPath Documentation linked below.

You can find a specific activity category by scrolling down the left side of that page (title: GETTING STARTED).

Hope this will be useful. Thank you.

I mean the pieces of the regex expressions, not the actual UiPath activities.

Hi,

Alright. Roughly, it works as follows.

System.Text.RegularExpressions.Regex.Matches(strTarget,"(?<=\n)\d{4}\s[\s\S]+?(?=\n\d{4}\s|\nTotal|\r?\n\r?\n)")

This regex extracts each string that starts with 4 digits at the beginning of a line and ends just before the next line starting with 4 digits, before “Total”, or before 2 consecutive line breaks. It returns the result as a MatchCollection.

Then iterate over that result using ForEach.
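To see what each piece of this pattern does outside UiPath, here is a minimal Python sketch of the same expression. The sample text is made up for illustration and is not taken from the attached PDF, so the real output of Read PDF Text may look different. The name str_target simply mirrors strTarget.

```python
import re

# Hypothetical text standing in for the output of Read PDF Text.
# The real PDF layout may differ.
str_target = (
    "Item Description Qty Unit Price\n"
    "0001 Widget A\n"
    "blue 10 Lot $5.00 $50.00 $0.00 $50.00 Base\n"
    "0002 Gadget B 3 Lot $2.00 $6.00 $0.00 $6.00 Base\n"
    "Total $56.00\n"
)

# (?<=\n)      lookbehind: the match must start right after a line break
# \d{4}\s      four digits followed by one whitespace character (the item number)
# [\s\S]+?     any characters, including line breaks, as few as possible
# (?=...)      lookahead: stop just before the next "\n" + 4 digits,
#              before "\nTotal", or before two consecutive line breaks
ROW_PATTERN = r"(?<=\n)\d{4}\s[\s\S]+?(?=\n\d{4}\s|\nTotal|\r?\n\r?\n)"

# One string per item row; a row may span several lines.
rows = [m.group(0) for m in re.finditer(ROW_PATTERN, str_target)]

for row in rows:
    print(repr(row))
```

With this sample, rows holds two strings, one per item, and the first one spans two lines because its description wraps.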

System.Text.RegularExpressions.Regex.Match(item.Value,"(?<C>\d+)\sLot\s(?<E>[.,\d\$]+)\s(?<G>[.,\d\$]+)\s(?<H>[.,\d\$]+)\s(?<I>[.,\d\$]+)\sBase",RegexOptions.RightToLeft)

This extracts the values of columns C, E, G, H and I into named groups. It uses the RegexOptions.RightToLeft option, which makes the search start from the end of the string, because column B seems to be free-format and I think it is better to extract it as whatever remains. Please note that this expression assumes “Lot” and “Base” always exist. If other words might appear in their place, we need to account for them.
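As a rough illustration of the named groups, here is a Python sketch run on a made-up row (not taken from the attached PDF). Two things differ from the .NET expression and are assumptions of this sketch: Python writes named groups as (?P<name>...), and Python's re module has no RightToLeft option, so the sketch takes the last left-to-right match as a stand-in. For a row with a single "Lot ... Base" section the result is the same.

```python
import re

# Hypothetical item row, as one element of the MatchCollection might look.
row = "0001 Widget A\nblue 10 Lot $5.00 $50.00 $0.00 $50.00 Base"

# (?P<C>\d+)          digits right before "Lot"         -> group C
# \sLot\s             the literal word "Lot" between whitespace
# (?P<E>[.,\d$]+)     a run of digits, ".", "," and "$" -> group E
# ... same for G, H and I ...
# \sBase              the literal word "Base" closes the match
COLUMN_PATTERN = (
    r"(?P<C>\d+)\sLot\s"
    r"(?P<E>[.,\d$]+)\s(?P<G>[.,\d$]+)\s"
    r"(?P<H>[.,\d$]+)\s(?P<I>[.,\d$]+)\sBase"
)

# Stand-in for RegexOptions.RightToLeft: keep the last match in the row.
matches = list(re.finditer(COLUMN_PATTERN, row))
m = matches[-1]

# Dictionary of group name -> captured text.
columns = m.groupdict()
print(columns)
```

If a row never contains "Lot" and "Base", matches is empty and matches[-1] raises an IndexError, which is the same assumption the original expression makes.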

item.Value.Substring(0,4) gives column A.

The remainder is column B. We can get it using the position (Index) and Length of the match from the previous step, here called m.

i=m.Captures(0).Index
item.Value.Substring(5,i-5)+item.Value.Substring(i+m.Captures(0).Length)
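The same slicing can be sketched in Python on a made-up row (not taken from the attached PDF). Here m.start() plays the role of m.Captures(0).Index and m.end() equals Index plus Length. The final strip() is an addition of this sketch to drop the space left in front of the matched part.

```python
import re

# Hypothetical item row; the description wraps onto a second line.
row = "0001 Widget A\nblue 10 Lot $5.00 $50.00 $0.00 $50.00 Base"

COLUMN_PATTERN = (
    r"(?P<C>\d+)\sLot\s"
    r"(?P<E>[.,\d$]+)\s(?P<G>[.,\d$]+)\s"
    r"(?P<H>[.,\d$]+)\s(?P<I>[.,\d$]+)\sBase"
)
m = re.search(COLUMN_PATTERN, row)

# Column A: the first four characters (the item number).
column_a = row[:4]

# Column B: everything between the item number and the matched part,
# plus anything that follows the matched part.
i = m.start()
column_b = row[5:i] + row[m.end():]

# Not in the original expression: trim the leftover whitespace.
column_b = column_b.strip()

print(column_a)
print(repr(column_b))
```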

Hope this helps you.

Regards,

@Yoichi has already explained the regex used in the flow.

To learn more about regex, please have a look at the link below.

Thank you

Thank you Yoichi for your explanation! Unfortunately they’ve given me a different PDF. I attached it here with the needed information highlighted. The number of items at the bottom of the PDF will vary, depending on what each person is ordering. The Item number (0001 and 0002 in this case) will always be 4 digits but the highlighted part number under description may vary.
PO Example 1_highlighted.pdf (991.4 KB)

Can you help me with this? Is regex still the best solution?
