I am new to UiPath and regular expressions. If anybody can help, I would highly appreciate it.
Customers send their orders as PDFs, which I convert to text so that I can extract the PO numbers. Each customer has their own way of writing the PO number.
Below is my RegEx:
(P.O. NUMBER|P/O Number|Purchase Order No.|P.O. NUMBER|PURCHASE ORDER NO.|PURCHASE ORDER NO|PO Number|Purchase Order|P.O.|PO\W)(?!.Box|BOX)(?!.Total|TOTAL)(\s?#?:?\s?)(.+)
1st Group - (P.O. NUMBER|P/O Number etc.) - matches the label in front of the number, such as PO Number.
2nd & 3rd Group - (?!.Box|BOX)(?!.Total|TOTAL) - negative lookaheads, so that PO Box and PO Total are not selected.
4th Group - (\s?#?:?\s?) - any separator after the label, like the # in PO Number # or the : in PO Number :
5th Group - (.+) - everything after the separator, which is the actual PO number, e.g. 1234
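For anyone following along, here is a rough sketch of how that breakdown maps onto capture groups. It uses Python's `re` module purely as an assumption for illustration (UiPath uses the .NET regex engine, which behaves the same way for this pattern as far as I know). The helper name `split_po_line` and the sample lines are made up from the formats mentioned above. Note that the two lookaheads do not capture anything, so the engine reports only three groups: label, separator and value.

```python
import re

# Pattern exactly as posted above (dots in the first group are NOT escaped).
PO_PATTERN = re.compile(
    r"(P.O. NUMBER|P/O Number|Purchase Order No.|P.O. NUMBER|PURCHASE ORDER NO."
    r"|PURCHASE ORDER NO|PO Number|Purchase Order|P.O.|PO\W)"  # group 1: label
    r"(?!.Box|BOX)(?!.Total|TOTAL)"                             # lookaheads, no capture
    r"(\s?#?:?\s?)"                                             # group 2: separator
    r"(.+)"                                                     # group 3: PO number
)


def split_po_line(line):
    """Return (label, separator, value) for one text line, or None if no match."""
    match = PO_PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


# Hypothetical sample lines in the formats described in the post.
print(split_po_line("PO Number: 123"))     # ('PO Number', ': ', '123')
print(split_po_line("PO Number # 1234"))   # ('PO Number', ' # ', '1234')
```

So in a tool that numbers only capturing groups, the PO number would be read from group 3 rather than group 5.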
The issue is that this regular expression also picks up a few unrelated words: PROX / PADUCAH / PAOUCAH.
These words are present in the PDF files, but I am not sure why they are getting selected.
I am passing each line of the PDF (one at a time) into the RegEx, so the input is nothing but a single line of text, like the ones I showed in the screenshot above. In text format, these are the lines:
CITY NAME, STATE 40051 * PAOUCAH, STATE H2003 *
PREPAID/ALLOWED 2% 10 PROX NET 45 DAYS
Oh, I was extracting text like PO Number: 123 or P/O No. #356.
All those PO numbers were getting selected, but some extra words were also getting selected, like the ones I mentioned (PROX / PAOUCAH etc.).
Figured it out after spending a whole day on it.
Thanks for the reply, though.
I am still testing the modified RegEx.
All it needed was a backslash "\" before every dot "." in Group 1, so that each dot matches a literal dot instead of any character.
Below is the new RegEx:
(P\.O\. NUMBER|P/O Number|Purchase Order No\.|P\.O\. NUMBER|PURCHASE ORDER NO\.|PURCHASE ORDER NO|PO Number|Purchase Order|P\.O\.|PO\W)(?!.Box|BOX)(?!.Total|TOTAL)(\s?#?:?\s?)(.+)
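To see why the backslashes matter, here is a small comparison sketch. It again assumes Python's `re` module rather than UiPath itself, and the names `UNESCAPED`, `ESCAPED` and `matched_label` are just illustrative. With an unescaped dot, the alternative `P.O.` means "P, any character, O, any character", which is what lets it match inside PROX and PAOUCAH in the two lines quoted earlier.

```python
import re

# Shared tail: the two lookaheads, the separator group and the value group.
TAIL = r"(?!.Box|BOX)(?!.Total|TOTAL)(\s?#?:?\s?)(.+)"

# Group 1 with plain dots: "." matches any single character.
UNESCAPED = re.compile(
    r"(P.O. NUMBER|P/O Number|Purchase Order No.|P.O. NUMBER|PURCHASE ORDER NO."
    r"|PURCHASE ORDER NO|PO Number|Purchase Order|P.O.|PO\W)" + TAIL
)

# Group 1 with escaped dots: "\." matches only a literal dot.
ESCAPED = re.compile(
    r"(P\.O\. NUMBER|P/O Number|Purchase Order No\.|P\.O\. NUMBER|PURCHASE ORDER NO\."
    r"|PURCHASE ORDER NO|PO Number|Purchase Order|P\.O\.|PO\W)" + TAIL
)


def matched_label(pattern, line):
    """Return the text captured by group 1, or None if the line does not match."""
    match = pattern.search(line)
    return match.group(1) if match else None


lines = [
    "CITY NAME, STATE 40051 * PAOUCAH, STATE H2003 *",  # from the post
    "PREPAID/ALLOWED 2% 10 PROX NET 45 DAYS",           # from the post
    "PO Number: 123",                                   # format from the post
]
for line in lines:
    print(matched_label(UNESCAPED, line), "|", matched_label(ESCAPED, line))
# PAOU | None
# PROX | None
# PO Number | PO Number
```

One thing that may be worth checking while testing: in this sketch a made-up line such as `P.O. BOX 450` still matches the escaped pattern, because `(?!.Box|BOX)` reads as ".Box or BOX" and the upper-case branch has no leading `.` to skip the space. `P.O. Box 450` is rejected as intended.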