To extract a field from unformatted document

Hi,

How to extract a field from structured documents when the format is not standard? For example, you will have customer number present anywhere and can be spelled as either customer# or customer number. Since position is not known, we cant scrape it. Please suggest your ideas.

RegEx is what we use.
We have some keywords in excel which get updated after every successive failure.
Eg. Customer# and “Customer Number” are tow keywords that are dynamically used to build the regex pattern.
when there is a case where the existing keywords do not fetch the value, it implies that a new variation is received. this new variation is added in the excel and next time onward it works. This way, over time the keywords get updated and the chances for failure gradually decrease.

But mine is a pdf document and not an excel. Is it possible to extract only the numeric portion in a variable? If we could solve it, then i believe i will be able to crack this. “\d” in Matches didnt help me with extracting the numeric part. Any other alternative is available?

Yes. Extract from a pdf based on regex where patterns are dynamically taken from an excel.
Is it possible for you to upload a sample?

Cannot upload due to highly sensitive data. Let me put it this way for better understanding. If a variable to going to get data dynamically say for example “abc1234 bc”, “45678”, “xyz1234”, “uvw 98766”. I need to extract only the numeric portion" alone.

If you just want to match every group of digits, you should use the pattern \d+. But if there are any other numbers in addition to the customer number anywhere in the text, these will all be matched. This is extremely likely to be the case, so then how do you know which is which?

If you want to get more specific matches, you’ll have to use as much context as you can to build the regular expression(s). Some things that can help:

  • Is a particular single string, such as “customer”, always guaranteed to appear somewhere in front of the number in the extracted text?
  • If not, could you exhaustively list strings that will be followed by the number you want?
  • Does the customer number always appear at the end of a line?
  • Are customer numbers of a particular format, for example always exactly 5 digits, or always between 4 and 6 digits?

Your example strings do not provide much context to go on, so if you can please give some samples of actual text you want to process. It is of course no problem to replace any sensitive data, as long as the replacements are of exactly the same format.

Also, I encourage you to test your own patterns against sample input here, to get a feeling for what you can do with regular expressions. This website has the same options and syntax that you can use in UiPath, and also has a handy syntax reference.

Hi,

Below are some formats in which the field will be populated.

a) Placed with: Address abc# 123456
b) COF# 5674545
c) ACCT# 7900001 MO

So, the customer number can be followed by keywords - “Placed with” or “ACCT” or “COF”. Special characters # may or may not be present. I need to extract the numeric data that follows one of the keyword and placed between 2 Spaces or the numeric data followed by “#” and spaces. For example in the below case, i need to extract the customer number 987654.

Placed with: Bentonville, AR-72712 987654 Date:08/01/17

Well, that’s quite challenging! :wink:

If you can be sure these are the only keywords you will ever need, you could include them explicitly in a single regex like (?<=(Placed with:|COF|ACCT).*?\s)\d+(?=\s|\Z). This one looks for a number that most directly follows either of those keywords, has whitespace before it and whitespace or end-of-string after it. It detects the right number in each of your examples, but it will fail if:

  • A different number surrounded by spaces appears before the customer number (a false positive result), such as if your last example had been written

    Placed with: Bentonville, AR 72712 987654 Date:08/01/17

  • There is a customer number but it’s preceded by a different key word/pattern (a false negative).

In order to have a maintainable solution to these errors, I strongly suggest you follow @akhi_s27’s advice and keep a list of strings or regex patterns in an external (Excel or CSV) file. A potentially useful trick: use String.Format to replace parts of a string or pattern. For example, have a string in your workflow like basePattern = (?<={0}.*?\s)\d+(?=\s) and form the regexes on the fly with pattern = String.Format(basePattern, myKeyword).

1 Like

Sorry for the delayed response. Thank you. It works.

1 Like