To extract a field from unformatted document

lissynikkytha · August 4, 2017, 4:04am

Hi,

How can I extract a field from structured documents when the format is not standard? For example, the customer number can appear anywhere and may be written as either customer# or customer number. Since its position is not known, we can't scrape it. Please suggest your ideas.

akhi_s27 · August 4, 2017, 5:41am

RegEx is what we use.
We keep some keywords in an Excel file, which gets updated after every failure.
E.g. Customer# and “Customer Number” are two keywords that are used to build the regex pattern dynamically.
When the existing keywords do not fetch the value, it implies that a new variation has been received. This new variation is added to the Excel file, and from the next run onward it works. This way, the keyword list grows over time and the chance of failure gradually decreases.
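A rough sketch of that loop in Python, just to show the idea (UiPath itself uses .NET regex). The keyword list stands in for the Excel column, and `build_pattern`, `extract_number` and the "Cust No" variation are made-up names for illustration:

```python
import re

# Hypothetical keyword list; in the approach above it would be read from the Excel file.
keywords = ["Customer#", "Customer Number"]

def build_pattern(keywords):
    # Longer keywords first, so a short keyword never shadows a longer one.
    ordered = sorted(keywords, key=len, reverse=True)
    # Escape each keyword so characters like '#' are matched literally.
    alternation = "|".join(re.escape(k) for k in ordered)
    # Keyword, optional colon and spaces, then the digits we want (group 1).
    return re.compile(r"(?:" + alternation + r")\s*:?\s*(\d+)", re.IGNORECASE)

def extract_number(text, keywords):
    match = build_pattern(keywords).search(text)
    return match.group(1) if match else None

text = "Cust No: 445566"                 # made-up variation not covered yet
print(extract_number(text, keywords))    # None, so a new keyword is needed

keywords.append("Cust No")               # the "update the Excel" step
print(extract_number(text, keywords))    # 445566
```

The pattern is rebuilt from the list on every call, so adding a keyword is the only change needed when a new variation shows up.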

lissynikkytha · August 9, 2017, 6:16am

But mine is a PDF document and not an Excel file. Is it possible to extract only the numeric portion into a variable? If we could solve that, then I believe I will be able to crack this. “\d” in Matches didn't help me extract the numeric part. Is any other alternative available?

akhi_s27 · August 9, 2017, 7:13am

Yes. We extract from a PDF using regex; only the patterns are taken dynamically from an Excel file.
Is it possible for you to upload a sample?

lissynikkytha · August 9, 2017, 1:27pm

Cannot upload due to highly sensitive data. Let me put it this way for better understanding. Suppose a variable is going to get data dynamically, for example “abc1234 bc”, “45678”, “xyz1234”, “uvw 98766”. I need to extract only the numeric portion.

sfranzen · August 9, 2017, 10:26pm

If you just want to match every group of digits, you should use the pattern \d+. But if there are any other numbers in addition to the customer number anywhere in the text, these will all be matched. This is extremely likely to be the case, so then how do you know which is which?
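To make that concrete, here is a small Python sketch (the `\d+` pattern behaves the same way in .NET for these inputs). The `digit_groups` helper and the last sample line are made up for illustration:

```python
import re

def digit_groups(text):
    # Return every run of consecutive digits found in the text.
    return re.findall(r"\d+", text)

for sample in ["abc1234 bc", "45678", "xyz1234", "uvw 98766"]:
    print(sample, "->", digit_groups(sample))
# abc1234 bc -> ['1234']
# 45678 -> ['45678']
# xyz1234 -> ['1234']
# uvw 98766 -> ['98766']

# Made-up line with more than one number: every group is returned.
print(digit_groups("Customer# 12345, Invoice 778, Page 2"))
# ['12345', '778', '2']
```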

If you want to get more specific matches, you’ll have to use as much context as you can to build the regular expression(s). Some things that can help:

Is a particular single string, such as “customer”, always guaranteed to appear somewhere in front of the number in the extracted text?
If not, could you exhaustively list strings that will be followed by the number you want?
Does the customer number always appear at the end of a line?
Are customer numbers of a particular format, for example always exactly 5 digits, or always between 4 and 6 digits?
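As an illustration of the last question, a length constraint alone can already discard many unrelated numbers. This Python sketch assumes, purely for the example, that customer numbers are 4 to 6 digits long; the sample text and the `numbers_of_expected_length` name are made up:

```python
import re

# Assumption for illustration: a customer number is 4 to 6 digits,
# not glued to other digits on either side.
LENGTH_PATTERN = re.compile(r"(?<!\d)\d{4,6}(?!\d)")

def numbers_of_expected_length(text):
    return LENGTH_PATTERN.findall(text)

text = "Page 2 of 10, Customer# 12345, Ref 1234567"
print(numbers_of_expected_length(text))   # ['12345']
```

The two lookarounds stop the pattern from picking a 4 to 6 digit slice out of a longer number such as 1234567.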

Your example strings do not provide much context to go on, so if you can, please give some samples of the actual text you want to process. It is of course no problem to replace any sensitive data, as long as the replacements have exactly the same format.

Also, I encourage you to test your own patterns against sample input here, to get a feeling for what you can do with regular expressions. This website has the same options and syntax that you can use in UiPath, and also has a handy syntax reference.

lissynikkytha · August 16, 2017, 8:45am

Hi,

Below are some formats in which the field will be populated.

a) Placed with: Address abc# 123456
b) COF# 5674545
c) ACCT# 7900001 MO

So, the customer number can be preceded by the keywords “Placed with”, “ACCT” or “COF”. The special character # may or may not be present. I need to extract the numeric data that follows one of the keywords and sits between two spaces, or the numeric data that follows “#” and spaces. For example, in the case below, I need to extract the customer number 987654.

Placed with: Bentonville, AR-72712 987654 Date:08/01/17

sfranzen · August 17, 2017, 12:30pm

Well, that’s quite challenging!

If you can be sure these are the only keywords you will ever need, you could include them explicitly in a single regex like (?<=(Placed with:|COF|ACCT).*?\s)\d+(?=\s|\Z). This one looks for the first number that follows any of those keywords, has whitespace before it and whitespace or end-of-string after it. It detects the right number in each of your examples, but it will fail if:

A different number surrounded by spaces appears before the customer number (a false positive result), such as if your last example had been written

Placed with: Bentonville, AR 72712 987654 Date:08/01/17

There is a customer number but it’s preceded by a different keyword/pattern (a false negative).
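For anyone who wants to try this outside UiPath, here is a Python sketch of the same idea. Python's `re` module does not accept the variable-width lookbehind that .NET allows, so this adaptation matches the keyword as an ordinary prefix and reads the number from a capture group; `customer_number` and the "Customer# 55555" line are made up for illustration:

```python
import re

# Adaptation of the .NET pattern: keyword as a normal prefix,
# the wanted digits in capture group 1.
PATTERN = re.compile(r"(?:Placed with:|COF|ACCT).*?\s(\d+)(?=\s|\Z)")

def customer_number(text):
    match = PATTERN.search(text)
    return match.group(1) if match else None

examples = [
    "Placed with: Address abc# 123456",
    "COF# 5674545",
    "ACCT# 7900001 MO",
    "Placed with: Bentonville, AR-72712 987654 Date:08/01/17",
]
for line in examples:
    print(customer_number(line))
# 123456
# 5674545
# 7900001
# 987654

# False positive: the ZIP code is now surrounded by spaces and comes first.
print(customer_number("Placed with: Bentonville, AR 72712 987654 Date:08/01/17"))  # 72712

# False negative: made-up line with a keyword that is not in the pattern.
print(customer_number("Customer# 55555"))  # None
```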

In order to have a maintainable solution to these errors, I strongly suggest you follow @akhi_s27’s advice and keep a list of strings or regex patterns in an external (Excel or CSV) file. A potentially useful trick: use String.Format to replace parts of a string or pattern. For example, have a string in your workflow like basePattern = (?<={0}.*?\s)\d+(?=\s|\Z), which keeps the end-of-string alternative so that a number at the very end of the text still matches, and form the regexes on the fly with pattern = String.Format(basePattern, myKeyword).
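The same placeholder trick looks like this in Python, using `str.format` where the workflow would use String.Format. This is only a sketch: the base pattern is adapted to a capture group (Python has no variable-width lookbehind) and keeps the end-of-string alternative from the single-regex version, and `KEYWORDS` stands in for the external Excel or CSV list:

```python
import re

# {0} is the placeholder that each keyword is substituted into.
BASE_PATTERN = r"{0}.*?\s(\d+)(?=\s|\Z)"

# Hypothetical keyword list; in practice it would come from the external file.
KEYWORDS = ["Placed with:", "COF", "ACCT"]

def find_number(text, keywords):
    for keyword in keywords:
        # Escape the keyword so it is matched literally inside the regex.
        pattern = BASE_PATTERN.format(re.escape(keyword))
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None   # no keyword matched: a candidate for a new list entry

print(find_number("COF# 5674545", KEYWORDS))       # 5674545
print(find_number("ACCT# 7900001 MO", KEYWORDS))   # 7900001
print(find_number("Invoice 123", KEYWORDS))        # None
```

One thing to watch with this approach: if the base pattern ever needs a quantifier such as \d{4,6}, the braces have to be doubled ({{4,6}}) so the formatter does not treat them as placeholders. The same holds for String.Format in .NET.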

lissynikkytha · September 6, 2017, 6:55am

Sorry for the delayed response. Thank you. It works.
