Extract specific String

Hi. I extracted all the text from a PDF file with a UiPath activity,
and I want to use regex to extract these items:

In each row, the first item is the index.
The second item, which is underlined, is the product name.
The third item is the quantity, and the fourth is the price per unit.
The fifth item, which is underlined, is the total price. The others are extra.

I need the second and fifth items.
If you have any good ideas, please share them. Thanks.

@ysshin.temp - Is it possible for you to share the text file?

@ysshin.temp - Please try this…

To extract the 2nd Item
image

To extract the 5th Item
image

Thanks, but the uploaded image was just one example.
Because this runs on an RPA machine, I need a common format that extracts the items from every row.

The second, third, fourth, fifth, and sixth items are variable (the sixth is an explanatory note, so it may or may not exist).
Only the first item (the index) is constant.

1 VMWare vCenter Server 6 SupportSubscription 1 1,800,000 1,800,000
2 VMWare vSphere 6 Enterprise Plus SupportSubscription 6 2,150,000 12,900,000
3 VMWare vSphere 6 Standard SupportSubscription 6 390,000 2,340,000
4 VMWare vSphere 6 Standard SupportSubscription 6 6 390,000 2,340,000 비고내용입니다. 2,340,000
100 VMWare vSphere 6 Standard SupportSubscription 6 6 100 100 2,340,000 비고입니다.

I have attached the example text again, so could you suggest another approach?
Thanks.

for 2nd item
(?<=^\d{1,}\s).*?(?=\s\d{1,}\s)
for 5th item
[\d,]+\r

You can get the matches as string lists like below.

txt <- source text

Assign activity
pattern1 = "(?m)(?<=^\d{1,}\s).*?(?=\s\d{1,}\s)"
pattern2 = "[\d,]+\r"

list1 = Regex.Matches(txt,pattern1).Cast(Of Match).Select(Function(m) m.Value).ToList
list2 = Regex.Matches(txt,pattern2).Cast(Of Match).Select(Function(m) m.Value).ToList
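
For anyone who wants to check these two patterns outside UiPath, here is a minimal sketch in Python rather than VB.NET. Python's `re` module rejects the variable-length lookbehind `(?<=^\d{1,}\s)`, so the first pattern is rewritten with a capture group; that rewrite is an assumption about equivalence and not part of the answer above. The sample text is the five rows posted earlier, joined with CRLF because the second pattern depends on `\r`.

```python
import re

# The five sample rows from the thread, joined with CRLF line endings
# (pattern2 relies on a trailing "\r", so the line ending matters).
rows = [
    "1 VMWare vCenter Server 6 SupportSubscription 1 1,800,000 1,800,000",
    "2 VMWare vSphere 6 Enterprise Plus SupportSubscription 6 2,150,000 12,900,000",
    "3 VMWare vSphere 6 Standard SupportSubscription 6 390,000 2,340,000",
    "4 VMWare vSphere 6 Standard SupportSubscription 6 6 390,000 2,340,000 비고내용입니다. 2,340,000",
    "100 VMWare vSphere 6 Standard SupportSubscription 6 6 100 100 2,340,000 비고입니다.",
]
txt = "\r\n".join(rows) + "\r\n"

# Capture-group rewrite of pattern1 (assumed equivalent to the lookbehind form).
pattern1 = r"(?m)^\d+\s(.*?)(?=\s\d+\s)"
# pattern2 is unchanged from the answer above.
pattern2 = r"[\d,]+\r"

list1 = [m.group(1) for m in re.finditer(pattern1, txt)]
# The trailing "\r" is stripped here only to make the values easier to read.
list2 = [m.group(0).rstrip("\r") for m in re.finditer(pattern2, txt)]

print(list1)
print(list2)
```

Run this way under Python, `list1` stops at the first standalone number inside the name (row 1 gives `VMWare vCenter Server`), and `list2` has four entries because the last row ends with note text instead of a number. It is worth re-checking both results in UiPath, since .NET may behave differently.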

Hi @park363 - With all due respect, it didn't pick up the value in the last row… the amount we need to capture there is 100 (as per the screenshot)…

image

I am still scratching my head to get this regex… it's very, very tricky…

Plus, for the first pattern, in the 4th and 5th rows we need to capture up to the 6, which is another tricky spot.

I came so close for the first pattern, but the 5th line is not perfect… check this…

image

Actually, a simple Split function in an Assign would do it.

Hi @moenk… which character would you split on in this case? See the user's requirement above… I am interested to see…

It's obviously separated by spaces, so if you split, you grab the 2nd, 3rd, and 5th items from the resulting array. Do this in a loop over an array of the lines; for this you need to split the entire text by CrLf first, of course.

@moenk… I am afraid it won't work… please look closely again at the 2nd element he asked for… it contains many spaces… see below

1 VMWare vCenter Server 6 SupportSubscription 1 1,800,000 1,800,000…
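
To make the objection concrete, here is a small sketch in Python (not the VB.NET expression language UiPath uses) that splits the first sample row on spaces and reads the items by fixed position. The variable names are illustrative only.

```python
# First sample row from the thread.
line = "1 VMWare vCenter Server 6 SupportSubscription 1 1,800,000 1,800,000"

# Split on single spaces, as proposed.
tokens = line.split(" ")

# Reading the "2nd" and "5th" items by array position.
second_by_position = tokens[1]
fifth_by_position = tokens[4]

print(len(tokens), second_by_position, fifth_by_position)
```

The row yields nine tokens instead of five, so position 1 holds only `VMWare` and position 4 holds the `6` from the product name, not the total price.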

Sorry, you are right, so we have to split on "SupportSubscription" first, but that makes it less elegant. I'd now check the process to see whether another format is available.
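
A rough sketch of that marker-based split, written in Python for illustration. It assumes every product name contains the literal `SupportSubscription`, which holds for the sample rows but may not hold for real data; `split_on_marker` and `MARKER` are hypothetical names.

```python
# Literal taken from the sample rows; an assumption for real data.
MARKER = "SupportSubscription"

def split_on_marker(line):
    """Split one row around MARKER and return (name, quantity, unit_price, total)."""
    head, _, tail = line.partition(MARKER)
    # Drop the leading index, keep the rest of the head as the start of the name.
    _index, _, name_start = head.partition(" ")
    name = (name_start + MARKER).strip()
    # The first three whitespace-separated fields after the marker.
    fields = tail.split()
    quantity, unit_price, total = fields[0], fields[1], fields[2]
    return name, quantity, unit_price, total

row1 = "1 VMWare vCenter Server 6 SupportSubscription 1 1,800,000 1,800,000"
row4 = "4 VMWare vSphere 6 Standard SupportSubscription 6 6 390,000 2,340,000 비고내용입니다. 2,340,000"

print(split_on_marker(row1))
print(split_on_marker(row4))
```

Row 1 comes out as expected, but for row 4 the `6` that belongs to the product name is read as the quantity, so the returned total is `390,000`. The split alone does not resolve rows whose name ends in a number.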

I did not check the last 2 lines.
It is very tricky.

Thanks for your help.

I will first keep and save the first through fifth items with a regex like this:
[0-9]+ [a-zA-Z0-9가-힣 ]+ [0-9]+ ([0-9]{1,3},)([0-9]{1,3}) ([0-9]{1,3},)([0-9]{1,3})

Then, I will use your idea.
It is tedious, but I don't know how to extract them all at once.

Thanks for your help.

Thanks for your help.

@ysshin.temp - If the 2nd string had ended consistently, as in the first 3 rows, I think it would be possible. Because it ends with a 6 followed by another integer, the pattern is tougher.
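
The difficulty can be illustrated with a short Python sketch that lists every way a row could be read as name, quantity, unit price, and total. It treats any three consecutive numeric tokens as a candidate; `candidate_splits` and `NUMERIC` are hypothetical helpers written for this illustration, not something from UiPath or the thread.

```python
import re

# Digits with optional thousands separators, e.g. "6" or "2,340,000".
NUMERIC = re.compile(r"^\d[\d,]*$")

def candidate_splits(line):
    """Return every (name, quantity, unit_price, total) reading of one row.

    A reading is any three consecutive numeric tokens, with everything
    between the index and those three tokens treated as the product name.
    """
    body = line.split()[1:]  # drop the leading index
    results = []
    for i in range(1, len(body) - 2):  # i >= 1 so the name is never empty
        triple = body[i:i + 3]
        if all(NUMERIC.match(t) for t in triple):
            results.append((" ".join(body[:i]), *triple))
    return results

row1 = "1 VMWare vCenter Server 6 SupportSubscription 1 1,800,000 1,800,000"
row5 = "100 VMWare vSphere 6 Standard SupportSubscription 6 6 100 100 2,340,000 비고입니다."

for reading in candidate_splits(row1):
    print(reading)
for reading in candidate_splits(row5):
    print(reading)
```

Row 1 has a single reading, while the last row has three, and the text alone does not say which of them is intended. That is why a positional regex cannot settle the last rows without some extra rule or a cleaner source format.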
