Best way to extract a field from a chunk of text?

I have a PDF document converted to raw text via Read PDF Text. I want to extract a particular field from the middle of the raw text.

E.g.

Some text here
Some text here
Buyer Address:
(Variable text to be extracted)
Seller Address:
Some text here
Some text here

Currently I am using two String.Split activities… It works, but it is not very elegant. Is there a way to do this with a single Substring activity instead?

Regex (regular expressions) is the best option here.

A simple pattern could be: Buyer Address (.*) Seller Address

What’s the full line of code to extract the field?

using Assign

Buyer_AD_AR = Regex.Matches(Output_Variable, "Buyer Address (.*) Seller Address")

where Buyer_AD_AR is of type System.Text.RegularExpressions.MatchCollection

(Make sure the System.Text.RegularExpressions namespace is imported in your project to use the above.)

Or you could use the Matches activity.

How do I extract the String which is the Output_Variable?

or do I need to convert System.Text.RegularExpressions.MatchCollection to String?

Output_Variable is your variable from Read PDF Text

Do I use a .ToString method on the MatchCollection type?

Hi @DEATHFISH,

Please refer to this post.

Regards,
Arivu

Yes

Once you have the value in Buyer_AD_AR, use another Assign:

Buyer_AD = Buyer_AD_AR(0).Groups(1).ToString

where Buyer_AD is of type String
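
To illustrate the two steps above (collect the matches, then read capture group 1 of the first match), here is a minimal sketch in Python rather than a UiPath Assign; the capture-group behavior is the same idea in .NET. The sample text is made up, and it is deliberately on a single line, because `.` does not match line breaks.

```python
import re

# Hypothetical single-line sample; in the workflow this would be the
# output of Read PDF Text.
text = "Invoice 001 Buyer Address 12 Example Road Seller Address 34 Sample Street"

# Step 1: collect all matches (the counterpart of Regex.Matches).
matches = list(re.finditer(r"Buyer Address (.*) Seller Address", text))

# Step 2: take group 1 of the first match (the counterpart of
# Buyer_AD_AR(0).Groups(1)), but only if there is a match at all.
buyer_address = matches[0].group(1) if matches else ""

print(buyer_address)
```

Group 0 would be the whole match including both labels; group 1 is only the part inside the parentheses.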

I remember one of the practice exercises used only one Assign activity with a Substring… what was the method used there?
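
I do not know which method the practice exercise used, but a plain substring approach usually means finding the positions of the two labels and slicing between them. A rough sketch in Python, with made-up sample text; in .NET the corresponding calls would be String.IndexOf and String.Substring.

```python
# Hypothetical sample shaped like the text in the question.
text = (
    "Some text here\n"
    "Buyer Address:\n"
    "12 Example Road\n"
    "Seller Address:\n"
    "Some text here\n"
)

start_label = "Buyer Address:"
end_label = "Seller Address:"

# Position of the start label, or -1 if it is missing.
start_pos = text.find(start_label)

# Search for the end label only after the start label.
end_pos = text.find(end_label, start_pos + len(start_label)) if start_pos != -1 else -1

if start_pos != -1 and end_pos != -1:
    # Slice between the end of the start label and the start of the end label.
    buyer_address = text[start_pos + len(start_label):end_pos].strip()
else:
    buyer_address = ""

print(buyer_address)
```

The check for -1 matters: if either label is missing, slicing with the raw positions would silently return the wrong text.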

Hi @DEATHFISH,

Please use an Assign activity: str_buyer_address = System.Text.RegularExpressions.Regex.Match(your_PDF_Text, "(?<=Buyer Address:)[\s\S]*(?=Seller Address:)").ToString
It returns all the text between “Buyer Address:” and “Seller Address:” as a String, including the line breaks on either side; append .Trim to strip them.
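
For anyone who wants to see what this pattern does on multi-line text, here is a small sketch of the same lookbehind/lookahead pattern in Python. The sample text is invented; the pattern itself is the one from the post above.

```python
import re

# Hypothetical multi-line sample shaped like the text in the question.
text = (
    "Some text here\n"
    "Buyer Address:\n"
    "12 Example Road\n"
    "Exampletown\n"
    "Seller Address:\n"
    "Some text here\n"
)

# (?<=...) and (?=...) require the labels to be present but keep them
# out of the match; [\s\S]* matches any character, line breaks included.
match = re.search(r"(?<=Buyer Address:)[\s\S]*(?=Seller Address:)", text)

# The raw match starts and ends with a line break, so trim it.
buyer_address = match.group(0).strip() if match else ""

print(buyer_address)
```

One thing to watch: `[\s\S]*` is greedy, so if “Seller Address:” appears more than once, the match runs to the last occurrence. Using `[\s\S]*?` instead stops at the first one.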

Warm regards,
Nimin

Ok how do I modify it to extract text within brackets?
E.g.

Service fees (Jan 2017 to Jun 2017)

I want to extract only the part which says “Jan 2017 to Jun 2017”

Does it work with brackets as well? Again, need this to work within one Assign activity.
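
It should work with brackets too, as long as the brackets are escaped, since `(` and `)` are special characters in a regex. A minimal sketch in Python using the example line from the question; the pattern is my own suggestion, not something from earlier in this thread. In a single Assign, the .NET form would presumably be along the lines of Regex.Match(your_text, "\(([^)]*)\)").Groups(1).Value.

```python
import re

line = "Service fees (Jan 2017 to Jun 2017)"

# \( and \) match literal brackets; ([^)]*) captures everything
# between them up to the first closing bracket.
match = re.search(r"\(([^)]*)\)", line)

period = match.group(1) if match else ""

print(period)
```

Using `[^)]*` rather than `.*` keeps the match from running on to a later closing bracket when a line contains more than one bracketed part.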

Ok I am using this expression in a different workflow, now it says “Regex is not declared. It may be inaccessible due to its protection level”. How do I resolve this?

Import the namespace System.Text.RegularExpressions in your project.

How do I import namespace?

Hi @DEATHFISH

This should help:

I am now encountering the following error:

18.4.0+Branch.master.Sha.b805b316b1c47ae06c0fe7e619b9c9f96e9e774c

Source: Assign

Message: Specified argument was out of the range of valid values.
Parameter name: i

Exception Type: System.ArgumentOutOfRangeException

An ExceptionDetail, likely created by IncludeExceptionDetailInFaults=true, whose value is:
System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values.
Parameter name: i
   at System.Text.RegularExpressions.MatchCollection.get_Item(Int32 i)
   at lambda_method(Closure , ActivityContext )
   at Microsoft.VisualBasic.Activities.VisualBasicValue`1.Execute(CodeActivityContext context)
   at System.Activities.CodeActivity`1.InternalExecuteInResolutionContext(CodeActivityContext context)
   at System.Activities.Runtime.ActivityExecutor.ExecuteInResolutionContext[T](ActivityInstance parentInstance, Activity`1 expressionActivity)
   at System.Activities.InArgument`1.TryPopulateValue(LocationEnvironment targetEnvironment, ActivityInstance activityInstance, ActivityExecutor executor)
   at System.Activities.RuntimeArgument.TryPopulateValue(LocationEnvironment targetEnvironment, ActivityInstance targetActivityInstance, ActivityExecutor executor, Object argumentValueOverride, Location resultLocation, Boolean skipFastPath)
   at System.Activities.ActivityInstance.InternalTryPopulateArgumentValueOrScheduleExpression(RuntimeArgument argument, Int32 nextArgumentIndex, ActivityExecutor executor, IDictionary`2 argumentValueOverrides, Location resultLocation, Boolean isDynamicUpdate)
   at System.Activities.ActivityInstance.ResolveArguments(ActivityExecutor executor, IDictionary`2 argumentValueOverrides, Location resultLocation, Int32 startIndex)
   at System.Activities.Runtime.ActivityExecutor.ExecuteActivityWorkItem.ExecuteBody(ActivityExecutor executor, BookmarkManager bookmarkManager, Location resultLocation)

This happens when I try to retrieve the string from the MatchCollection. I am sure that the text file I am searching contains the start delimiter, the end delimiter, and some text in between. What might be causing this? It worked fine when I tried it yesterday; I only encountered this today.

First activity: Regex.Matches(text, "Start text (.*) End Text")
Second activity: x(0).Groups(1).ToString

Check your regex pattern on a site like https://regexr.com/

Paste your full string and the pattern there, and you can confirm whether the pattern works or not.

If anything in your output string changed (such as the spacing between words, or their positions), the regex will fail to match. The MatchCollection is then empty, and reading item (0) from it throws exactly this ArgumentOutOfRangeException. So you need to design the regex so that it can handle all possible shifts in the string (at least all those you expect based on tests).
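
As a rough illustration of both points (tolerate spacing changes, and check for a match before reading it), here is a sketch in Python. The helper name `extract_between` and the sample strings are made up for this example.

```python
import re

def extract_between(text, start, end):
    """Return the text between two labels, or None when there is no match."""
    # re.escape makes the labels literal; \s* tolerates any amount of
    # whitespace (spaces or line breaks) around the captured value.
    pattern = re.escape(start) + r"\s*(.*?)\s*" + re.escape(end)
    match = re.search(pattern, text)
    # Guard: only read the group when the pattern actually matched.
    return match.group(1) if match else None

# Spacing differs from the "expected" layout, but the pattern still matches.
found = extract_between("Start text   value\nEnd Text", "Start text", "End Text")

# The end label is missing, so there is nothing to read.
missing = extract_between("Start text value", "Start text", "End Text")

print(found, missing)
```

In the workflow, the same guard would be a check that the MatchCollection's Count is greater than 0 before the second Assign reads item (0).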

How do I modify the regex to include line breaks? In (.*), the dot matches any character except line breaks.
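
Two common ways around this are a character class such as `[\s\S]` that matches everything, or a flag that makes the dot match line breaks as well. A small comparison in Python with invented sample text; the Python flag is `re.DOTALL`, and the .NET counterpart is RegexOptions.Singleline.

```python
import re

# Hypothetical sample with line breaks between the two labels.
text = "Start text\nline one\nline two\nEnd Text"

# Default behavior: the dot stops at line breaks, so there is no match.
no_flag = re.search(r"Start text(.*)End Text", text)

# Option 1: a flag that lets the dot match line breaks too.
with_flag = re.search(r"Start text(.*)End Text", text, flags=re.DOTALL)

# Option 2: a character class that matches any character at all.
char_class = re.search(r"Start text([\s\S]*)End Text", text)

print(no_flag is None)
print(with_flag.group(1).strip())
print(char_class.group(1).strip())
```

Both options capture the same text here, including the line breaks next to the labels, which is why the result is trimmed before use.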