Iterate Through Text File to Find and Replace Similar Words

I have a PDF file that I scrape in its entirety into a text string. I then parse out the data by labels so that I can capture the field information taken from the PDF. Examples of field captures: Company ID, Start Date, End Date, Address, Phone, Email, etc. For the most part my string manipulation and capture process is flawless as long as the “Labels” can be defined as unique.

Unfortunately, this PDF has 4 different labels all titled “Address:”. The problem I am running into is that even though I specifically identify where each “Address:” is located within the text stream, the robot stops at the first one and then captures the wrong information. I have been using Google OCR to do relative scraping, but that process is sketchy at best.

I have searched through the community help and tried several things I found, but nothing specific to resolving my issue. What currently happens when I use System.Text.RegularExpressions.Regex.Replace is that it replaces all the “Address:” labels with “Address_1”.

What I would like to do (if possible) is to iterate through the text stream, find the first instance of “Address:” and replace it with “Address_1”, the 2nd instance with “Address_2”, the 3rd instance with “Address_3”, etc., until all 4 instances of the word have been renamed. In so doing I can create a unique index by which my string capture will be able to find the correct addresses to pull data from.

Does anyone know if this can be done within a Text File?

Thank you in advance for your help!

You can possibly use the Click OCR Text activity. This allows you to indicate which occurrence of “Address” you want to replace.


If you were able to count the occurrences first and then use this activity in a Do While loop, you could increment through the occurrence number using a counter variable until all occurrences have been replaced.

@ronanpeter, Thanks for your reply! Yes… I was aware of this activity and tried it with no success. The initial text is scraped directly from a PDF into a String variable (and there are other OCR relative scrapes I have to do), but for the most part I am using a Matches activity with the following code: "(?<=" + "Audit By:" + ")(.*?)(?=" + "Audit Date:" + ")" to find the values between two labels. It works flawlessly… until I have to capture values between an “Address:” and “[Some other Label]”; then it fails. Since posting this info yesterday, I have been working on a VBA (or possibly VBS) script which will do the replacements. In testing it this morning, I was able to do this using the text scrape from the PDF.

I would really like to figure out how to find and replace only the first instance of “Address:” using a UiPath activity. I could then put that in a loop which would replace all the address labels, in order, with the new text. I don’t want to have to invoke code to do this, but right now that looks like the only option.
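For readers following along, the between-labels pattern quoted above can be reproduced outside UiPath to see why a repeated label causes trouble. This is a minimal Python sketch, not UiPath activity code; the helper name `between` and the sample text are made up for illustration, and `re.DOTALL` is an assumption so that values spanning line breaks are still matched.

```python
import re

def between(text, start_label, end_label):
    """Return the text between the first start_label and the next end_label."""
    # Same shape as the pattern in the post: lookbehind, lazy group, lookahead.
    pattern = "(?<=" + re.escape(start_label) + ")(.*?)(?=" + re.escape(end_label) + ")"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

# Hypothetical sample text with a repeated label.
sample = "Address: 1 Main St Phone: 555-0100 Address: 2 Oak Ave Phone: 555-0199"

# The search always starts at the first "Address:", so the second
# address can never be reached with this label alone.
print(between(sample, "Address:", "Phone:"))
```

Because the search anchors on the first matching label, every lookup with a duplicated label returns the same value, which is why the labels need to be made unique first.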

I would not recommend using OCR, as that takes a long time and is prone to failure. You can instead use Regex for the entirety of it. (edit incoming).

If you are determined to work the way you mentioned (first replacing each Address instance with its “nth” label, so Address_1, Address_2 … Address_n), then you can do so with IndexOf or with StringBuilder, as mentioned here: https://stackoverflow.com/questions/10194228/replacing-nth-occurrence-of-string

Otherwise, you can use Regex.Replace to replace the Nth occurrence, as mentioned here: https://stackoverflow.com/questions/27589325/how-to-find-and-replace-nth-occurrence-of-word-in-a-sentence-using-python-regula - if you use this method, you’d put it into a loop so the loop counter would be the number you include instead of hardcoding 0, 1, 2…n
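The loop idea above can be sketched roughly as follows. This is an illustrative Python version rather than UiPath or .NET code; the function name `number_labels`, the `stem` parameter, and the sample text are assumptions. It replaces only the first remaining match on each pass, so the loop counter becomes the suffix. It relies on the replacement text no longer containing the label (here “Address_1” has no colon), otherwise the loop would never end.

```python
import re

def number_labels(text, label="Address:", stem="Address_"):
    """Rename each occurrence of label to stem + its 1-based position."""
    pattern = re.compile(re.escape(label))
    n = 0
    # count=1 limits each pass to the first remaining match.
    while pattern.search(text):
        n += 1
        text = pattern.sub(lambda m: f"{stem}{n}", text, count=1)
    return text

# Hypothetical sample with four identical labels.
sample = ("Address: 1 Main St Phone: 555 Address: 2 Oak Ave "
          "Address: 3 Elm Rd Address: 4 Pine Ct")
print(number_labels(sample))
```

In .NET, the instance overload of Regex.Replace that takes a maximum replacement count should play the same role as `count=1` here, though I have not tested that inside a UiPath Assign.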

Also, to clarify: the whole issue with the PDF is that it is apparently a scanned copy, so the selection grabs the entire PDF document instead of individual elements, even when I make the required changes in Acrobat. The element selection only works for that session. If I close the PDF and reopen it, it goes back to being unable to find the elements I selected. Hence the reason I am using a text scrape and string manipulation to accomplish the goal.

@Dave… I most certainly agree, but in this particular PDF, the data I need is actually located underneath the labels instead of between two labels. So I have to do both text manipulation and OCR. Ugh.

If that’s the case (not on same line as the label), then I’d recommend splitting by new line and iterating through the newly created array and use index/loop counters and flags to know when to change items.

So it would find the label, flip a flag, and then, if the flag is true, update or grab the data.
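A rough sketch of that flag approach, written in Python for illustration only: the function name `values_under_label` and the sample text are hypothetical, and it assumes the value sits on the next non-empty line under the label.

```python
def values_under_label(text, label="Address:"):
    """Collect the first non-empty line that follows each label line."""
    values = []
    grab_next = False  # the flag: True once a label line has been seen
    for line in text.splitlines():
        stripped = line.strip()
        if grab_next and stripped:
            values.append(stripped)
            grab_next = False
        elif stripped.startswith(label):
            grab_next = True
    return values

# Hypothetical scrape where each value is underneath its label.
sample = "Address:\n12 Main St\nPhone:\n555-0100\nAddress:\n\n9 Oak Ave\n"
print(values_under_label(sample))
```

Since the values come back in document order, the position in the returned list identifies which “Address:” each one belongs to, without renaming the labels at all.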

Here is an example of the part of the PDF that requires OCR:

So in looking at your response above you state: “I’d recommend splitting by new line and iterating through the newly created array and use index/loop counters and flags to know when to change items.”

The data in the pic is being captured in the initial text scrape off the PDF… so… if it’s part of the Text Stream variable, I suppose I could convert the text to a CSV and then pull the values out of the array based on array number?

Yeah, this is vastly different than what I was expecting when you mentioned a PDF. Can this be sent in a different format? It seems strange that 1) it’s shown as an image and 2) that it’s coming in as a PDF at all. It looks like it could be provided as a flat file, CSV, or Excel file instead.

I had assumed it would not be in a nice table format like this, sorry. My recommendation assumed it would be a more randomized format with many more lines.

Just so I’m understanding could you have more than one row in that table? So there could be 2+ Qty, 2+ item/product, etc?

“Just so I’m understanding could you have more than one row in that table? So there could be 2+ Qty, 2+ item/product, etc?”

Yes… unfortunately, I haven’t gotten to how to handle that piece yet. Most of the robot actions for divvying out the variables were / are based on the PDF only containing a single row Qty. Recently, however, I was presented with a PDF that had two… so back to the drawing board on how to account for that scenario. But the main issue (keeping this on track with my original post) is being able to convert each “Address:” into its unique label.

“Can this be sent in a different format?”
Ha! In a perfect world… yes. But for now, I will be sent these PDFs internally by another department, and they contain client contractual data (which is the reason I can’t supply the entire PDF document online here). I will be pushing the information I capture to SQL Server for storage and retrieval purposes. And this is only a fraction of a very large overall UiPath process in which I am just in the beginning stages.

@Dave ~ Thanks for all your help and assistance here today… I think that if I have need of this, someone else could benefit from the discussion in the future.

This PDF is shown as an image, correct? So you’re using OCR to read the entire image and convert it to a string? If it is showing as text, I’d highly recommend using the Read PDF Text activity instead.

I don’t see Address anywhere in the image you posted. However, one way to do it (not necessarily the most efficient, but your examples are so small it doesn’t matter) is to split the entire text by pretty much anything (space, new line, whatever works for your case), replace the text within, then join it back into a single text.

The above would be something like this:

Read PDF into a single string called str1
Assign SplitString (of type String()) = str1.Split(" "c)
Assign LoopCount (of type Integer) = 1
For i = 0 To SplitString.Length - 1
    ' Write back by index: Replace returns a new string and does not change the original
    If SplitString(i).Contains("Address") Then
        Assign SplitString(i) = SplitString(i).Replace("Address", "Address_" + LoopCount.ToString)
        Assign LoopCount = LoopCount + 1
    End If
Next
Assign str1 = String.Join(" ", SplitString)

This will loop through everything and change it from Address to Address_n.
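For anyone who wants to try the split, replace, and join steps outside UiPath, here is a hedged Python rendering of the same idea. The function name `number_tokens` and the sample text are made up for illustration. Note that it only replaces the word “Address”, so a trailing colon stays attached to the new label, and any token that merely contains the word (for example “Addresses”) would also be renumbered.

```python
def number_tokens(text, word="Address"):
    """Split on spaces, number each token containing word, then rejoin."""
    tokens = text.split(" ")
    n = 0
    for i, token in enumerate(tokens):
        if word in token:
            n += 1
            # Strings are immutable, so write the result back by index.
            tokens[i] = token.replace(word, f"{word}_{n}")
    return " ".join(tokens)

# Hypothetical sample text.
sample = "Address: 1 Main St Phone: 555 Address: 2 Oak Ave"
print(number_tokens(sample))
```

Splitting and rejoining on the same single space leaves the rest of the text unchanged, so the existing between-labels matching can then target “Address_1:”, “Address_2:”, and so on.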
