How to Scrape, cut and filter

Hi,
I have a problem handling information scraped from a web page.

The web page contains various information, but I only need two details:
the name of the Project Manager, and the name of the uploaded offer.
I have no problem picking up the PM’s name.
But I find it very difficult to recognize and extract the name of the offer.

Unfortunately the scenarios are “infinite”.
The PM enters free text and has no precise rules to follow.

I only have a few “hooks” to recognize the name of the offer.
Often the name of the offer starts with one of these prefixes:

CMT_PR
MKT
OPE
OV2
REYIT
QUO
OPX
OFF
OFFER
After these prefixes comes a - or a _ or a space, and then a variable number of characters.
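
The rule described above (a known prefix, then one separator, then a variable run of characters) can be sketched as a regular expression. This is only an illustration in Python, not part of the original workflow; the function name `find_offer` and the assumption that the offer name ends at the next whitespace are hypothetical.

```python
import re

# Prefixes from the list above; longer ones first so "OFFER" is tried before "OFF".
PREFIXES = ["CMT_PR", "REYIT", "OFFER", "MKT", "OPE", "OV2", "QUO", "OPX", "OFF"]

# Prefix, one separator (-, _ or a space), then everything up to the next whitespace.
# Assumption: the offer name itself contains no further spaces.
OFFER_RE = re.compile(r"\b(?:" + "|".join(PREFIXES) + r")[-_ ]\S+")


def find_offer(text):
    """Return the first substring that looks like an offer name, or None."""
    match = OFFER_RE.search(text)
    return match.group(0) if match else None
```

With this sketch a plain memo line yields `None`, so free-text notes without a prefix are skipped.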

Below is the Excel file with the complete extraction of the web page.
The name of the offer is contained in row 22.
Is there any way to handle this situation?

Test.xlsx (8,9 KB)
Thanks…

Hi
May I know which activity you were having trouble with while extracting the offer details?
Did you try the FIND CHILDREN activity with descendants as the scope, so that you get the child element details, and then read the UI element attribute aaname with the GET ATTRIBUTE activity?

We can then split the string, or use string manipulation, to get the offer value we want from it.

Cheers @AaronMark

The Excel file holds all of the PM’s entries in a single cell.
I don’t know how to cut the cell text so as to isolate only the offer name.
If you look at line 22 of the Excel file, it is perhaps clearer.
Anything can be written in line 22,
including operational memos and the PM’s notes.
So, out of all the lines he writes, I have to identify the name of the offer, relying on the abbreviations I listed.

Fine
I thought you were still scraping the data.

As you already have the data in an Excel file, this can be handled easily with the steps below:
–use an Excel Application Scope and pass the file path of the Excel file as input
–inside the scope use a READ RANGE activity and get the output dt
–then use a FOR EACH ROW activity and pass dt as input
–inside the loop use an IF activity with a condition like this
System.Text.RegularExpressions.Regex.IsMatch(row("your columnname").ToString,"(?=CMT_PR|MKT|OPE|OV2|REYIT|QUO|OPX|OFF|OFFER).*(?=\.pdf)")

If it is true, the flow goes to the THEN part, where you use an ASSIGN activity to get the offer name like this

str_offername = System.Text.RegularExpressions.Regex.Match(row("your columnname").ToString,"(?=CMT_PR|MKT|OPE|OV2|REYIT|QUO|OPX|OFF|OFFER).*(?=\.pdf)").ToString.Trim
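
One detail of a lookahead pattern of this shape is worth checking: an unescaped `.` in front of `pdf` matches any character, not only a literal dot. The sketch below uses Python's `re` module only to illustrate the difference; the helper `extract` and the sample strings in the usage note are made up for the example.

```python
import re

# Same alternation as in the expressions above.
PREFIXES = r"CMT_PR|MKT|OPE|OV2|REYIT|QUO|OPX|OFF|OFFER"

# "." before "pdf" left unescaped: any character followed by "pdf" ends the match.
UNESCAPED = re.compile(r"(?=" + PREFIXES + r").*(?=.pdf)")
# "\." requires a literal dot, i.e. a real ".pdf" extension.
ESCAPED = re.compile(r"(?=" + PREFIXES + r").*(?=\.pdf)")


def extract(pattern, text):
    """Return the trimmed match of `pattern` in `text`, or None if there is none."""
    match = pattern.search(text)
    return match.group(0).strip() if match else None
```

For a string that really ends in `.pdf` both variants return the same name; for a hypothetical line such as `OFFER_1234 see Xpdf` only the unescaped variant produces a match.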

Cheers @AaronMark

Hmm … it seems to work.
With a few minor changes, it started working.
Can I ask another question,
to make my work more precise?

Is there a way to prioritize which of the collected data is chosen?

image
I tested another scenario.
The flow, rightly, takes the first field that contains one of the prefixes we specified.

In the example, that is the attachment:
OFFERTA_TURNKEY_PROJET_VMWARE_EVENTO_COUPA_4977_LG_R1_signed

Test.xlsx (10,0 KB)

But the main name of the offer would be the one shown in the image, in the second cell:
“Offerta Nr:…LG/1…”

How can I identify it?
The difference is that all the “blue” text is a file name (mainly .pdf).
The rest are text strings.

In the file, line 22, you can see how the extraction is written out in full.
My question then is:
is there a way to search first the lines that are not .pdf (or another extension),
and only if nothing is found, then also search the file lines?
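
A minimal sketch of that two-pass priority, written in Python purely as an illustration. The function name `pick_offer`, the list of file extensions, and the assumption that the entries of the cell are separated by line breaks are all hypothetical.

```python
import re

# A line is a candidate if it contains one of the known prefixes.
OFFER_RE = re.compile(r"(?:CMT_PR|REYIT|OFFER|MKT|OPE|OV2|QUO|OPX|OFF)")
# Assumed list of extensions that mark a line as an attached file.
FILE_EXT_RE = re.compile(r"\.(?:pdf|docx?|xlsx?)$", re.IGNORECASE)


def pick_offer(cell_text):
    """Prefer candidate lines that are plain text; fall back to file names."""
    text_hits, file_hits = [], []
    for line in cell_text.splitlines():
        line = line.strip()
        if not OFFER_RE.search(line):
            continue  # memo or note without any known prefix
        if FILE_EXT_RE.search(line):
            file_hits.append(line)
        else:
            text_hits.append(line)
    if text_hits:
        return text_hits[0]
    if file_hits:
        return file_hits[0]
    return None
```

Collecting both kinds of hit in one pass keeps the order of the cell intact, and the file names are only consulted when no plain-text candidate exists.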

Hi Palani…
your flow works.
I have adapted it to incorporate most of the scenarios.
I have another couple of questions. (ignore the previous one - I solved it).

1: Can the list of words we included be made case-insensitive?
Since the text is free-form, at the discretion of the Project Manager, variations of the strings may be used.

2: I can’t handle the “else” scenario. That is, the Project Manager may not enter the name of the offer, so the data would be “missing”.
In these cases, I would like the bot to write “empty” or “none” instead.
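
For question 1, regular expression engines usually offer an ignore-case switch. In .NET this is the `RegexOptions.IgnoreCase` argument of `Regex.IsMatch` and `Regex.Match`; the sketch below shows the same idea with Python's `re.IGNORECASE`, only as an illustration. The helper `has_offer` and the word boundary `\b` are assumptions added for the example.

```python
import re

# re.IGNORECASE makes "offer", "Offer" and "OFFER" all match.
# The leading \b is an added safeguard: without it, case-insensitive matching
# would also accept "ope" inside an ordinary word such as "hope".
OFFER_RE = re.compile(
    r"\b(?:CMT_PR|MKT|OPE|OV2|REYIT|QUO|OPX|OFFER|OFF)[-_ ]\S+",
    re.IGNORECASE,
)


def has_offer(text):
    """True if the text contains a prefix (any casing) followed by a separator."""
    return OFFER_RE.search(text) is not None
```

Ignoring case widens the net, so short prefixes such as OPE or OFF become more likely to hit normal words; anchoring them at a word boundary is one way to limit that.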

If I do this:
image

the bot always follows the “Empty” path, because the check is inside a For Each Row loop and there are always some lines where the offer is missing. So that’s not the correct way.
Hmm … any idea?
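
One common way around this is to set the fallback value once before the loop and overwrite it only when a row matches, so that rows without an offer no longer trigger the “Empty” branch. The sketch below shows that logic in Python as an illustration; it assumes the workflow in the image writes “Empty” from an Else branch inside the loop, and the name `offer_or_default` is hypothetical.

```python
import re

OFFER_RE = re.compile(
    r"\b(?:CMT_PR|MKT|OPE|OV2|REYIT|QUO|OPX|OFFER|OFF)[-_ ]\S+",
    re.IGNORECASE,
)


def offer_or_default(rows, default="empty"):
    """Return the first offer found in any row, or `default` if no row matches."""
    offer = default  # assigned once, before the loop
    for row in rows:
        match = OFFER_RE.search(row)
        if match:
            offer = match.group(0)  # overwrite only on a hit; there is no else branch
            break  # stop at the first hit
    return offer
```

In workflow terms this would mean an Assign with the default before the For Each Row, and an If with only a Then branch inside it.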