Data Scraping_Multiple Questions


#1

Hello All,

I am currently navigating to a health website and scraping MD names and other pertinent info.

My question has three parts.

1.) When I extract a large number of names, the scrape sometimes misses part of the data on a page. I found that I can work around this by setting a fixed delay in the ‘DelayBetweenPages’ property and changing ‘WaitForReady’ to Complete.

Keep in mind that there can be dozens of pages. Is there another way to require that a page be fully loaded, and all of its line items extracted, before moving on? A fixed delay takes too long.

2.) For some reason, the column names are not included by the Append to CSV activity, only the line-item content. Is there a workaround for this?

3.) Finally, there is some formatting I want to apply to some of the rows BEFORE appending to the CSV. Is this possible?

One of the columns contains a phone number, but the bot also scrapes the ‘Phone’ label next to the actual number.

For instance, it pulls ‘(###)-###-#### Phone’. Can I trim off the ‘Phone’ text before writing to the CSV?

Thanks again for all the help.


#2
  1. Can you check for a visible image (a logo, etc.) using Image Exists to validate the page load?

  2. Maybe you can use “Build Data Table” and add the columns before adding the web content?

  3. Yes, you would be able to manipulate the data using Substring, Split, etc.
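
To illustrate the string manipulation mentioned in the last point, here is a minimal sketch written in JavaScript only to show the logic. In a workflow the same idea would be a Replace or Substring expression on the column value. The function name `cleanPhone` and the sample number are made up for illustration.

```javascript
// Hypothetical helper: removes a trailing "Phone" label that was scraped
// together with the number, then trims leftover whitespace.
function cleanPhone(raw) {
  // \s*Phone\s*$ matches the label only at the end of the value,
  // so digits and punctuation in the number itself are untouched.
  return raw.replace(/\s*Phone\s*$/i, "").trim();
}

// Example with a made-up number in the format from the first post.
console.log(cleanPhone("(214)-555-0100 Phone"));
```

Anchoring the pattern to the end of the string is a deliberate choice: a plain "remove every occurrence of Phone" could also damage other text in the cell.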

#3

Hi vvaidya,

1.) Is the ‘image visibility’ check the most reliable method to use? Would I run it before the extraction and put the extraction inside an If, so that it only extracts when the visibility output is True?

2.) It's odd that the output sometimes includes the header names and sometimes doesn't. What would be the method for doing this?
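
One possible way to make the header behaviour predictable, sketched here in JavaScript purely as logic: write the header row yourself when the file is still empty, and append only data rows afterwards. The names `toCsvLine` and `appendWithHeader` are hypothetical, and this does not describe how the Append to CSV activity works internally.

```javascript
// Hypothetical helper: turns one row into a CSV line, quoting fields
// that contain commas, quotes, or line breaks.
function toCsvLine(fields) {
  return fields
    .map((f) => {
      const s = String(f);
      return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
    })
    .join(",");
}

// Hypothetical helper: returns the new file text. The header is written
// only when the existing text is empty, so it appears exactly once.
function appendWithHeader(existingText, header, rows) {
  const isEmpty = existingText.trim() === "";
  const lines = rows.map(toCsvLine);
  if (isEmpty) lines.unshift(toCsvLine(header));
  let prefix = isEmpty ? "" : existingText;
  if (prefix !== "" && !prefix.endsWith("\n")) prefix += "\n";
  return prefix + lines.join("\n") + "\n";
}
```

Calling `appendWithHeader` once per scraped page keeps a single header at the top, no matter how many pages are appended.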


#4

There could be better solutions to validate the page load. You can try the Inject JS Script activity to check whether the page is loaded or not.

For example: if (document.readyState === "complete") { /* loaded */ }

I’m not sure if it works, just a thought.
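
A slightly fuller sketch of the readyState idea above. As far as I know, the Inject JS Script activity expects the script as a function that takes (element, input) and returns a string, and its script field is a VB expression, so raw JavaScript would need to be passed as a string or loaded from a .js file. Both points are assumptions worth checking against the activity documentation. The name `pageIsLoaded` is hypothetical and is there only so the snippet can run on its own.

```javascript
// Hypothetical name; inside the activity the function would likely be anonymous.
// `element` and `input` are the arguments the activity is assumed to pass in.
const pageIsLoaded = function (element, input) {
  // document.readyState is "loading", "interactive", or "complete".
  const loaded = document.readyState === "complete";
  // Return a string so the workflow can compare the result with "true".
  return String(loaded);
};
```

The workflow could then loop until the returned value is "true" before starting the extraction.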


#5

Hmm, interesting thought; however, I have never done this.

How would you include this in the sequence? Would I add the ‘Inject JS Script’ activity before the actual Extract Structured Data activity and put the above statement in its script field?

I tried that and it gave me an error saying ‘End of expression expected’.


#6

Hi vvaidya,

I am trying to extract from this site: https://connect.werally.com/plans/uhc

Once I navigate here, I click on ‘All United HealthCare Plans’, then choose ‘Choice Plus’ on the next page, enter ‘75202’ in the Zip Code prompt that appears, and press ‘Continue’. Finally, I type ‘Family Practice’ into the search box and click the ‘Family Practice - Specialities’ option. This brings back the list on which I run the data scraping tool. Depending on the search radius, there can be more results, and I notice that the more results there are, the longer I need to set ‘DelayBetweenPages’ in order to pull all of them.

I have reached out to UiPath and they don’t have an answer on how to extract all results. I don't want to open another post here, but this is important and I think you may be the only one with an idea of how to do it.

Can you help me out?


#7

I will try and let you know.


#8

Why not WaitForReady = Complete?


#9

@badita He mentioned that WaitForReady did not work for his page, so I was suggesting some alternatives.

I will try to scrape his website when I have time and see how it goes.


#11

Badita,

I have already tried this approach. However, it doesn’t extract all of the information, so I am trying to find alternative solutions.

The only solution that seems to work consistently is setting the DelayBetweenPages property high enough for the page to load before the data is extracted. You would expect WaitForReady to solve this, but I haven’t had any success thus far.


#12

Thanks vvaidya,

You are indeed correct. I would have assumed WaitForReady would do the trick; however, I haven’t had any consistent luck.

Perhaps it's the actual framework of the website? It seems that paging only swaps the contents of a container that houses all the data, while the main page itself doesn’t change. So WaitForReady may actually be working, but it isn't watching the right element?

Just talking out loud at this point because I am a little lost on this one.

Thanks for all the help.
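
If the container theory above is right, a page-level readiness check would report ready while the rows are still arriving. One alternative, sketched here in JavaScript as logic only, is to poll the number of result rows and continue once the count stops changing. The names `isStable` and `pollUntilStable` are hypothetical, and `readCount` stands in for whatever returns the current row count. The wait between polls is left out to keep the sketch short.

```javascript
// True once the last `requiredRepeats` samples are identical and non-zero.
function isStable(samples, requiredRepeats) {
  if (samples.length < requiredRepeats) return false;
  const tail = samples.slice(-requiredRepeats);
  return tail[0] > 0 && tail.every((n) => n === tail[0]);
}

// Calls `readCount` up to `maxPolls` times. Returns the settled row count,
// or -1 if the count never settled within the allowed number of polls.
function pollUntilStable(readCount, requiredRepeats, maxPolls) {
  const samples = [];
  for (let i = 0; i < maxPolls; i++) {
    samples.push(readCount());
    if (isStable(samples, requiredRepeats)) {
      return samples[samples.length - 1];
    }
  }
  return -1;
}
```

Compared with a fixed DelayBetweenPages, this only waits as long as a given page actually needs, at the cost of a few extra reads per page.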