Issues with data scraping a website

Hi, I am trying to scrape data from a website and the results are unreliable. It does not seem to extract all the rows from the website’s datatable consistently.

This is the output I am generating:
[screenshot]

“Counters” is the same as the number of rows in the website’s datatable.

Customer “Stadium Sverige” should equal 165 counters (rows), not 100.
My first thought was that it returns 100 rows because each page is limited to 100 rows and the data scraping activity needs to press a “next” button.
But this does not seem consistent, because the customer “Apoteket” should equal 549.

The customer “KICKS” should equal 293 but it didn’t scrape any rows?

And finally, the customer “CC Gjövik” is correct at 84 counters (rows).

What could be the cause of this inconsistent behavior?

I have set the delay between pages to 15000 ms.

Hi @Eric46,

What is the URL of the webpage?
I tried the URL from your output, but it asks me to log in and there is no signup.

The webpage is unfortunately a company work page behind restricted access. What kind of information were you looking for? I don’t know how to proceed with debugging this.

Show me the webpage as well as the scraped elements.

@Eric46
could it be the case that the Extract Data activity is still set to the default MaxNumberOfResults = 100?
[screenshot]

As mentioned in the hover text, set it to 0.
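
For reference, a rough sketch of how those properties sit on the Extract Structured Data (ExtractData) activity in the workflow XAML; the attribute names are from my version and may differ in yours, and the NextLinkSelector value is just a placeholder:

  <!-- MaxNumberOfResults="0" removes the default 100-row cap -->
  <ui:ExtractData MaxNumberOfResults="0"
                  DelayBetweenPagesMS="15000"
                  NextLinkSelector="[your Next-button selector]" />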

Scraped elements are the rows in the table, like:
567 - Extern modem 001 - Counter Irisys O IO v201908.2 Europe/Stockholm

No, unfortunately not, because for example for the customer “Apoteket” it grabs 195/549 rows.
And for the customer “KICKS” it grabs 0/293.

Does your “Next” selector work reliably for pagination?

As it just has Previous and Next buttons, you can keep clicking until Next becomes disabled.
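
A rough sketch of what that check could look like, assuming the button exposes its name and a disabled state; the attribute values here are guesses, so verify them in UiExplorer:

  <!-- hypothetical Next-button selector; read its 'disabled' (or 'aria-disabled')
       attribute with Get Attribute before each click, and stop once it is set -->
  <webctrl tag='A' aaname='Next' />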

Also check for any static data in the Attach Browser selector used by Data Scraping.
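
By “static data” I mean run-specific values baked into the selector, such as a hard-coded page title or URL. A hypothetical example (the title text is invented for illustration):

  <!-- brittle: a specific customer ID is baked into the title -->
  <html app='chrome.exe' title='Counters - Customer 1066' />

  <!-- better: wildcard the part that changes between runs -->
  <html app='chrome.exe' title='Counters - Customer *' />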

Does it always return the same number of counters? Or does it vary? For example, is Stadium Sverige always 100, or sometimes 110, 99, 125, etc?

One thing that came to mind is that maybe there are duplicate rows that are being overwritten during extraction? For example if your datatable says “Facility” should be unique, and it isn’t, maybe the values are overwritten the second time a row with that value is found?

You could try to run it two times, output the results to Excel, and compare to see if the results are identical. This won’t answer your question, but could help in determining the underlying cause.

I made 3 runs under the same conditions and ended up with the following:

Customer (website datatable)   Counters run 1   Counters run 2   Counters run 3   Actual rows
Stadium Sverige                             0              100              100           165
Phone House                                 0               93               93            93
Apoteket                                   95              200              200           549
KICKS                                     395              295              380           293
CC Gjövik                                  84               84               84            84
Jula AB                                   100              100              100           118

So there are three errors happening:

  1. No rows were retrieved at all.
  2. Rows on the next page were not retrieved (the webpage maxes out at 100 rows per page).
  3. More rows were retrieved than are actually present (KICKS, runs 1 & 3).

Well this is certainly one of the weirder issues I’ve seen lately :sweat_smile: What bothers me is that the numbers vary from run to run. That excludes so many things that you’d usually suspect to be the problem…

Have you checked for this?

And can you check whether the results still vary if you use slow step or manual stepping? Perhaps giving the page more time to load before extraction might make a difference?

Update: I made a new blank process and started data scraping only one customer. The results were successful on the first run, but then it started to return 100 rows again.
I think the selector for the Next button is the issue. I used UiExplorer to improve the selector, but that did not solve the problem. I suspect it’s not working because the Next button is located at the bottom of the page.

How can I improve the selector?
Right now it looks like this:
<webctrl css-selector='body>main>div>div>div>div>section>div>div>div>div>div>div>div>div>ul>li>a' parentid='page-content' tag='A' idx='7' />

I don’t understand this sentence: “check for any static data in attach browser in Data Scraping”.
What do you mean by static data?

The Attach Browser activity looks like this:
[screenshot]

In UiExplorer it looks like this:
[screenshot]

The URL attribute is unchecked, but can it still cause a problem? The URL should not be /1066 (this is the customer ID).

Update:

I discovered that if I run the process normally, it returns 100 rows.

If I don’t close the web page opened by the previous run, it returns the correct number of rows.
This means that if I add wait time after opening the browser, it might work correctly; I will try this.

That selector looks problematic indeed. The idx='7' will change if there are more or fewer tag='A' elements on the page. For example, if there are only 1 or 2 pages, this idx will be different as well.

If you open it in UiExplorer, you should look for attributes that are specific to it, for example class='btn_next' or something similar; something that refers only to this specific button. You can also check whether any of the elements above it contain a specific attribute you can use. If you select one that is unique, you will see the idx='7' attribute disappear from the selector.
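
For example, if the button had such a class, the selector could shrink to something like this (the class name is invented here, check the real one in UiExplorer):

  <!-- a unique attribute on the button itself makes idx unnecessary -->
  <webctrl tag='A' class='btn_next' />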

In my example below, only the TABLE tag isn’t unique, so it shows idx='2' (the second table on the page). By adding the parent, it becomes unique (there is only one TABLE inside a DIV that has the class 'dataTables_scrollBody').

Hope this helps, sorry if I over-explained…
Weak selector:
[screenshot]

Strong selector:
[screenshot]

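Since the screenshots may not come through, the two selectors look roughly like this (a reconstruction of my example, not copied from it):

  <!-- weak: relies on the table’s position among all tables on the page -->
  <webctrl tag='TABLE' idx='2' />

  <!-- strong: anchored to a unique parent, so the idx attribute disappears -->
  <webctrl parentclass='dataTables_scrollBody' tag='TABLE' />
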
Thank you, this explanation helped a lot!

Update:
After fixing the selector and adding an “On Element Appear” activity to wait for the page to load properly and avoid reading 0 rows, I have a new issue.

The “On Element Appear” activity correctly identifies the table and runs the actions inside it (I can see the pages being looped and scraped), but then I receive the error: “On Element Appear ‘TH’: Activity timeout exceeded”.

I have these properties for the activity:
[screenshot]

How do I proceed with this issue? I don’t understand the error, as the activity seems to run correctly; it even flashes the element with a red box, since it is present.

The “Element Exists” activity does not return the error, but it also did not trigger properly when the table was empty. How can I handle this type of issue?

I currently have this workflow:
[screenshot]

The “On Element Appear” activity is considered “running” while all actions inside it are being performed. If the time this takes exceeds the timeout (usually 30 or 60 seconds by default), the activity will fail. You can edit the timeout property (in milliseconds) shown in your screenshot to play around with this.
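
For a rough idea of where that property lives in the workflow XAML (activity and property names as I remember them, so double-check in your version):

  <!-- give the activity enough time for everything nested inside it to finish;
       the value is in milliseconds, e.g. two minutes here -->
  <ui:OnUiElementAppear TimeoutMS="120000" />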

This depends on which element you are checking. If the <TABLE> tag exists but, for example, no <TR> or <TH> does, then your Element Exists might return True for <TABLE> while your scraping still fails because there is no data. If you can observe which element is missing when the table loads incompletely, that should be the element you check for.
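
For instance, instead of waiting for the table shell, you could point Element Exists (or On Element Appear) at the first data row. A sketch, reusing the DataTables-style parent from my earlier example; your page’s attributes may differ:

  <!-- matches only once at least one data row has actually rendered -->
  <webctrl parentclass='dataTables_scrollBody' tag='TR' idx='1' />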