Hi, I am trying to scrape data from a website, but the results are unreliable: the extraction does not consistently pull all the rows from the website’s data table.
This is the output I am generating:
“Counters” equals the number of rows in the website’s data table.
Customer “Stadium Sverige” should equal 165 counters (rows), not 100.
My first thought was that it returns 100 rows because each page is limited to 100 rows and the data scraping activity needs to press a “next” button.
But this does not seem consistent, because the customer “Apoteket” should equal 549.
The customer “KICKS” should equal 293, but it did not scrape any rows at all.
And finally, the customer “CC Gjövik” is correct at 84 counters (rows).
What could be the cause for this inconsistent behavior?
The webpage is unfortunately a company work page behind restricted access; what kind of information were you looking for? I don’t know how to proceed with debugging this.
Does it always return the same number of counters, or does it vary? For example, is Stadium Sverige always 100, or sometimes 110, 99, 125, etc.?
One thing that came to mind: maybe there are duplicate rows that are being overwritten during extraction? For example, if your data table expects “Facility” to be unique and it isn’t, the values may be overwritten the second time a row with that value is found.
You could try running it twice, outputting the results to Excel, and comparing to see whether the results are identical. This won’t answer your question directly, but it could help determine the underlying cause.
Well, this is certainly one of the weirder issues I’ve seen lately. What bothers me is that the numbers vary from run to run; that rules out many of the things you’d usually suspect to be the problem…
Have you checked for this?
And can you check whether the results still vary if you use slow step or manual stepping? Perhaps giving the page more time to load before extraction makes a difference?
Update: I made a new blank process and started data scraping only one customer. The first run was successful, but then it started returning 100 rows again.
I think the selector for the “next” button is the issue. I used UiExplorer to improve the selector, but that did not solve it; I suspect it is failing because the next button is located at the bottom of the page.
How can I improve the selector?
Right now it looks like this:
<webctrl css-selector='body>main>div>div>div>div>section>div>div>div>div>div>div>div>div>ul>li>a' parentid='page-content' tag='A' idx='7' />
I discovered that if I run the process normally, it returns 100 rows.
If I don’t close the web page left open from the previous run, it returns the correct number of rows.
That suggests the page needs more time to load, so adding a wait after opening the browser might make it work correctly; I will try this.
That selector does look problematic. The idx='7' will change if there are more or fewer matching tag='A' elements on the page; for example, if there are only 1 or 2 pages, the idx will be different as well.
If you open it in UiExplorer, look for attributes that are specific to the button, for example class='btn_next' or something similar that refers only to this specific button. You can also check whether any of the parent elements contain a specific attribute you can use. If you select one that makes the element unique, you will see the idx='7' attribute disappear from the selector.
In my example below, only the TABLE tag isn’t unique, so it shows idx='2' (the second table on the page). By adding the parent attribute, it becomes unique (there is only one TABLE inside a DIV that has class 'dataTables_scrollBody').
Hope this helps, sorry if I over-explained…
Weak selector:
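As a text sketch of the idea (the attribute values here are hypothetical; check UiExplorer for the real attributes on your page), a next-button selector anchored on a unique attribute instead of a positional index might look like:

```
<!-- Weak: positional idx='7' shifts whenever the number of <A> elements changes -->
<webctrl parentid='page-content' tag='A' idx='7' />

<!-- Stronger (hypothetical attributes): anchored on something unique to the button itself -->
<webctrl parentid='page-content' tag='A' class='btn_next' />
```

With a unique attribute like the class, UiExplorer no longer needs the idx to disambiguate, so the selector keeps matching the same button regardless of how many pagination links are shown.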
Update:
After fixing the selector and adding an “On Element Appear” activity to wait for the page to load properly (and to avoid reading 0 rows), I have a new issue.
The “On Element Appear” activity correctly identifies the table and runs the enclosed actions (I can see the pages being looped and scraped), but then I receive the error: “On Element Appear ‘TH’: Activity timeout exceeded”.
I have these properties for the activity:
How do I proceed with this issue? I don’t understand the error, as the activity seems to run correctly, even flashing the element with a red box because it is present.
The “Element Exists” activity does not return the error, but it also did not properly trigger when the table was empty. How can I handle this type of issue?
The “On Element Appear” activity is considered “running” while all actions inside it are being performed. If the time that takes exceeds the timeout (usually 30 or 60 seconds), the activity will fail. You can adjust the Timeout (in milliseconds) property shown in your screenshot to play around with this.
This depends on which element you are checking. If the <TABLE> tag exists but contains no
<TR> or <TH> elements, then your Element Exists might return TRUE for <TABLE>, while your scraping will still fail because there is no data. If you can observe which element is missing when the table loads incompletely, that is the element you should check for.
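For instance (the attribute values are placeholders; verify the real ones in UiExplorer), instead of targeting the table element itself, you could point Element Exists / On Element Appear at a data row inside the table, so the check only succeeds once actual data is present:

```
<!-- Sketch: match the first data row inside the table, not just the empty table skeleton -->
<webctrl tag='TABLE' />
<webctrl tag='TR' idx='1' />
```

That way an empty table (headers only, no rows) no longer counts as “appeared”, and you can branch on the result before starting the extraction.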