Error in data scraping process

matthias.harenburg · March 5, 2021, 3:00pm

Hi!
I have a pretty simple task I want to automate. The bot has a defined list of search terms. With this list it should open an online news search engine and scrape, for each search term, the top three news items that appear (headline, publication platform, publication date, URL…). The results are to be entered into a second Excel file.
All in all the automation flow is working. However, the data the bot scrapes does not cover all search terms, only the first. The bot does perform the search for all 16 search terms, but the scraping result for each of the 16 searches is always the result for the first term.
What could I have done wrong?

Srini84 · March 6, 2021, 5:18am

@matthias.harenburg

Check whether you have limited the number of results to 1 in the properties of the Extract Structured Data activity.

Hope this helps you

Thanks

matthias.harenburg · March 8, 2021, 12:23pm

Thanks for the idea @Srini84, but that is not the issue. I have set the number of results to be scraped to 3. The bot does scrape the first three data entries of what it has found after the search. But it just keeps writing these same three entries into the results file when iterating through the “for each row” section. If it worked correctly, it would provide 16x3 different results (three results for each of the 16 searches).

ppr · March 8, 2021, 12:52pm

Try setting the scope of the extractdata datatable variable to a higher scope.
Could you also give us some more details on the part where the merge datatable is done (variable names, implementation …)? Thanks

matthias.harenburg · March 8, 2021, 1:50pm

Hi Peter,

I am attaching the json of that automation to this message. Does it provide you with the information you are looking for?

Thanks
Matthias

project.json (1.0 KB)

ppr · March 8, 2021, 1:52pm

@matthias.harenburg
The project.json will not give us the details we need about your flow and requirements.

As mentioned above, please share details on:

source (url, screenshot)
expected result / output

Thanks

matthias.harenburg · March 8, 2021, 1:56pm

The data source is a public news page for Korean news (news.naver.com). The automation searches this page for a list of defined search terms (e.g. 지멘스 헬시니어스). For each search term it should scrape information from the first three news items that appear (headline, publication date, publication platform, URL of the platform) and write the result into an Excel list.

As said: all in all it works. The only error is that it does not write the results for the 16 defined search terms into the Excel file but repeats the results of the first search term 16 times.

ppr · March 8, 2021, 2:01pm

Perfect, we are making progress with the analysis.

As mentioned,

check the scope of the extractdata datatable variable
critically check the merge datatable part

Run a debug session and check whether different results are actually retrieved for each search term.

matthias.harenburg · March 8, 2021, 2:18pm

As a beginner, I have set the scope of every variable to the whole flowchart. So all variables (incl. the variable for the extractdata datatable) span the whole process.

I ran a debug session, stopped the flow after every step (“step over”) and checked the values in the Locals tab. The error already occurs at the datatable step: the datatable of the extractdata part always outputs the same data for each search term. It is not an error in writing the data from the datatable to Excel. The datatable itself is already wrong.
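
The check described above (comparing what the extraction returns for each search term) can be written down as a small sketch. This is plain Python rather than a UiPath workflow, and the function name `find_repeated_extractions`, the sample data, and the column names are assumptions made only for illustration. It flags every search term whose extracted rows are identical to those of an earlier term.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def find_repeated_extractions(
    results_by_term: Dict[str, List[Row]]
) -> List[Tuple[str, str]]:
    """Return (term, earlier_term) pairs whose extracted rows are identical."""
    seen: Dict[tuple, str] = {}
    repeats: List[Tuple[str, str]] = []
    for term, rows in results_by_term.items():
        # Turn the rows into a hashable fingerprint so they can be compared.
        fingerprint = tuple(tuple(sorted(row.items())) for row in rows)
        if fingerprint in seen:
            repeats.append((term, seen[fingerprint]))
        else:
            seen[fingerprint] = term
    return repeats


# Made-up sample data: "b" repeats the rows of "a", "c" is different.
sample = {
    "a": [{"Headline": "First", "URL": "https://example.com/1"}],
    "b": [{"Headline": "First", "URL": "https://example.com/1"}],
    "c": [{"Headline": "Other", "URL": "https://example.com/2"}],
}
repeats = find_repeated_extractions(sample)
print(repeats)
```

If every term after the first shows up in the result, the extraction step itself is returning stale data, which matches what the Locals tab showed here.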

matthias.harenburg · March 9, 2021, 8:30am

Please find below a step-by-step description of how the automation is designed:

1. Build DataTable (var. “NaverSearch”), which contains the elements that the bot should scrape in the end.
2. Within the Excel application scope: “read range” of an existing Excel file with the different search terms to be searched for (resulting in the datatable var. “SearchTerms”).
3. Open the browser and navigate to “news.naver.com”.
4. For each row in var. “SearchTerms”, type the search term into the search field of news.naver.com.
5. Extract structured data (URL, Headline, PublicationPage, PublicationDate) in the tab that opens when the search is performed. The result of the data scrape is a new datatable (var. “DataExtraction”).
6. Merge datatable DataExtraction into the datatable NaverSearch.
7. Write range NaverSearch into an Excel file.
8. Close the browser.
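
For reference, the intended behaviour of the flow described above can be sketched in plain Python. This is not the UiPath workflow itself. `extract_top_results` is a hypothetical stand-in that returns made-up rows instead of scraping news.naver.com, and the extra `SearchTerm` column is an assumption added so that each row can be traced back to its search. The point is only that the extraction result is produced fresh inside every loop iteration and then merged into the accumulated table.

```python
from typing import Dict, List

Row = Dict[str, str]


def extract_top_results(term: str, limit: int = 3) -> List[Row]:
    """Hypothetical stand-in for the data scraping step; returns fake rows."""
    return [
        {
            "SearchTerm": term,  # assumed extra column, not in the original flow
            "Headline": f"Headline {i} for {term}",
            "PublicationPage": f"Platform {i}",
            "PublicationDate": "2021-03-08",
            "URL": f"https://example.com/{term}/{i}",
        }
        for i in range(1, limit + 1)
    ]


def run_searches(search_terms: List[str], limit: int = 3) -> List[Row]:
    naver_search: List[Row] = []  # accumulated result table
    for term in search_terms:  # "for each row" over the search terms
        # A new extraction result is created in every iteration.
        data_extraction = extract_top_results(term, limit)
        # Merge the per-search rows into the accumulated table.
        naver_search.extend(data_extraction)
    return naver_search


search_terms = [f"term{n}" for n in range(1, 17)]  # 16 placeholder terms
naver_search = run_searches(search_terms)
print(len(naver_search))
```

With 16 search terms and a limit of 3, the accumulated table should hold 16x3 rows with distinct URLs, so counting distinct URLs or distinct `SearchTerm` values in the output is a quick way to spot repeated results.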
