Text parsing from news outlet

CoVe · August 16, 2018, 7:42pm

Dear community members,

this should be an easy one for you, but since I’m completely new to this programme, I’m completely lost.
I already tried the search and watched some tutorials, but neither helped.

My problem: I’m conducting research on the European Union and want to automate the text parsing.

The first site in question is https://www.premier.gov.pl/en/news/news.html, specifically the individual news articles on it. For each news article back to September 2014, I want to extract the URL, the date, the heading and the text.
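Going “back to September 2014” means walking the overview pages from newest to oldest and stopping at the first article older than the cutoff. Below is a minimal sketch in Python (not a UiPath workflow). It assumes each overview page has already been turned into rows of date, heading and URL, and that articles are listed newest first. The sample rows, the `example.org` URLs, the ISO date format and the names `OVERVIEW_PAGES` and `collect_until` are hypothetical; the real site’s date format and paging are not shown in this thread.

```python
from datetime import date

# Hypothetical stand-in for the scraped overview pages, newest first.
# Each row is (ISO date, heading, URL); the real site may format dates differently.
OVERVIEW_PAGES = [
    [("2018-08-14", "Heading A", "https://example.org/news/a"),
     ("2016-03-02", "Heading B", "https://example.org/news/b")],
    [("2014-09-01", "Heading C", "https://example.org/news/c"),
     ("2014-08-30", "Heading D", "https://example.org/news/d")],
    [("2013-01-10", "Heading E", "https://example.org/news/e")],
]

# Assumption: "back to September 2014" is read as "from 1 September 2014 onwards".
CUTOFF = date(2014, 9, 1)


def collect_until(pages, cutoff):
    """Return (rows dated on or after `cutoff`, number of the last page opened).

    Assumes rows are ordered newest to oldest across pages.
    """
    kept = []
    for page_number, rows in enumerate(pages, start=1):
        for iso_date, heading, url in rows:
            if date.fromisoformat(iso_date) < cutoff:
                # Everything after this row is older, so stop walking pages.
                return kept, page_number
            kept.append({"date": iso_date, "heading": heading, "url": url})
    return kept, len(pages)


articles, last_page_visited = collect_until(OVERVIEW_PAGES, CUTOFF)
for row in articles:
    print(row["date"], row["heading"], row["url"])
```

With this sample data the walk stops on the second page, at the first row dated before 1 September 2014, so the third page is never opened.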

Using Data Scraping didn’t work because I was only able to scrape the “first layer/overview” (what one can see when clicking the above-mentioned link).

Using Screen Scraping didn’t work because I couldn’t figure out how to “loop” it over articles other than the one specifically chosen.

I’d be so grateful and glad if someone could help me. I guess it is one of the most basic tasks to do, but I really tried for some hours and, well, didn’t get anywhere.

Thank you in advance and sorry for possible spelling mistakes - English is not my first language.

Yours sincerely

Cornelius

Madhavi · August 17, 2018, 4:45am

@CoVe

Please follow the steps below:

1. Use Data Scraping to retrieve the date, heading and URL (these can be retrieved from the page that opens when you click the link you provided).
2. Open each news article from the URLs captured in step 1.
3. Read the news text.
4. Repeat steps 2 and 3 until the news text from all the URLs has been read.
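The steps above amount to a two-stage loop, which can be sketched outside UiPath. This is a Python illustration, not a UiPath workflow, and it runs offline on made-up data. The overview markup (`news-item`, `<time>`), the `example.org` URLs and the helper `open_and_read`, which stands in for opening a page and reading its text, are all hypothetical; the real site’s structure is not shown in this thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical overview page; the real markup will differ.
OVERVIEW_URL = "https://example.org/news/index.html"
OVERVIEW_HTML = """
<ul>
  <li class="news-item"><time>2018-08-14</time><a href="/news/a.html">Heading A</a></li>
  <li class="news-item"><time>2018-08-10</time><a href="/news/b.html">Heading B</a></li>
</ul>
"""

# Hypothetical article texts keyed by URL, standing in for the live pages.
ARTICLE_TEXT = {
    "https://example.org/news/a.html": "Text of article A.",
    "https://example.org/news/b.html": "Text of article B.",
}


class OverviewParser(HTMLParser):
    """Collect one date/heading/URL row per <li class="news-item">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rows = []
        self._in_item = False
        self._field = None  # row key that the current text belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "news-item":
            self._in_item = True
            self.rows.append({"date": "", "heading": "", "url": ""})
        elif self._in_item and tag == "time":
            self._field = "date"
        elif self._in_item and tag == "a" and attrs.get("href"):
            # Make the link absolute so it can be opened on its own.
            self.rows[-1]["url"] = urljoin(self.base_url, attrs["href"])
            self._field = "heading"

    def handle_endtag(self, tag):
        self._field = None
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._field:
            self.rows[-1][self._field] += data.strip()


def open_and_read(url):
    """Stand-in for steps 2 and 3: open the article and read its text."""
    return ARTICLE_TEXT[url]


# Step 1: date, heading and URL from the overview page.
parser = OverviewParser(OVERVIEW_URL)
parser.feed(OVERVIEW_HTML)
articles = parser.rows

# Step 4: repeat steps 2 and 3 until every URL has been read.
for article in articles:
    article["text"] = open_and_read(article["url"])
    print(article["date"], article["heading"], article["url"], article["text"], sep=" | ")
```

In this sketch the list of rows plays the role of the table that Data Scraping produces, and the `for` loop plays the role of a loop over that table’s rows.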

Hope this helps!

CoVe · August 20, 2018, 11:02am

Dear @Madhavi,

thanks for your kind reply!

I went to the initial URL and news_page_38 while using “Data Scraping”.
That way, I think I managed to extract the date, heading and URL, as you said.
2./3./4.: What exactly do you mean by “open each news text” and “read news text”?

Do you imply that I can only extract Date, Heading and URL and that I have to extract the news text content manually?

The first command was "Extract data ‘X0 to Xn’ and extract correlated data ‘Y0 to Yn’ and extract correlated data ‘Z0 to Zn’ ".
Isn’t it possible to additionally tell UiPath "and extract correlated data ‘body text of URLs correlated to X0 to Xn’"?

Thank you so much

all the best

Cornelius

Madhavi · August 23, 2018, 7:13am

With the first step alone, you will not be able to retrieve the full content of the news articles. You can retrieve that only by opening each article and then reading its content. To open each article, you can get its URL in the Data Scraping step itself.

Automation works exactly the same way a person does. In this case, even a person has to visit each page to read the data. The same goes for the bot.
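As the reply above says, each article page has to be visited to read its text. A small Python sketch (not a UiPath workflow) can illustrate this. It assumes the overview step has already produced rows with date, heading and URL, reads the body text from each article page, and writes one combined table. The article markup (`article-body`), the `example.org` URLs, the canned `PAGES` and the helper names are hypothetical; on a live site the `fetch` stand-in would be replaced by a real page load.

```python
import csv
import io
from html.parser import HTMLParser

# Rows as the overview step might deliver them (hypothetical values).
rows = [
    {"Date": "2018-08-14", "Heading": "Heading A", "URL": "https://example.org/news/a.html"},
    {"Date": "2018-08-10", "Heading": "Heading B", "URL": "https://example.org/news/b.html"},
]

# Hypothetical article pages; the real markup will differ.
PAGES = {
    "https://example.org/news/a.html":
        '<h1>Heading A</h1><div class="article-body"><p>First paragraph.</p>'
        "<p>Second paragraph.</p></div><div>Footer</div>",
    "https://example.org/news/b.html":
        '<h1>Heading B</h1><div class="article-body"><p>Only paragraph.</p></div>',
}


def fetch(url):
    """Stand-in for opening the page in a browser; returns canned HTML."""
    return PAGES[url]


class BodyParser(HTMLParser):
    """Collect the text chunks inside <div class="article-body">."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._depth = 0  # greater than 0 while inside the article body

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1  # nested div inside the body
        elif dict(attrs).get("class") == "article-body":
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.chunks.append(data.strip())


def read_article_text(html):
    parser = BodyParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Visit every URL, as a person would, and attach the text to its row.
for row in rows:
    row["Text"] = read_article_text(fetch(row["URL"]))

# Write the combined table; a file opened with newline="" would work the same way.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["Date", "Heading", "URL", "Text"])
writer.writeheader()
writer.writerows(rows)
print(output.getvalue())
```

The design choice here is to keep the overview rows as the single source of URLs and only add a `Text` column to them, so date, heading, URL and text stay together in one table.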
