Text parsing from news outlet


#1

Dear community members,

this should be an easy one for you, but since I’m completely new to this programme, I’m completely lost.
I already tried the search and watched some tutorials, but neither helped.

My problem: I’m conducting research on the European Union and wanted to automate the text parsing.

The first site in question is https://www.premier.gov.pl/en/news/news.html, specifically the individual news articles on it. For each news article back to September 2014, I want to extract the URL, the date, the heading and the text.

Data Scraping didn’t work because I was only able to scrape the “first layer/overview” (what one sees when clicking the above-mentioned link).

Screen Scraping didn’t work because I couldn’t figure out how to “loop” it over articles other than the one specifically chosen.

I’d be very grateful if someone could help me. I guess it is one of the most basic tasks to do, but I really tried for some hours and, well, made no progress.

Thank you in advance and sorry for possible spelling mistakes - English is not my first language.

Yours sincerely

Cornelius


#2

@CoVe

Please follow the steps below:

  1. Use Data Scraping to retrieve the date, heading and URL (this data can be retrieved from the page that opens when you click the link you provided).
  2. Open each news article using the URLs captured in step 1.
  3. Read the news text.
  4. Repeat steps 2 and 3 until the news text from all URLs has been read.
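
For anyone who finds a loop easier to follow in code than in workflow activities, the four steps above can be sketched roughly as follows. This is plain Python, not UiPath syntax, and it is only meant to illustrate the shape of the loop. The names `collect_articles`, `fetch_page` and `extract_text`, the record fields, and all sample data are made up for the illustration; the page fetcher is passed in as a parameter so the sketch runs without touching a real site.

```python
from typing import Callable, Dict, List


def collect_articles(
    overview_rows: List[Dict[str, str]],
    fetch_page: Callable[[str], str],
    extract_text: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Visit every URL from the overview scrape and attach the article text."""
    articles = []
    for row in overview_rows:              # step 4: repeat for every URL
        page = fetch_page(row["url"])      # step 2: open the article
        text = extract_text(page)          # step 3: read the news text
        articles.append({**row, "text": text})
    return articles


# Stand-in data so the sketch runs offline (all values are invented).
fake_site = {
    "https://example.org/news/a.html": "  Body of article A  ",
    "https://example.org/news/b.html": "  Body of article B  ",
}
# Step 1 would normally produce these rows from the overview page.
rows = [
    {"date": "2014-09-01", "heading": "A", "url": "https://example.org/news/a.html"},
    {"date": "2014-09-02", "heading": "B", "url": "https://example.org/news/b.html"},
]

result = collect_articles(rows, fake_site.__getitem__, str.strip)
```

In UiPath terms, `rows` would correspond to the table that Data Scraping returns, and the loop body to whatever activities open a URL and read the text from the page.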

Hope this helps!


#3

Dear @Madhavi,

thanks for your kind reply!

  1. I went to the initial URL and news_page_38 while using “Data Scraping”. By that I think I managed to extract the date, heading and URL, as you said.
  2./3./4. What exactly do you mean by “open each news text” and “read news text”?

Do you imply that I can only extract Date, Heading and URL and that I have to extract the news text content manually?

The first command was “Extract data ‘X0 to Xn’ and extract correlated data ‘Y0 to Yn’ and extract correlated data ‘Z0 to Zn’”.
Isn’t it possible to additionally tell UiPath “and extract correlated data ‘body text of URLs correlated to X0 to Xn’”?

Thank you so much

all the best

Cornelius


#4

The first step will not retrieve the full content of each news article. You can only retrieve that by opening each article and reading its content. To open each article, use the URL that the data scraping step already returns for it.

Automation works exactly the way a person does. Here, even a person has to visit each page to read its content, and the same goes for the bot.
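
To illustrate how the article URLs come out of the overview page itself, here is a small Python sketch using only the standard library. It assumes the overview lists articles as ordinary `<a href="...">` links; the class name `LinkCollector`, the sample HTML and the example.org addresses are invented for the illustration, and the real site's markup will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Relative links are resolved against the overview URL.
                self.links.append(urljoin(self.base_url, href))


# Invented stand-in for the overview page.
sample_overview = """
<ul>
  <li><a href="/en/news/first-article.html">First article</a></li>
  <li><a href="/en/news/second-article.html">Second article</a></li>
</ul>
"""

collector = LinkCollector("https://example.org/en/news/news.html")
collector.feed(sample_overview)
article_urls = collector.links
```

Each entry in `article_urls` is then a page the bot has to open in turn, just as a person would click through the list.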