Scraping data from HTML table with no ID or distinct class

Hi!

I’m new to UiPath and currently trying to scrape data from Google Patents. So far I’ve gotten most of the data I need by constructing selectors based on HTML IDs and classes. But one table from which I need to pull data doesn’t have a unique ID or a distinct class.

See this page for an example:
https://patents.google.com/patent/US9791861

I need to get all the data from the first column of the table under the “Cited By” h3 header. As you can see, the table itself and all the links share the same class as most other links on the page. So far I haven’t been able to target this table specifically and I’m all out of ideas. The goal is to get each patent number in this table into a single string in UiPath…

The header above the table, which says “Cited By”, has an ID as you can tell. So I tried using an anchor base to look for the element directly beneath it, but I couldn’t get it to work and I don’t know if it’s the right kind of solution.

Any help would be greatly appreciated, whether that’s a full solution or just a pointer in the right direction on which activities I should use.

@NikEng

Welcome to the forum!

A quick check showed the following:
data scraping is a good starting point, since it retrieves all the Cited By items by publication number (including the first two and the Family To Family Citations group).

However, I would suggest building this retrieval on the Find Children activity, as it can be made more reliable.

  • in general, we are interested in the div that follows the h3 “Cited By (x)”
  • this div can then be scraped at a more detailed level
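To make the “div following the h3” idea concrete outside of UiPath, here is a minimal Python sketch. It is not the UiPath workflow itself: the markup is a made-up, heavily simplified stand-in for the page, and the id values `citedBy` and `similarDocuments` as well as the helper name `div_after_heading` are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the page markup.
# The id values are assumptions, not copied from the real page.
SAMPLE = """
<section>
  <h3 id="citedBy">Cited By (2)</h3>
  <div>cited-by table</div>
  <h3 id="similarDocuments">Similar Documents</h3>
  <div>similar-documents table</div>
</section>
"""

def div_after_heading(root, heading_id):
    """Return the first <div> sibling that follows the <h3> with the given id."""
    # ElementTree elements have no parent pointer, so look at each
    # element's direct children to find the heading and its siblings.
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag == "h3" and child.get("id") == heading_id:
                for sibling in children[i + 1:]:
                    if sibling.tag == "div":
                        return sibling
                return None
    return None

root = ET.fromstring(SAMPLE)
target = div_after_heading(root, "citedBy")
print(target.text)
```

The point of the sketch is the order of operations: locate the heading by its id, step up to its parent, then take the next div among the parent’s children. That keeps the result scoped to one section instead of matching every similar table on the page.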

At first glance I don’t see any major blockers for the retrieval.

If you need more help, just let us know. Happy automation :slight_smile:

Hi,

Thank you for your reply!

I tried using data scraping to get the citations, and it works in some cases.

See the attached file: it’s a working solution, at least for some patents, but it doesn’t seem to work when I try it with another patent.

A patent which works (the one in the file):
https://patents.google.com/patent/US9802638B1

A patent for which the solution doesn’t work:
https://patents.google.com/patent/KR20130131497A

I rebuilt the data scraping activity based on the patent that didn’t work, and the selector ended up being the same apart from a minor difference in the “idx” parts. So I ended up removing the idx parts of the selectors. I then got results from both pages, but the scrape no longer stopped at the “Cited By” table: it kept going through all the links following it, including “Similar Documents”.

Could you explain a bit more how I would go about using the “Find Children” activity? Would I find all the children of the div in question, meaning the rows of the table in that div? I’m thinking that such a solution would still require me to point out the div following the h3 in some way, which I don’t know how to do.

Or should I use the Find Children activity to find all divs on the page, and then somehow point out the div following the h3?

I’m not really following how to apply the “Find Children” activity.

Main.xaml (10.0 KB)

@NikEng
I will have a look at the second, non-working link from the viewpoint of doing this with Find Children.

As far as I understand, you only want to retrieve the Cited By items.

Give me some time and I will come back with feedback.

@NikEng

Based on the following assumptions:

  • scrape the patent info under Cited By
  • omit the patent info under the Family To Family Citations section

we would have this structure:

and can work on it as follows:
Red lines:

  • get the h3 Cited By element
  • get the parent div of the h3
  • fetch the div after the h3
  • fetch the div with the class ‘tbody style-scope patent-result’ (d1)

Blue line:

  • find all divs under d1 with the class name ‘tr style-scope patent-result’ (d2List)
  • find in d2List the div with the inner text Family to … (yellow marks): FamilyDiv
  • reduce d2List to the rows from the start up to FamilyDiv

Green line:

  • iterate over the remaining divs in d2List and retrieve the patent info from each link
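For readers without the workflow at hand, the blue and green steps can be sketched in Python. This is only an illustration of the logic, not the contents of the attached xaml: the sample markup is invented, the publication numbers are placeholders, and the function name `cited_by_numbers` is hypothetical. Only the two class names are taken from the steps above. The sketch reads “reduce d2List” as keeping the rows before the Family To Family Citations row, in line with the stated assumption that the family section is omitted.

```python
import xml.etree.ElementTree as ET

# Invented sample standing in for d1; the publication numbers are placeholders.
SAMPLE = """
<div class="tbody style-scope patent-result">
  <div class="tr style-scope patent-result"><a href="#">US0000001A</a></div>
  <div class="tr style-scope patent-result"><a href="#">US0000002A</a></div>
  <div class="tr style-scope patent-result">Family To Family Citations</div>
  <div class="tr style-scope patent-result"><a href="#">US0000003A</a></div>
</div>
"""

ROW_CLASS = "tr style-scope patent-result"

def cited_by_numbers(d1):
    """Collect link texts from the rows that precede the family-citations row."""
    # Blue line: all row divs under d1 (d2List).
    d2_list = [d for d in d1.iter("div") if d.get("class") == ROW_CLASS]
    numbers = []
    for row in d2_list:
        text = "".join(row.itertext()).strip()
        # Stop at the FamilyDiv so the family section is omitted.
        if text.lower().startswith("family to family"):
            break
        # Green line: read the patent info from the link in the row.
        link = row.find(".//a")
        if link is not None and link.text:
            numbers.append(link.text.strip())
    return numbers

d1 = ET.fromstring(SAMPLE)
numbers = cited_by_numbers(d1)
# The original goal was a single string, so join the list at the end.
single_string = ", ".join(numbers)
print(single_string)
```

If a page has no Family To Family Citations row, the loop simply runs to the end of d2List, so every row in the table is kept.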

Here is an initial implementation that demonstrates the scraping in general:
NikEng.xaml (14.7 KB)

  • switching to another browser can be done by reworking the Attach Browser selector
  • it is recommended to stabilize the flow further as well
  • for demo purposes the patent info is retrieved into a string list

Let us know your feedback

Hi ppr,

Sorry for the late reply, I’m in the middle of an exam week.

I want to start by thanking you for taking the time and giving me a solution to this problem. I would have been very unlikely to figure this out myself. Super appreciated!

When I opened the file, all of the “find children” activities came up as missing. Perhaps I did not have the same packages installed as you did (I’m on an academic alliance version of UiPath).

However, by looking at the xaml file in Notepad I was able to recreate the activities with the same parameters you had used. This had the added advantage of making me go through the logical structure in depth, which was very educational.

So big thanks for helping me solve the problem as well as giving me a lot of insights into how to do scraping based on relative elements!
