Extract HTML Table from nth Row using Data Scraping in UiPath

I have a table in a web page with let’s say 10 rows with two columns: Col1 and Col2.

How can I extract data from the nth row to the mth row using data scraping in UiPath? For example, I want to extract data from the 3rd row and the next rows up to the 7th row, i.e. 5 rows in total.

Hi

Welcome to the UiPath community.

We won’t be able to do that filtering while performing data scraping.

But once the complete data has been retrieved by data scraping, we will have it in a DataTable.

In that DataTable we can do the filtering as needed.

If we want to remove the first 4 rows with no condition involved, you can use this Assign statement:
DT = DT.AsEnumerable().Skip(4).CopyToDataTable()

And on the same DT, to get the top 7 rows, use this in an Assign activity:

DT = DT.AsEnumerable().Take(7).CopyToDataTable()

For your example (3rd row through 7th row), skip the first 2 rows and take the next 5:

DT = DT.AsEnumerable().Skip(2).Take(5).CopyToDataTable()
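To make the skip/take arithmetic explicit, here is a small language-neutral sketch in Python (not a UiPath activity). The helper name `rows_between` and the sample rows are made up for illustration; it only shows that "rows n to m" means skipping n - 1 rows and taking m - n + 1 rows.

```python
from itertools import islice

def rows_between(rows, first, last):
    """Return rows first..last (1-based, inclusive).

    Equivalent in spirit to Skip(first - 1).Take(last - first + 1).
    """
    skip = first - 1
    take = last - first + 1
    return list(islice(rows, skip, skip + take))

# Hypothetical table with 10 rows and two columns (Col1, Col2).
table = [(f"a{i}", f"b{i}") for i in range(1, 11)]

selected = rows_between(table, 3, 7)
print(len(selected), selected[0], selected[-1])
```

So for rows 3 to 7 the skip count is 2 and the take count is 5.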

Hope this helps.
Please let me know if you have any further queries or need clarification.

Cheers @Meghdut_Saha

Hi

Thank you for your reply. The actual issue is that the HTML table is huge, with 90000+ rows on a single page, and scraping (both modern and classic) fails in the wizard with the error “HRESULT E_FAIL has been returned from a call to a COM component”. So I thought of scraping the first 1000 rows, then the next 1000 rows and so on, and finally merging everything into one data table once all the rows are scraped.

I have also tried the Find Children activity followed by the Get Text activity, and I am able to get the data. But it takes a very long time to loop through the 90000+ UiElements returned by Find Children. So I am looking for a faster way using data scraping.

We can limit the number of rows to be scraped in the data scraping wizard, but can we manipulate the metadata to add a condition so that the second run starts at row 1001 for the next 1000 rows, the third run at row 2001 for the next 1000 rows, and so on?
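The batching idea itself can be sketched as follows, in Python and independent of UiPath. This does not show how to make the wizard start at an offset; `fetch_batch` is a hypothetical stand-in for "one scrape run that returns rows skip+1 to skip+take", and the batch size of 1000 comes from the description above.

```python
def batch_ranges(total_rows, batch_size):
    """Yield (skip, take) pairs that cover total_rows in order."""
    for skip in range(0, total_rows, batch_size):
        yield skip, min(batch_size, total_rows - skip)

def scrape_in_batches(fetch_batch, total_rows, batch_size=1000):
    """Call fetch_batch(skip, take) per batch and merge the results."""
    merged = []
    for skip, take in batch_ranges(total_rows, batch_size):
        merged.extend(fetch_batch(skip, take))
    return merged

# Hypothetical source of 2500 rows; a list slice plays the role of one scrape run.
source = list(range(1, 2501))
merged = scrape_in_batches(lambda skip, take: source[skip:skip + take], len(source))
print(list(batch_ranges(2500, 1000)))
```

The last batch is shorter than the others whenever the row count is not a multiple of the batch size.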

In our experience, HTML/XML text processing runs faster than Find Children.

Give this a try:

  • extract the table HTML with the Get Attribute activity (attribute: outerhtml)
  • process the HTML, e.g. with the XML API, and extract the table data
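As a rough illustration of the second step, the sketch below parses a table's outer HTML as text and collects the cell values row by row. It is written in Python with the standard-library `html.parser`, not with UiPath activities; the class name `TableRows`, the function `extract_rows` and the sample HTML are assumptions made up for this example.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td> inside <tbody>, one list per <tr>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._in_tbody = False
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self._in_tbody = True
        elif tag == "tr" and self._in_tbody:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # strip() also removes the non-breaking space produced by &nbsp;
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "tbody":
            self._in_tbody = False

def extract_rows(outer_html):
    parser = TableRows()
    parser.feed(outer_html)
    parser.close()
    return parser.rows

# Hypothetical outer HTML of a small table.
sample = (
    "<table><thead><tr><th>Col1</th><th>Col2</th></tr></thead>"
    "<tbody><tr><td>a</td><td>1</td></tr>"
    '<tr><td><a href="">b</a>&nbsp;</td><td>2</td></tr></tbody></table>'
)
print(extract_rows(sample))
```

The whole table is handled as one string in a single pass, which is the reason text processing tends to scale better than visiting each UI element separately.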

Let us know if you need further help on this.

If the URL is public, please share it with us.

Great
Try using find children and get the table as mentioned below instead of using get text

@Meghdut_Saha

@Meghdut_Saha

working with: https://nxtgenaiacademy.com/webtable/
find starter help here:
ExtractData_XMLApproach.xaml (9.5 KB)

In your flow, which activity did you use after the Build Data Table activity? It appears as a missing activity in my workspace. Do I need to install a package for it? I am using Studio 2020.10.2. Please elaborate on that activity.

I have already mentioned that I tried this and it was successful, but you can imagine how much time it takes to loop through each child when the table has 90000+ rows. I need a faster solution to scrape the data, as the business demands.

I’m sorry, the data is private so I cannot share it, but if you share the detailed flow, I can try it and let you know whether it works for me. I have a table similar to the example you have given, with more rows and columns.

The UiPath.Web.Activities package was referenced / added via Manage Packages.

Just ensure that Build Data Table defines the right number of columns.
The rest of the code is dynamic and will iterate over all td elements of each tr.

Hi

I have tried your way but I am getting an error while deserializing. Can you help me a bit?

Though I can’t give you the URL, I have created a small table similar to my requirement, and the HTML code is given below (as a new user, I am not allowed to attach files). Please have a look.

<table cellpadding="0" cellspacing="1" class="ex" uipath_custom_id="1">
	<thead>
		<tr>
			<th>TransID</th>
			<th>Date/Time</th>
			<th>CSR name</th>
			<th>Account Currency</th>
			<th>Amount</th>
			<th>From</th>
			<th>To</th>
			<th>Comment</th>
		</tr>
	</thead>
	<tbody>
		<tr>
			<td align="right"><a href="" target="_blank">123456</a>&nbsp;</td>
			<td align="right">3/14/2021 9:52:07 AM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">0.50</td>
			<td align="left" bgcolor="Yellow"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>&nbsp;</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">259426</a>&nbsp;</td>
			<td align="right">3/21/2021 12:03:09 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">-200.00</td>
			<td>&nbsp;</td>
			<td align="left" bgcolor="Yellow"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">259496</a>&nbsp;</td>
			<td align="right">3/22/2021 12:02:17 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">215.00</td>
			<td align="left" bgcolor="Yellow"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>&nbsp;</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">259953</a>&nbsp;</td>
			<td align="right">3/28/2021 12:24:27 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">-215.00</td>
			<td>&nbsp;</td>
			<td align="left" bgcolor="Yellow"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">260687</a>&nbsp;</td>
			<td align="right">4/7/2021 1:25:32 AM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">450.00</td>
			<td align="left" bgcolor="Yellow"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>&nbsp;</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">260722</a>&nbsp;</td>
			<td align="right">4/7/2021 1:15:56 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">215.00</td>
			<td align="left" bgcolor="Yellow"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>&nbsp;</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">260737</a>&nbsp;</td>
			<td align="right">4/7/2021 4:23:42 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">2,500.00</td>
			<td align="left" bgcolor="Yellow"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>&nbsp;</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">260981</a>&nbsp;</td>
			<td align="right">4/10/2021 2:39:09 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">500.00</td>
			<td align="left" bgcolor="Yellow"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>&nbsp;</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">260983</a>&nbsp;</td>
			<td align="right">4/10/2021 3:06:40 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">100.00</td>
			<td align="left" bgcolor="Yellow"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>&nbsp;</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">261609</a>&nbsp;</td>
			<td align="right">4/18/2021 3:19:05 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">2,037.00</td>
			<td align="left" bgcolor="Yellow"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>&nbsp;</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">261609</a>&nbsp;</td>
			<td align="right">4/18/2021 3:19:59 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">20.00</td>
			<td align="left" bgcolor="Yellow"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>&nbsp;</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">261621</a>&nbsp;</td>
			<td align="right">4/18/2021 6:08:03 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">700.00</td>
			<td align="left" bgcolor="Yellow"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>&nbsp;</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">262125</a>&nbsp;</td>
			<td align="right">4/25/2021 3:47:14 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">2,925.00</td>
			<td align="left" bgcolor="Magenta"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>&nbsp;</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">262132</a>&nbsp;</td>
			<td align="right">4/25/2021 5:26:00 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">1,950.00</td>
			<td align="left" bgcolor="Magenta"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>&nbsp;</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">262192</a>&nbsp;</td>
			<td align="right">4/26/2021 2:48:37 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">2,915.00</td>
			<td align="left" bgcolor="Magenta"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>&nbsp;</td>
			<td>Approved</td>
		</tr>
		<tr>
			<td align="right"><a href="" target="_blank">268112</a>&nbsp;</td>
			<td align="right">7/25/2021 3:39:43 PM</td>
			<td align="right">System</td>
			<td align="center">USD</td>
			<td align="right">-20.00</td>
			<td>&nbsp;</td>
			<td align="left" bgcolor="Yellow"><a href="" target="_blank">maskedUser</a>
			</td>
			<td>Approved</td>
		</tr>		
	</tbody>
	<tfoot>
		<tr>
			<td align="right" colspan="3">	<b>∑=&nbsp;</b>
			</td>
			<td align="center">USD</td>
			<td align="right">15,892.50</td>
			<td colspan="3">&nbsp;</td>
		</tr>
	</tfoot>
</table>

You can ignore the <tfoot> as it is not required, and you can even ignore the <th> elements as they are known and fixed.

Sure, I will help you. Just one question first:
When you manually open the URL provided above (I used Firefox; feel free to adapt it to Chrome or another browser), did the starter help xaml run at your end or not?

Hi

I am running it in Chrome, though the error occurred in Chrome, Firefox and Edge alike. However, the issue is resolved: I just replaced "&nbsp;" in the HTML string with an empty string, after which the deserialization succeeded and the data table was created. Thank you for showing me the right way, it is a great help to me.
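For readers wondering why that replacement helps: `&nbsp;` is an HTML entity that plain XML does not define, so a strict XML parser rejects it. The sketch below reproduces this with Python's standard `xml.etree.ElementTree` rather than the UiPath deserializer (an assumption that both behave alike on this point); the `parse_cells` helper is made up for illustration and the row is shortened from the sample table.

```python
import xml.etree.ElementTree as ET

# One row, shortened from the sample table in this thread.
row = (
    '<tr><td align="right"><a href="" target="_blank">123456</a>&nbsp;</td>'
    "<td>&nbsp;</td><td>Approved</td></tr>"
)

def parse_cells(fragment):
    """Remove &nbsp; and return the text of each <td> in the fragment."""
    cleaned = fragment.replace("&nbsp;", "")
    tr = ET.fromstring(cleaned)
    return ["".join(td.itertext()).strip() for td in tr.iter("td")]

# Parsing the raw fragment fails because &nbsp; is not a predefined XML entity.
try:
    ET.fromstring(row)
    raw_parsed = True
except ET.ParseError:
    raw_parsed = False

print(raw_parsed, parse_cells(row))
```

Cells that held only `&nbsp;` come back as empty strings, which matters later when a column is converted to a numeric type.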

I am marking your post as solution.

Thank you again.

Let’s try to add some more polish on top.
Does your datatable have the th elements at the beginning?

No, the data table has one blank row at the beginning. The rest of the table is perfect, and it has the footer row as well. In the Build Data Table activity I created the same columns as in the HTML table.

OK, I was just thinking about using the th elements to construct the datatable dynamically.
But if it is working then it is fine.
Just do your final testing and let us know the result by marking the solving post as solution, or post your further open questions. Maybe you can also give us some feedback on the number of rows and the processing time of the datatable extraction. Thanks

I have tested this with the 14 users I currently have who have huge data, and it is working fine so far. So I am marking it as a solution. I will come back if I face any issue in the future.

Also, I find it faster than the normal data scraping wizard available in UiPath.

Thanks for your help @ppr

Hi

Just one more thing to resolve: the amount column is being converted to String. Is there a fast way to convert it back to Int32? I need those values in Int32 format. I have tried the ChangeDataColumn activity, but it fails with the error: cannot convert DbNull to integer. Any idea? @ppr
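The conversion problem can be pictured with the Python sketch below; it is not a UiPath expression. It assumes the amounts look like those in the sample table (comma as thousands separator, period as decimal point) and that blank cells should become an empty value instead of raising an error. Since the sample contains values such as 0.50, the sketch converts to a decimal type; an integer type would only fit if every amount were a whole number. The name `to_amount` is made up for illustration.

```python
from decimal import Decimal, InvalidOperation

def to_amount(text):
    """Convert '2,500.00' or '-215.00' to Decimal; return None for blank or invalid cells."""
    if text is None:
        return None
    cleaned = text.replace(",", "").strip()
    if not cleaned:
        return None
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

# Values taken from the sample table, plus a blank cell.
raw_amounts = ["0.50", "-200.00", "2,500.00", ""]
amounts = [to_amount(value) for value in raw_amounts]
print(amounts)
```

Handling the blank case before converting is what avoids the null conversion error.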

Give me a little time.

Can you show me the actual LINQ From i in… statement that is in use?
And give some details on the column structures and how the empty target structure is set up.