Data Scrape specifics

I have run into a problem where data scraping a site pulls in additional data that I do not want. I have tried adjusting the selectors, but it is difficult to stop the scrape from grabbing the unwanted information.
That is why I am wondering whether there is a way to use some kind of If statement to check the Excel sheet where the scraped data is written, look for patterns, and delete the matching entries.
I will post pictures below to show what I mean!
Hopefully this is not impossible to fix.

From this:

To this:

Data scraped web site:

To answer my own question:

I found a solution and will post it here for others who run into the same issue.

My solution was to use a screen scrape instead of a data scrape.
However, that meant I could no longer use the “Write Range” activity that I had applied to the previous data scrape output.
Instead, I used the “Generate Data Table” activity, with the output of the “Screen scrape” activity as its input.

I hope this is helpful to someone in the future.

That is the best way. Could you share the website, or is that not possible?
Otherwise, use the Substring function to change the row data.

Try this sample:

ChangeText.zip (6.3 KB)
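
For readers who cannot open the attachment, the general idea of cleaning the rows after scraping can be sketched in Python (this is not UiPath code and not the contents of the zip). The sample rows, the `UNWANTED` pattern, and the `clean_rows` helper are all assumptions made for illustration, since the real scraped values are not shown in this thread.

```python
import re

# Hypothetical rows as they might land in the sheet after scraping.
rows = [
    ["Speed", "85 +4"],
    ["Stamina", "78"],
    ["Unwanted line", ""],
]

# Assumed pattern for the unwanted part: a trailing "+N" suffix.
UNWANTED = re.compile(r"\s*\+\d+$")


def clean_rows(rows):
    """Strip the unwanted suffix from each value and drop rows left empty."""
    cleaned = []
    for name, value in rows:
        value = UNWANTED.sub("", value).strip()
        if not value:
            # Nothing useful remains in this row, so skip it.
            continue
        cleaned.append([name, value])
    return cleaned


for row in clean_rows(rows):
    print(row)
```

In a workflow, the same check would sit in a loop over the table rows, with the pattern adjusted to whatever the unwanted text actually looks like.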

@ddochea

Thank you for your response!

I am not quite sure how your automation works, but I have already found a solution using screen scraping rather than data scraping, and I think that might be the better fit for what I am creating.
I appreciate your help and your solution, though. If you want to see the website for yourself, here it is:
https://we2020.kouryakuki.net/players/detail/13662/

Sorry I’m late. I was on my way home from work. :sweat_smile:

Try this. I tried to make the output as similar as possible to your uploaded image.
I hope it helps.

Solution.zip (271.5 KB)

Screen scrape might work for this case, but keep in mind one thing for Data Scraping:

If you don’t build a DataTable beforehand and define its schema (which columns it has and the data type of each column), Data Scraping will try to create one automatically, which can lead to bad behaviour like this. In this case, it probably tried to make an Int32 out of the 2nd column (which fails with the +4), but there are other situations where this might bite you (specifically, text that gets interpreted as dates).

Do not worry about it!

Thank you for a clear explanation. I am pretty new to automation and RPA in general so there is a lot to try and understand…
I understand a bit more now after looking at your automation, but in my case the screen scrape gets the work done! I will, however, use your example for future projects. Thank you very much.

Sorry, I am not quite sure I understand what you mean. Could you elaborate on what it means to build a DataTable beforehand and define a schema? I have never done that before, and if it is something that should be used often, I am more than happy to learn.

When you use Data Scraping, the Extract Structured Data activity outputs the results to a DataTable variable. If this DataTable wasn’t created beforehand, the activity creates it at that moment.

However, by creating it beforehand, you can define what type of data will be in each column. The easiest way is to use the “Build DataTable” activity and set each column to the correct data type (in your case, String).
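
To illustrate the difference outside UiPath, here is a minimal Python sketch (not UiPath code). The `infer_cell` and `build_table` helpers and the sample rows are hypothetical, and the inference rule is only a stand-in: it is not how Extract Structured Data actually decides on types.

```python
def infer_cell(text):
    """Naive stand-in for automatic typing: anything that parses as an integer becomes one."""
    try:
        return int(text)
    except ValueError:
        return text


def build_table(rows, schema=None):
    """Use the declared column types if a schema is given, otherwise guess cell by cell."""
    if schema is None:
        return [[infer_cell(cell) for cell in row] for row in rows]
    return [[col_type(cell) for col_type, cell in zip(schema, row)] for row in rows]


# Made-up scraped rows; every value arrives as text.
raw = [["Speed", "+4"], ["Stamina", "85"]]

guessed = build_table(raw)                      # types are guessed
declared = build_table(raw, schema=[str, str])  # both columns declared as String

print(guessed)
print(declared)
```

With guessing, the "+4" no longer comes back as the text that was scraped; with both columns declared as strings, every value is kept exactly as it appeared on the page.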

Incidentally, this also goes for writing the output to a Workbook. Excel tends to do the same thing and will, for example, remove leading zeroes from a “number” if the cell you’re writing to hasn’t been formatted beforehand (this can be a huge issue if you’re dealing with, for example, serial numbers).
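
The leading-zero effect is easy to reproduce in a few lines of Python (again, only an illustration, not Excel or UiPath code). The two helper functions and the serial number are made up for this sketch; they mimic a cell that stores a number versus a cell that was formatted as text first.

```python
def write_as_number(text):
    """Mimic a cell that stores the value as a number: only the numeric value survives."""
    return str(int(text))


def write_as_text(text):
    """Mimic a cell formatted as text beforehand: the characters are stored unchanged."""
    return text


serial = "000123"  # made-up serial number

print(write_as_number(serial))  # leading zeroes are gone
print(write_as_text(serial))    # leading zeroes survive
```

Once the value has been stored as a number, the original width is lost, so the zeroes can only be restored if you already know how many digits the serial number should have.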

Thank you for the detailed explanation!

I am starting to understand a little bit more of what you mean and I will try to use the activity “Build DataTable” next time and see how it works.

So far I have only used Data scraping and let it create data tables on its own and it has worked so far, but if there ever are any problems I know what to check first.
