Data Scraping and Write to Excel Worksheet

I am developing an RPA tool that scrapes data from a site, manipulates the scraped text using regex, and then writes the result back to an Excel worksheet.

The above text is an example of the text I will be scraping. I have used a Get Full Text activity to select the Div that this text is under. However, the resulting text contains a lot of whitespace and special characters.

The regex I have tried does not return the result I am after.

Using the Replace activity I tried the following:
- Other.*

  • I want to remove all text after ‘- Other’

^.*?(\d{4})

  • I want to remove all text up to the first occurring 4-digit number (i.e. 2008 in the above example)

I then used a Matches activity which has the following regex:
(-\s+)[^-]

  • I used this to break the text into separate lines at each dash character (-).

I then have an Excel Application Scope in which I iterate through each item of the previous output and write it back to the Excel worksheet. However, the resulting text is not what I expected the regex to return.
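
For reference, here is a minimal Python sketch of the three steps described above. It is only an illustration under assumptions: the sample text is hypothetical (the real scraped text is not shown), and the patterns are adjusted rather than copied. One possible cause of the unexpected output is that in .NET regex `.` does not match a newline unless the Singleline option is set, so `- Other.*` would only strip the rest of that one line. Also, replacing `^.*?(\d{4})` with nothing removes the 4-digit number itself, and `(-\s+)[^-]` matches only the dash, the whitespace and one following character, not the whole line.

```python
import re

# Hypothetical sample; the real scraped text is not shown in the post.
raw = (
    "  Section header\r\n\r\n"
    "  2008   Some heading text \r\n"
    "   -   First level item \r\n"
    "   - -   Second level item \r\n"
    "   - Other \r\n"
    "  trailing text to discard\r\n"
)

def clean(text: str) -> list[str]:
    # 1. Drop "- Other" and everything after it (same intent as "- Other.*").
    #    DOTALL lets "." cross line breaks.
    text = re.sub(r"-\s*Other.*", "", text, flags=re.DOTALL)
    # 2. Drop everything before the first 4-digit number. The lookahead
    #    keeps the number itself in the result.
    text = re.sub(r"^.*?(?=\d{4})", "", text, count=1, flags=re.DOTALL)
    # 3. Split on line breaks, collapse runs of whitespace, skip empty lines.
    lines = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            lines.append(line)
    return lines

print(clean(raw))
```

Each item of the returned list would then correspond to one row written to the worksheet.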

Hi @dwalker

Are these the HTS codes you are after?

If so, there are multiple places where you can export the codes directly to an Excel file.

Check this out:
https://circabc.europa.eu/w/browse/23cc0022-41e9-4ec4-b16f-158f645eca46

It might make it easier for you.

This is extremely helpful and gives me another option. For example, say I search for Goods Code 68101190.

The description may contain more than one line with the same number of ‘-’ characters. In all cases, I only need to copy the last line for each number of dashes. For example, in the screenshot below there are two rows with - - -, and I only want to include the final one. How could I check for this and make sure I do not get two rows like this?

Thank you :slightly_smiling_face:
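
One way this check could be sketched in Python, assuming each description row starts with its dash markers (for example `- - - Other`). The row texts and the helper names `dash_depth` and `last_row_per_depth` are hypothetical, made up for illustration.

```python
import re

def dash_depth(row: str) -> int:
    """Count the leading '-' markers, e.g. '- - - Other' has depth 3."""
    prefix = re.match(r"^(?:\s*-)*", row).group(0)
    return prefix.count("-")

def last_row_per_depth(rows: list[str]) -> list[str]:
    """Keep only the most recent row seen at each dash depth."""
    latest: dict[int, str] = {}
    for row in rows:
        depth = dash_depth(row)
        # A shallower row starts a new branch, so forget any deeper rows.
        for deeper in [d for d in latest if d > depth]:
            del latest[deeper]
        latest[depth] = row
    return [latest[d] for d in sorted(latest)]

# Hypothetical rows; the real descriptions come from the nomenclature export.
rows = [
    "Articles of cement",
    "- Tiles and bricks",
    "- - Blocks",
    "- - - Lightweight",
    "- - - Other",
]
print(last_row_per_depth(rows))
```

With the two `- - -` rows above, only the second one survives, because a later row at the same depth overwrites the earlier one.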

Well, I suppose if your goal is to always extract the line before the Other (= thus, always the previous row), you could find out the index of the row that contains the Other and subtract 1 from it :slight_smile:
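
A rough Python sketch of that idea, assuming the rows are already in a list and that the row of interest is the one directly before the first row containing "Other". The function name `row_before_other` and the sample rows are hypothetical.

```python
from typing import Optional

def row_before_other(rows: list[str]) -> Optional[str]:
    """Return the row just before the first row containing 'Other'."""
    for index, row in enumerate(rows):
        if "Other" in row:
            # Subtract 1 from the index; there is no previous row at index 0.
            return rows[index - 1] if index > 0 else None
    return None  # no row contains 'Other'

# Hypothetical rows for illustration.
sample = ["- - Blocks", "- - - Lightweight", "- - - Other"]
print(row_before_other(sample))
```

The function returns `None` when there is no "Other" row or when it is the very first row, so the caller can decide how to handle those cases.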

Thanks @loginerror.

I have tried 10+ different methods to do the following: I need to take the value stored in one Excel file, search for it in the Nomenclature Excel file, and then copy all matching rows back to the original Excel file.

It is explained in the link below. If you have the time, I would appreciate your help.

Thanks!
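
As a rough illustration of the lookup described above, here is a Python sketch that works on rows already read into memory. Everything in it is an assumption: the column name `Goods code`, the placeholder descriptions and the function name `matching_rows` are hypothetical, and reading from and writing to the actual Excel files is left out.

```python
def matching_rows(codes, nomenclature, key="Goods code"):
    """Return every nomenclature row whose key column is in `codes`."""
    # Compare as trimmed strings so 68101190 and " 68101190 " both match.
    wanted = {str(code).strip() for code in codes}
    return [
        row for row in nomenclature
        if str(row.get(key, "")).strip() in wanted
    ]

# Hypothetical data; the descriptions are placeholders, not real entries.
codes_to_find = ["68101190"]
nomenclature_rows = [
    {"Goods code": "68101110", "Description": "placeholder A"},
    {"Goods code": "68101190", "Description": "placeholder B"},
    {"Goods code": 68101190, "Description": "placeholder C"},
]
found = matching_rows(codes_to_find, nomenclature_rows)
print(found)
```

The rows in `found` are the ones that would be copied back to the original workbook.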