Can you set up an exclude in the XML used for screen scraping?

Here is my question: I am using the Extract Metadata below in web scraping, and two identical items are getting picked up by Manufacturer. Is there a way I can set a tag to be a "does not match" option? I tried using

<webctrl tag!='i' />

But this did not work.

<extract>
<row exact='1'>
	<webctrl tag='tr' />
</row>
<column name='Manufacturer' attr='text' exact='1'>
	<webctrl tag='tr' />
	<webctrl tag='td' idx='1' />
	<webctrl tag='p' idx='1' />
</column>
<column name='Part Number' attr='text' exact='1'>
	<webctrl tag='tr' />
	<webctrl tag='td' idx='2' />
</column>
</extract>

@LeftBrainCo
Unfortunately, we don't have more details from your case (screenshots, structures, sample data) that we could use for solution ideas.

Let's assume we cannot exclude (a few days ago I did some longer R&D on the Extract XML and found a lot of restrictions).

With a selector that picks up the tag='i' content as an additional column, the retrieved information could perhaps be used later to subtract it from the other data.
Example:
P = Hello World
I = World
P - I: P.replace(I, "")
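
The subtraction idea above can be sketched in Python as follows. This is only an illustration under assumptions: the function name `subtract_text` is hypothetical, and the inputs stand in for the text scraped from the `p` column and from the extra `i` column.

```python
def subtract_text(p_text: str, i_text: str) -> str:
    """Remove the text scraped from the <i> element from the <p> text."""
    # Nothing to subtract when the <i> column came back empty.
    if not i_text:
        return p_text.strip()
    # Remove every occurrence of the <i> text, then trim leftover whitespace.
    return p_text.replace(i_text, "").strip()


print(subtract_text("Hello World", "World"))  # Hello
```

Note that `replace` removes every occurrence of the substring, so this only behaves well when the `i` text cannot also appear elsewhere inside the `p` text.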


That will not work, unfortunately, as it would cause major issues with certain types of data such as numbers, etc. (example below). Currently, in these situations I run a loop that eliminates every odd row (or whatever pattern is created by the lack of the information I am seeking). Did you find some sort of official documentation on Extract XML? I provided all of the information needed for the question I asked. I am looking to learn more about using Extract XML, not a workaround for a problem that might not exist (missing exclude control).
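
To make the concern concrete, here is a small Python sketch with made-up values (not the poster's data). It shows how plain substring removal also hits digits inside a longer number, and how a pattern-based row filter like the loop described above might look. The helper name `drop_every_other_row` is hypothetical, and the filter only works when duplicates follow a fixed pattern.

```python
# Made-up values: the <p> text holds two numbers, the <i> text repeats the second.
p_text = "1200 12"
i_text = "12"

# Substring removal also strips the "12" inside "1200".
damaged = p_text.replace(i_text, "").strip()
print(damaged)  # 00


def drop_every_other_row(rows, keep_first=True):
    """Keep every second row, starting with the first or the second one."""
    start = 0 if keep_first else 1
    return rows[start::2]


rows = ["Acme", "Acme", "Bolt", "Bolt"]
print(drop_every_other_row(rows))  # ['Acme', 'Bolt']
```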

For example

@LeftBrainCo

Short answer:

  • syntax for denying tags: NO
  • options for reliably extracting the information: YES

Longer answer below:

Did you find some sort of official documentation on Extract XML?

We mostly learn the limits of the ExtractMetadata XML by hitting cases where it fails, rather than from any overview of the supported tags and attributes.

A while back I did some heavy R&D exploring the possibilities and found that it supports only a limited part of what we can otherwise do with selectors (e.g. nav up for anchoring).

So in short:

  • A syntax for denying tags is not available, or at least not known.

I also do not expect regex selectors to be supported, but you can test that on your own.

I am looking to learn more about using Extract XML, not a workaround for a problem that might not exist (missing exclude control).

Since we have to assume that another approach is needed for the information retrieval, we can look at other working options, and there are options available.

I provided all of the information needed for the question I asked.

  • Yes, you did for the YES/NO question: is there a way I can set a tag to be a "does not match" option?
  • But as it is not available and you need to check other options, more details from your case are needed.

That will not work, unfortunately, as it would cause major issues with certain types of data such as numbers, etc. (example below). Currently, in these situations I run a loop that eliminates every odd row (or whatever pattern is created by the lack of the information I am seeking).

Just provide us with details we can rely on for developing a solution, instead of incomplete, partial information on what is not working.

What we did in several such scenarios was:

  • grab the data with data scraping and the configured ExtractMetadata XML, plus one additional column
  • iterate again over the web table with Find Children and collect the inner/outer HTML string
  • parse the string with the XML API/methods and pull out the information as required
  • store the extracted info in the additional column foreseen for it
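
The parsing step in the list above can be sketched in Python with the standard-library `xml.etree.ElementTree`. This is a minimal illustration, not the workflow itself: the row markup is hypothetical, it is assumed to be well-formed XML (real outer HTML may need cleaning first), and the helper names `text_without` and `extract_manufacturer` are made up for this example.

```python
import xml.etree.ElementTree as ET


def text_without(element, excluded_tags):
    """Collect the text of an element, skipping children with excluded tags."""
    parts = []
    if element.text:
        parts.append(element.text)
    for child in element:
        # Skip the content of excluded children such as <i>.
        if child.tag not in excluded_tags:
            parts.append(text_without(child, excluded_tags))
        # Text following a child still belongs to the parent element.
        if child.tail:
            parts.append(child.tail)
    return "".join(parts)


def extract_manufacturer(row_html):
    """Return the text of the first <p> in the first <td>, ignoring <i>."""
    row = ET.fromstring(row_html)
    paragraph = row.find("td/p")
    if paragraph is None:
        return ""
    # Collapse runs of whitespace left behind by the removed element.
    return " ".join(text_without(paragraph, {"i"}).split())


# Hypothetical outer HTML of one scraped row.
sample_row = "<tr><td><p>Acme Corp <i>Acme Corp</i></p></td><td>PN-1001</td></tr>"
print(extract_manufacturer(sample_row))  # Acme Corp
```

Unlike the string subtraction, this works on the element structure, so a value that merely contains the same characters as the `i` text is left untouched.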

This is all I needed thanks!
