I have XML metadata for extracting data from a table. The current XML looks like the following:
<extract>
<column css-selector='.price-wrapper' name='Price' attr='text' />
<column css-selector='.card-reduced-amount' name='ReducedAmount' attr='text' />
<column css-selector='li[data-testid="property-meta-beds"]' name='Beds' attr='text' />
</extract>
The problem I am facing is that some of the “Beds” values end up in the “ReducedAmount” column.
What would be the best way to keep the “Beds” data inside the “Beds” column?
Thank you so much for your help in advance!
Yoichi, April 25, 2023, 12:25am
Hi,
It’s probably difficult for us to tune the ExtractMetadata settings without access to the website.
As a workaround, how about modifying the extracted data afterwards? We can easily move any “ReducedAmount” value that doesn’t start with “$” to the “Beds” column, as in the following example.
Sample20230425-1L.zip (3.0 KB)
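The post-processing idea can be sketched outside of UiPath as well. The snippet below is only a minimal illustration in Python, assuming each extracted row is a dictionary with the keys `Price`, `ReducedAmount` and `Beds`; the function name `move_misplaced_beds` and the sample values are hypothetical and are not taken from the attached workflow.

```python
def move_misplaced_beds(rows):
    """Return new rows where non-"$" ReducedAmount values are moved to Beds."""
    fixed_rows = []
    for row in rows:
        fixed = dict(row)  # copy so the input rows are left untouched
        reduced = (fixed.get("ReducedAmount") or "").strip()
        # Assumption: a genuine price reduction always starts with "$".
        if reduced and not reduced.startswith("$"):
            fixed["Beds"] = reduced
            fixed["ReducedAmount"] = ""
        fixed_rows.append(fixed)
    return fixed_rows


# Hypothetical sample data: the second row has a Beds value in the wrong column.
sample = [
    {"Price": "$500,000", "ReducedAmount": "$10k", "Beds": "3bed"},
    {"Price": "$750,000", "ReducedAmount": "4bed", "Beds": ""},
]

for row in move_misplaced_beds(sample):
    print(row)
```

Note that this overwrites whatever is already in `Beds` for the affected rows, which seems acceptable only if that cell is empty or wrong whenever the shift happens.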
Regards,
@Yoichi Thank you so much for your suggestion. Not sure if this helps, but here is the website I am trying to scrape.
Find Single Family Homes in New York from realtor.com®. Search 7,807 listings and filter New York apartments by price, beds, baths, property type. Choose from all types of apartments, including condos, townhomes, and houses for rent, including...
As you can see, I want to get the price, the price reduction, and the number of beds. The data is structured, so shouldn’t I be able to extract it without modifications?
Thanks again for your help!!
Yoichi, April 25, 2023, 2:22am
Hi,
Can you try the following sample?
<extract>
<row exact='1'>
<webctrl tag='section' idx='2' />
<webctrl tag='div' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='2' />
</row>
<column exact='1' name='Column1' attr='text'>
<webctrl tag='section' idx='2' />
<webctrl tag='div' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' class='price-wrapper' />
</column>
<column exact='1' name='Column2' attr='text'>
<webctrl tag='section' idx='2' />
<webctrl tag='div' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' class='price-wrapper' />
<webctrl tag='div' class='card-reduced-amount' />
</column>
<column exact='1' name='Column3' attr='text'>
<webctrl tag='section' idx='2' />
<webctrl tag='div' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='2' />
<webctrl tag='UL' />
<webctrl tag='LI' />
</column>
</extract>
Sample20230425-1L (2).zip (3.3 KB)
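The idea behind adding a `<row>` element, as far as it can be read from the sample, is that every column is looked up relative to one listing card, so a card without a price reduction yields an empty cell instead of a value from somewhere else. The following Python sketch only illustrates that idea; it does not show how UiPath implements it, and the markup, the `card` class and the helper names `text_of` and `extract_rows` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the second card has no price reduction.
HTML = """
<section>
  <div class="card">
    <div class="price-wrapper">$500,000</div>
    <div class="card-reduced-amount">$10k</div>
    <ul><li>3 bed</li></ul>
  </div>
  <div class="card">
    <div class="price-wrapper">$750,000</div>
    <ul><li>4 bed</li></ul>
  </div>
</section>
"""


def text_of(parent, path):
    """Return the stripped text of the first match under parent, or ""."""
    node = parent.find(path)
    return node.text.strip() if node is not None and node.text else ""


def extract_rows(markup):
    """Build one record per card, resolving every column relative to that card."""
    root = ET.fromstring(markup)
    rows = []
    for card in root.findall("./div[@class='card']"):
        rows.append({
            "Price": text_of(card, "./div[@class='price-wrapper']"),
            "ReducedAmount": text_of(card, "./div[@class='card-reduced-amount']"),
            "Beds": text_of(card, "./ul/li"),
        })
    return rows


for row in extract_rows(HTML):
    print(row)
```

Because each lookup starts at the card, the missing element in the second card leaves `ReducedAmount` empty and `Beds` stays in its own column.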
Regards,
Thank you @Yoichi
Here is what I got.
After trying different methods, what I ended up going with is:
<extract>
<row exact='1'>
<webctrl tag='section' idx='2' />
<webctrl tag='div' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='2' />
</row>
<column exact='1' name='Column0' attr='fulltext'>
<webctrl tag='section' idx='2' />
<webctrl tag='div' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='1' />
</column>
<column exact='1' name='Column1' attr='fulltext'>
<webctrl tag='section' idx='2' />
<webctrl tag='div' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='1' />
</column>
<column exact='1' name='Column2' attr='fulltext'>
<webctrl tag='section' idx='2' />
<webctrl tag='div' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='2' />
<webctrl tag='ul' idx='1' />
<webctrl tag='li' text='bed' idx='1' />
<webctrl tag='span' idx='1' />
</column>
<column exact='1' name='Column3' attr='fulltext'>
<webctrl tag='section' idx='2' />
<webctrl tag='div' />
<webctrl tag='div' idx='1' />
<webctrl tag='div' idx='2' />
<webctrl tag='div' idx='2' />
<webctrl tag='ul' idx='1' />
<webctrl tag='li' text='bath' idx='1' />
<webctrl tag='span' idx='1' />
</column>
</extract>
This gave me the results I wanted. @Yoichi I could not have done it without your help. I truly appreciate it!!!