I want just basic table extraction. The web page has a Next Page button, but I need to scroll for every page. Any idea how to do it?
Hi,
If the Next button can be clicked with the Simulate or ChromiumAPI input method, it works even when the target is outside the visible screen (as long as it is loaded). Can you try this?
Regards,
Hello @Ali_Sir_Aydemir - Is the webpage accessible to the public? Wanted to check the nature of the table as some tables like https://datatables.net/ may not have all the data in the DOM (i.e. rows not visible get removed from the DOM as you scroll past them). Wanted to make sure this is not the issue you’re experiencing.
Thanks!
https://www.medifind.com/conditions/chronic-fatigue-syndrome/1135/doctors? You can check the website. Thank you
Thank you, but it didn't work. What I thought would be a great idea is a Do While (scroll until the Next Page element is visible, then click Next), but I had trouble setting up the workflow.
You don’t have to scroll for Table Extraction to work.
Please share the selector you configured for the Next button. Use the </>
button in the editor to share it. Thanks
<webctrl aaname='Next' parentid='mf-root' tag='SPAN' type='' class='Button_label__4FHaL Button_normal__4rQXo' />
test with:
<webctrl aaname='Next' tag='SPAN' />
It works, but only for the 2 visible elements; there are more than 2, so it doesn't quite work for me.
It works fine for the first page, but it didn't click Next once I was on the second page (with Simulate).
Tell us which search criteria you tested with, so we can try to replicate it.
Do you mean which data I am trying to scrape from that website? You can just try for Doctor Name.
We asked so we could test the same scenario you are testing; with any other criteria we would be looking at a different result set.
However, the Next button does work. The problem is that Extract Data is slower than the page and doesn't grab a complete page, so we end up with fewer rows.
You can check whether the out-of-the-box settings let you sync better.
Otherwise, you can remodel your approach and take more care of synchronization and extraction. Paging can also be done with a URL trick, since only the page number has to be increased until you reach the last page, as sketched below.
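Purely as an illustration of that trick, here is a minimal Python sketch. The page query parameter and the last page number are assumptions; check the real URL in the address bar after clicking Next once.

# Build the paged URLs directly instead of clicking Next.
# The 'page' parameter name and last_page value are assumptions for this sketch.
base_url = 'https://www.medifind.com/conditions/chronic-fatigue-syndrome/1135/doctors'
last_page = 10  # assumed; stop earlier if a page returns no rows

paged_urls = [f'{base_url}?page={n}' for n in range(1, last_page + 1)]

for url in paged_urls:
    print(url)  # each URL can then be opened and extracted one page at a time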
It got all 20 from the first page with no issues for me. I didn’t manually change anything. Just a few clicks in the Table Extraction wizard.
There does seem to be an issue where it doesn’t get the data from page 2, however, and I think that’s because the page doesn’t reload when you click Next - it just refreshes the list portion.
I will create a CSV file and open each link and try to scrape it. Thank you so much.
What I did when I had the same scenario was a couple of simple solutions:
Send the keyboard shortcut End inside the Use Browser activity, which scrolls to the bottom of the page and loads all the necessary data.
For some other cases I also found it useful to use the Inject JS Script activity and check whether the page is loaded with a function that returns 1 or 0, looping in a While until it reports loaded before starting to scrape (a sketch follows below).
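A minimal sketch of such a check, assuming the usual function shape the Inject JS Script activity expects (a function taking the target element and an input string and returning a string); the row selector here is my own assumption and has to be adapted to the actual page markup:

// Returns "1" once the DOM is ready and the result list has rendered, "0" otherwise.
function (element, input) {
    var domReady = document.readyState === "complete";
    // Assumed selector for rendered doctor rows; adjust to the real page.
    var rows = document.querySelectorAll("a[href*='/doctors/']");
    return (domReady && rows.length > 0) ? "1" : "0";
}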
I hope this helps
The issue is that the page doesn’t reload when Next is clicked, which I think is preventing the extract activity from realizing a new page of data is available. So you’ll have to do something like this where you just have Extract do one page, merge it into a final datatable, then check if next exists and click it…
So, I created a script to generate the links, then opened every page, scrolled down to the end of the page, waited 3 seconds, and extracted the data. The column names were repeated for every page and there were NBSP characters, so to deal with that:
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv('doctor_info.csv')
# Remove unwanted characters from the 'Adress_Line_1' and 'Adress_Line_2' columns
df['Adress_Line_1'] = df['Adress_Line_1'].str.replace('\xa0', '')
df['Adress_Line_2'] = df['Adress_Line_2'].str.replace('\xa0', '')
# Drop rows that merely repeat the header values (one header row was scraped per page)
values_to_remove = ['Name', 'Url', 'Adress_Line_1', 'Adress_Line_2']
df = df[~df.isin(values_to_remove)].dropna()
# Save the modified DataFrame to a new CSV file
df.to_csv('filtered_doctor_info.csv', index=False)
edit: thank you!
Totally agree with your solution, but what works for me is to save all the HTML pages and deal with them in Python :D. But for this scenario, I did something similar. Thank you!
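In case it helps anyone, a minimal sketch of the "deal with the saved pages in Python" idea. It assumes the saved pages contain an HTML <table> and are named page_*.html, both of which are assumptions; a card-style layout would need an HTML parser such as BeautifulSoup instead.

import glob
import pandas as pd

# Stack the tables extracted from every saved page into one DataFrame.
# pandas.read_html only works when the saved pages contain <table> elements.
frames = []
for path in sorted(glob.glob('page_*.html')):
    frames.extend(pd.read_html(path))

all_doctors = pd.concat(frames, ignore_index=True)
all_doctors.to_csv('doctor_info.csv', index=False)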