Search for keywords across multiple websites

Challenges:

  1. Search for the keywords across billions of websites and, if found, return the website details. How can this be automated?
  2. We tried using selectors for 10 websites and the code is getting lengthy, but our requirement is to work on 1000 websites.
  3. We tried sending hotkeys ("Send Hotkey") to find the keyword, but we were unable to get the exact result.

Additional Questions:

  1. How to construct the code for billions of websites, given that each website has its own functionality.
  2. When the bot tries to open millions of browser windows, will it crash UiPath or the system?
  3. How to handle unexpected blockers.
  4. How to handle websites that demand login/sign-up, since this is unpredictable.
  5. How to check whether a website is legitimate.

In this scenario, is it possible to find the keyword without using selectors? If yes, please guide us. Also let us know if it's feasible using UiPath Studio.

@Vaishnavi_RP,

I must say this is not a use case for RPA. It is better to use a web crawler for this, which you can get easily.

Refer to this Python-based approach.

Thank you for your prompt response.

  1. Extract URLs, navigate to each URL, and search for a keyword – Is it possible to handle multiple websites using UiPath?

  2. Is there a way to handle this without using selectors?

Is it possible to implement this use case using UiPath?

@Vaishnavi_RP

Addressing Your Challenges and Questions:

1. Searching Keywords Across Billions of Websites

  • Web Scraping with Search Engines: Instead of scraping billions of sites directly, leverage search engines like Google/Bing using their APIs (Google Custom Search API, Bing Search API) to retrieve relevant website links.
  • Parallel Processing: Use multiple bots to handle different website groups, distributing the load.
  • Cloud-Based Crawlers: Consider integrating services like Scrapy Cloud, Selenium Grid, or Google Colab for large-scale web crawling.
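The search-API route can be sketched in a few lines. The snippet below parses the kind of JSON a search API returns (the field names follow the Google Custom Search JSON API's `items`/`title`/`link` shape, but verify against the current API docs; the sample response itself is made up for illustration):

```python
import json

# Made-up sample mimicking a search API's JSON response shape.
sample_response = json.dumps({
    "items": [
        {"title": "Example Site", "link": "https://example.com",
         "snippet": "A page mentioning the keyword..."},
        {"title": "Another Site", "link": "https://example.org",
         "snippet": "More text..."},
    ]
})

def extract_links(response_text):
    """Pull (title, link) pairs out of a search API response."""
    data = json.loads(response_text)
    return [(item["title"], item["link"]) for item in data.get("items", [])]

links = extract_links(sample_response)
print(links[0][1])  # https://example.com
```

In a live run, `response_text` would come from an authenticated HTTP request to the search API instead of the hardcoded sample.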

2. Handling 1000+ Websites with Selectors

  • Use Dynamic Selectors: Instead of hardcoding selectors, create adaptable selectors based on the structure of different websites.
  • Leverage Regex/XPath: Many websites follow common patterns. XPath and regular expressions can extract required data dynamically.
  • Use AI-Based Parsing: ML-based solutions like UiPath Document Understanding or OpenAI API can help generalize extraction logic.
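As a minimal sketch of the regex idea: most pages expose a `<title>` and a description `<meta>` tag, so one set of patterns can extract those fields from any site without per-site selectors (the sample HTML here is invented for illustration):

```python
import re

html = """<html><head><title>Acme Widgets</title>
<meta name="description" content="Widgets and gadgets for automation"></head>
<body><h1>Welcome</h1></body></html>"""

def extract_common_fields(page):
    # Patterns that hold across most sites, so no per-site selectors needed.
    title = re.search(r"<title>(.*?)</title>", page, re.I | re.S)
    desc = re.search(
        r'<meta\s+name=["\']description["\']\s+content=["\'](.*?)["\']',
        page, re.I)
    return {
        "title": title.group(1).strip() if title else None,
        "description": desc.group(1) if desc else None,
    }

print(extract_common_fields(html)["title"])  # Acme Widgets
```

For anything more complex than title/meta fields, a real HTML parser (or XPath via lxml) is more robust than regex.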

3. Parsing HOT Keys to Find Keywords

  • OCR-Based Search: If UI elements are inaccessible, use OCR (UiPath’s Google Tesseract, Microsoft OCR) to extract visible text and search for keywords.
  • Full-Page Data Extraction: Extract page source using UiPath HTTP Requests, then process text-based searches.
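A rough sketch of the full-page approach: download the raw source (no browser, no hotkeys) and search the text after stripping markup. `fetch_page` uses only the standard library; the sample HTML at the bottom stands in for a live page:

```python
import re
import urllib.request

def fetch_page(url, timeout=10):
    """Download the raw HTML of a page (no browser needed)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

def keyword_in_source(html, keyword):
    """Case-insensitive search of the page source, ignoring markup tags."""
    text = re.sub(r"<[^>]+>", " ", html)  # strip tags
    return keyword.lower() in text.lower()

# Works on any HTML string; fetch_page(url) would supply it for a live site.
sample = "<html><body><p>UiPath makes automation easy.</p></body></html>"
print(keyword_in_source(sample, "automation"))  # True
```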

4. Constructing Code for Billions of Websites

  • Scalable Architecture: Use a database to store already processed sites, avoiding redundant scrapes.
  • Categorization by Website Type: Group similar websites and use template-based scraping.
  • Hybrid Approach: Combine direct scraping for structured sites and API-based search for unstructured data.
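The "database of already processed sites" point can be sketched with SQLite. This uses an in-memory database for illustration; a file path (or an Orchestrator queue) would persist state between bot runs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # swap for a file path to persist
conn.execute("CREATE TABLE IF NOT EXISTS processed (url TEXT PRIMARY KEY)")

def already_processed(url):
    row = conn.execute(
        "SELECT 1 FROM processed WHERE url = ?", (url,)).fetchone()
    return row is not None

def mark_processed(url):
    # INSERT OR IGNORE makes repeated marks harmless.
    conn.execute("INSERT OR IGNORE INTO processed (url) VALUES (?)", (url,))
    conn.commit()

mark_processed("https://example.com")
print(already_processed("https://example.com"))  # True
print(already_processed("https://example.org"))  # False
```

The `PRIMARY KEY` on `url` is what enforces deduplication, so each site is scraped at most once no matter how many bots report back.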

5. System Crashing Due to Millions of Open Browsers

  • Use Headless Browsing: UiPath supports headless automation via Chrome/Firefox headless modes.
  • Limit Simultaneous Instances: Manage resource consumption by limiting concurrent browser instances.
  • Use Cloud Resources: Deploy bots on cloud VMs to distribute workload.
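The "limit simultaneous instances" idea, sketched outside UiPath with a thread pool: `max_workers` caps how many sites are processed at once, keeping memory bounded no matter how long the URL list is. The URLs and the check function are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

urls = [f"https://site{i}.example" for i in range(20)]

def check_site(url):
    # Placeholder for the real fetch + keyword search.
    return (url, "found" if "site1" in url else "not-found")

# max_workers caps concurrency, like limiting open browser instances.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(check_site, urls))

print(len(results))  # 20
```

In UiPath itself the analogue is capping the number of concurrent jobs/robots in Orchestrator rather than launching one browser per URL.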

6. Handling Unexpected Blockers

  • Implement Retry Logic: Use “Retry Scope” in UiPath to handle temporary failures.
  • Use Proxy & Rotation: Many websites block bots; using rotating proxies (BrightData, ScraperAPI) can bypass restrictions.
  • Captcha Handling: Services like 2Captcha or Anti-Captcha can solve captchas automatically.
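The Retry Scope pattern translates directly to code: retry a failing action a fixed number of times with exponential backoff, and re-raise only when all attempts are exhausted. A minimal sketch (the `flaky` action simulates a temporary blocker):

```python
import time

def with_retries(action, attempts=3, delay=1.0, backoff=2.0):
    """Mimics UiPath's Retry Scope: retry on exception, backing off each time."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the real error
            time.sleep(delay)
            delay *= backoff

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary blocker")
    return "ok"

print(with_retries(flaky, attempts=3, delay=0.01))  # ok
```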

7. Handling Login/Signup

  • Credential Management: Store known credentials in UiPath Orchestrator’s Asset Vault.
  • AI-Based Form Filling: Use ML models to predict login/signup requirements dynamically.
  • Bypass Authentication: If allowed, use API endpoints instead of UI-based logins.

8. Checking Website Legitimacy

  • Domain Reputation Services: Use services like Google Safe Browsing API, VirusTotal, or Whois Lookup to verify sites.
  • Blacklists & Whitelists: Maintain a database of trusted domains and frequently encountered spam sites.
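A blacklist/whitelist check reduces to normalizing the domain and looking it up in a set; anything unknown can then be escalated to a reputation API. The domain lists here are invented for illustration:

```python
from urllib.parse import urlparse

TRUSTED = {"example.com", "uipath.com"}   # illustrative whitelist
BLOCKED = {"spam.example"}                # illustrative blacklist

def classify_domain(url):
    host = urlparse(url).hostname or ""
    host = host.removeprefix("www.")  # so www.example.com matches example.com
    if host in BLOCKED:
        return "blocked"
    if host in TRUSTED:
        return "trusted"
    return "unknown"  # candidate for a reputation API lookup

print(classify_domain("https://www.example.com/page"))  # trusted
```

Only the "unknown" bucket needs a paid/rate-limited reputation lookup, which keeps API usage small even across millions of URLs.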

9. Finding Keywords Without Using Selectors

  • HTML Parsing with Regex/XPath: Extract text content from raw HTML (HTTP Request + Regex).
  • OCR-Based Keyword Search: For graphical interfaces, extract and search text with OCR.
  • Browser Extensions: Develop a Chrome extension that sends visible page text to UiPath.
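The selector-free idea can be sketched with the standard-library HTML parser: collect only the visible text (skipping `<script>`/`<style>` content) and search it for the keyword. No selectors, OCR, or browser are involved; the sample page is invented:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects visible text, skipping <script>/<style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def visible_text(html):
    p = TextCollector()
    p.feed(html)
    return " ".join(" ".join(p.parts).split())

page = """<html><head><style>body{color:red}</style></head>
<body><h1>Robotic Process Automation</h1><script>var x=1;</script></body></html>"""
print("Automation" in visible_text(page))  # True
```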

Please let me know if you need more details.