Repeated Data Tables Scraped From A Multi-Page PDF

datatable
pdf
datascraping

#1

Hi everyone,

I have been learning UiPath and developing a proof of concept to measure its feasibility for a particular use case.

The program does the following successfully:

  • Starts an Adobe Reader process with the open file parameter pointing to a PDF
  • Attaches the Adobe Reader window
  • Gets the number of pages for the document by looking at the 1/10 pages element within the program
  • Builds a data table to push data into
  • While a counter is less than the page length
    - Send the page number (counter) text to the page textbox to ensure that it's in view and can be scraped
    - Extract a structured data table dynamically using the counter page number

FORMAT:

Extract Metadata

<extract-table get_columns_name='1' get_empty_columns='1' />
Target Selector
<wnd cls='AVL_AVView' title='AVPageView' /><ctrl idx='" & counter.ToString & "' role='table' />
 - Initialize row and column indexes and variables
  • For each row in the extracted data table from that page
  • For each item in that row
  • If that cell matches some conditions
  • Either save that cell's value or the value of another cell, located by row and column indexes relative to its position
  • Push a data row to the table of new values
  • Eventually write a CSV from the new data table
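
The steps above can be sketched in Python to make the control flow explicit. This is only a rough illustration, not UiPath code: `extract_page_table` is a hypothetical stand-in for the structured-data extraction step, `build_selector` simply mirrors the selector string shown above, and both the 1-based page counter and the rule "keep the cell to the right of a matching cell" are assumptions made for the example.

```python
import csv
import io


def build_selector(counter: int) -> str:
    # Mirrors the dynamic target selector: the page counter becomes the idx attribute.
    return (
        "<wnd cls='AVL_AVView' title='AVPageView' />"
        f"<ctrl idx='{counter}' role='table' />"
    )


def collect_rows(page_count, extract_page_table, is_match):
    """Walk every page, scan each cell, and keep a matching cell plus its right neighbour.

    extract_page_table is a hypothetical callable: selector string -> list of rows.
    is_match is a hypothetical predicate standing in for "matches some conditions".
    """
    results = []
    for counter in range(1, page_count + 1):  # assumption: pages are numbered from 1
        table = extract_page_table(build_selector(counter))
        for row in table:
            for col_index, cell in enumerate(row):
                # The neighbour is addressed relative to the matching cell.
                if is_match(cell) and col_index + 1 < len(row):
                    results.append([cell, row[col_index + 1]])
    return results


def to_csv(rows, header):
    """Serialize the collected rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

Passing the extraction step in as a callable keeps the loop testable without a PDF viewer: a dictionary lookup keyed by selector can play the role of the scraper.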

The problem is the following:

The data coming out into CSV format is perfect for the first 3 pages.

|id_no|inventory|
|1|1’s inventory correct|
|2|2’s inventory correct|
|3|3’s inventory correct|
|4|4’s inventory correct|
|4|4’s inventory correct|
|4|4’s inventory correct|
|4|4’s inventory correct|
|4|4’s inventory correct|
|4|4’s inventory correct|

However, it does not seem to scrape any new pages after page 4.

Something interesting that could be related to the source of the problem: after Adobe Acrobat initializes and senses UiPath reading elements on the screen, it prepares the loaded document for screen reading. (I haven't been able to highlight elements within the PDF without this happening.)

I notice that it always starts loading on page 4, as if the first three pages are already in memory. Just a thought. I have cycled through all the pages and tried a delay to ensure those pages arrive in memory, but the problem persists.

Can’t seem to figure out why the pages after 4 are returning the same data table as 4.
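
One way to confirm the symptom before writing the CSV is to compare each page's extracted table with the one from the previous page. The following is a minimal sketch, assuming each page's table has already been collected as a list of rows; `find_repeated_pages` is a hypothetical helper, not part of UiPath.

```python
def find_repeated_pages(tables):
    """Return the 1-based page numbers whose table equals the previous page's table.

    tables is a list with one entry per page, each entry being a list of rows.
    """
    repeated = []
    for index in range(1, len(tables)):
        if tables[index] == tables[index - 1]:
            repeated.append(index + 1)  # convert list index to page number
    return repeated
```

Two genuinely identical consecutive pages would also be flagged, so the result is a hint about where the scraper went stale rather than proof.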


#2

Scarily similar issue here: Duplicate scrap data search results

Their scraping also started repeating on the 4th iteration. I'm looking at the solution provided there right now. If anyone can help, that would be awesome :)


#3

Haha, after studying the scraper action itself, I saw the useful configuration values Max data and table delay. I set both to 1000, and now it is working perfectly, save for one value, which may need more logic applied to it.
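
For anyone wondering why a delay would help, the general idea can be sketched in Python: give the viewer time to render the new page, and re-read until the result differs from the previous page's table. This is an illustration under assumptions, not how UiPath implements its settings. `extract_when_changed`, its parameters, and the one-second default are all hypothetical.

```python
import time


def extract_when_changed(extract, previous, delay_s=1.0, max_attempts=5, sleep=time.sleep):
    """Call extract() until it returns something other than `previous`.

    extract is a hypothetical zero-argument callable that scrapes the current page.
    Returns the last result, which may still equal `previous` if attempts run out.
    """
    table = extract()
    for _ in range(max_attempts - 1):
        if table != previous:
            break
        sleep(delay_s)  # wait for the viewer to finish loading the page
        table = extract()
    return table
```

The `sleep` parameter is injectable so the retry logic can be exercised without real waiting.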


#4

@SarimQ, would you mind sharing that workflow? I’m looking for something very similar and am stuck at a similar stage.

Many thanks,
Stefan