Repeated Data Tables Scraped From A Multi-Page PDF

SarimQ · March 27, 2018, 3:37pm

Hi everyone,

I have been learning UiPath and building a proof of concept to assess its feasibility for a particular use case.

The program does the following successfully:

Starts an Adobe Reader process with the open-file parameter pointing to a PDF
Attaches to the Adobe Reader window
Gets the number of pages in the document by reading the page-count element (e.g. 1/10) within the program
Builds a data table to push data into
While a counter is less than the page count
- Sends the page number (counter) as text to the page textbox to ensure that page is in view and can be scraped
- Extracts a structured data table dynamically using the counter as the page number

FORMAT:

Extract Metadata

<extract-table get_columns_name='1' get_empty_columns='1' />
Target Selector
<wnd cls='AVL_AVView' title='AVPageView' /><ctrl idx='" & counter.ToString & "' role='table' />
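For readers who want to see the dynamic selector in isolation: the workflow splices the loop counter into the `idx` attribute through string concatenation. Below is a minimal sketch of the same idea in Python rather than a UiPath expression. The function name `build_table_selector` is made up for illustration; only the selector text itself comes from the post.

```python
def build_table_selector(page_index: int) -> str:
    """Return the target selector with the page counter spliced into idx.

    Hypothetical helper: mirrors the concatenation
    "...<ctrl idx='" & counter.ToString & "' role='table' />" from the post.
    """
    return (
        "<wnd cls='AVL_AVView' title='AVPageView' />"
        f"<ctrl idx='{page_index}' role='table' />"
    )


# One selector per loop iteration, differing only in idx.
for counter in (1, 2, 3):
    print(build_table_selector(counter))
```

Each iteration yields a selector that differs only in `idx`, so if two iterations return identical tables, the selector text alone is unlikely to be the cause unless the counter is not advancing.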

 - Initializes row and column indexes and variables

For each row in the data table extracted from that page
For each item in that row
If that cell matches some conditions
Either save that cell's value, or the value of another cell located by row and column indexes relative to it
Push a data row to the table of new values
Eventually write a CSV from the new data table
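The row and cell loop above can be sketched roughly as follows. This is Python, not the actual UiPath workflow, and several things are assumptions: the matching condition (`str.isdigit`), the relative offsets, and the helper names `collect_rows` and `to_csv` are all hypothetical stand-ins for whatever the real workflow uses.

```python
import csv
import io


def collect_rows(page_table, is_key_cell, row_offset=0, col_offset=1):
    """Scan every cell; when one matches, pair it with a cell at a relative offset.

    page_table  : list of rows, each a list of cell strings (one scraped page)
    is_key_cell : predicate standing in for "that cell matches some conditions"
    row_offset, col_offset : assumed relative position of the value cell
    """
    out = []
    for r, row in enumerate(page_table):
        for c, cell in enumerate(row):
            if not is_key_cell(cell):
                continue
            tr, tc = r + row_offset, c + col_offset
            # Skip targets that fall outside the scraped table.
            if 0 <= tr < len(page_table) and 0 <= tc < len(page_table[tr]):
                out.append((cell, page_table[tr][tc]))
    return out


def to_csv(rows, header=("id_no", "inventory")):
    """Serialize the collected rows to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


page = [["1", "widgets"], ["note", "x"], ["2", "bolts"]]
rows = collect_rows(page, str.isdigit)
print(to_csv(rows))
```

Keeping the per-page collection separate from the CSV step makes it easier to inspect what each page actually produced before everything is merged.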

The problem is the following:

The data written to the CSV is correct for the first 3 pages.

|id_no|inventory|
|1|1’s inventory correct|
|2|2’s inventory correct|
|3|3’s inventory correct|
|4|4’s inventory correct|
|4|4’s inventory correct|
|4|4’s inventory correct|
|4|4’s inventory correct|
|4|4’s inventory correct|
|4|4’s inventory correct|

However, it does not seem to scrape any new pages after page 4.

Something interesting that could be related to the source of the problem: after Adobe Acrobat initializes and senses UiPath reading elements on the screen, it prepares the loaded document for screen reading. (I haven't been able to highlight elements within the PDF without this happening.)

I notice that it always starts loading on page 4, as if the first three pages are already in memory. Just a thought. I have cycled through all the pages and tried a delay to ensure those pages arrive in memory.

I can’t seem to figure out why the pages after 4 return the same data table as page 4.
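One way to confirm exactly where the repetition starts is to fingerprint each page's extracted table and compare it with the previous page's. The sketch below is Python, not part of the workflow, and assumes each page's table is available as a list of rows; the names `table_fingerprint` and `find_repeated_pages` are hypothetical.

```python
import hashlib


def table_fingerprint(table):
    """Hash a table's cell contents so two pages can be compared cheaply."""
    h = hashlib.sha256()
    for row in table:
        # Unit and record separators keep cell and row boundaries unambiguous.
        h.update("\x1f".join(str(cell) for cell in row).encode("utf-8"))
        h.update(b"\x1e")
    return h.hexdigest()


def find_repeated_pages(tables):
    """Return 1-based page numbers whose table equals the previous page's."""
    repeated = []
    previous = None
    for page_no, table in enumerate(tables, start=1):
        fingerprint = table_fingerprint(table)
        if fingerprint == previous:
            repeated.append(page_no)
        previous = fingerprint
    return repeated


# Pages 5 and 6 repeat page 4, as in the CSV output shown earlier.
tables = [[["1", "a"]], [["2", "b"]], [["3", "c"]],
          [["4", "d"]], [["4", "d"]], [["4", "d"]]]
print(find_repeated_pages(tables))
```

A check like this flags the first page at which the scrape goes stale, which helps tell a page that was not yet loaded apart from a bug in the later filtering logic.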

SarimQ · March 27, 2018, 3:46pm

Scarily similar issue here: Duplicate scrap data search results

Their scraping also started repeating on the 4th iteration. I'm looking at the solution provided there right now; if anyone can help, that would be awesome.

SarimQ · March 27, 2018, 4:05pm

Haha, after studying the scraper action itself I saw the useful configuration values of Max data and table delay. I set both to 1000 and now it is working perfectly, save for one value which I may need to apply more logic to.

StefanBebie · July 6, 2018, 2:21pm

@SarimQ, would you mind sharing the workflow you built? I’m looking for something very similar and am stumbling at a similar stage.

Many thanks,
Stefan

sgwong · October 30, 2019, 8:59am

Could you share with us how you got the solution working?
