Extract table data from multiple pages of a PDF

pdf
scraping
studio

#1

Hi,

I am trying to extract data from a table that spans multiple pages of a PDF file. I can’t use Read PDF Text because the table has empty cells, and the output ends up with misaligned columns.

Any pointers?

Cheers,
Aman


#2

In 2016.2 you can use Extract Structured Data from recording to extract data from PDF tables. Have you tried that?


#3

Yes, I have tried that, and it can extract the data from one page. But it is unable to extract data from the second page onwards. On the web there is an option to select the next page, but that option is missing when reading a PDF.


#4

The tables will have different idx values in their selectors. Increment the idx in a loop for as long as the selector exists, and extract each result into a data table.


#5

This works. Thanks for the help.


#6

I don’t understand how to do that… can you please explain?


#7

Can you please explain with a workflow?


#8

Could you please explain with an example?


#9

Get the selector for one of the tables and check its idx value, then get the selector for the next table and check its idx value. Comparing the two shows which idx value changes from table to table, so you can build a selector with a variable idx and use it to fetch each table’s value.
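
To make the comparison concrete, here is a rough sketch in Python (outside UiPath, purely for illustration). The two selector fragments are made-up examples, and `ctrl_idx` is a hypothetical helper, not a UiPath function. It only shows the idea of reading the idx out of two neighbouring selectors and seeing how far apart they are.

```python
import re


def ctrl_idx(selector: str) -> int:
    """Return the idx attribute of the <ctrl> tag in a selector string."""
    match = re.search(r"<ctrl\b[^>]*\bidx='(\d+)'", selector)
    if match is None:
        raise ValueError("no <ctrl idx='...'> found in selector")
    return int(match.group(1))


# Hypothetical selectors copied from two consecutive tables/rows.
first = "<ctrl idx='55' role='row' />"
second = "<ctrl idx='56' role='row' />"

# The difference tells you how much to increment the idx per iteration.
step = ctrl_idx(second) - ctrl_idx(first)
print(step)
```

If the difference is constant across a few pairs, that is the increment to use in the loop.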


#10

Thanks a lot.


#11

Hi,
I am trying to read tabular data from a native PDF file in which the table spans multiple pages.
I tried Read PDF Text, but the resulting string is very long and difficult to parse. I am able to use data scraping for each page by changing the index in the selector, but the table structure is not preserved. Can you please take a look and suggest a solution?

Acrobat Document.pdf (521.0 KB)


#12

Below is the selector whose ctrl idx value needs to be incremented for as long as the selector exists. Replace the 55 with a variable and increment it on each iteration.

<wnd app='acrord32.exe' cls='AcrobatSDIWindow' title='Acrobat Document.pdf - Adobe Reader' />
<wnd cls='AVL_AVView' title='AVPageView' />
<ctrl idx='55' role='row' />

With a counter variable (CounterValue) in place of 55, the selector expression becomes the following. If CounterValue is an Integer, convert it with .ToString before concatenating:

"<wnd app='acrord32.exe' cls='AcrobatSDIWindow' title='Acrobat Document.pdf - Adobe Reader' />
    <wnd cls='AVL_AVView' title='AVPageView' />
    <ctrl idx='" + CounterValue.ToString + "' role='row' />"
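
As an illustration only, the same string building can be sketched in Python. `ROW_SELECTOR_TEMPLATE` and `build_row_selector` are hypothetical names, not part of UiPath; the selector text is the one from this post.

```python
# Selector from this post, with the idx left as a placeholder.
ROW_SELECTOR_TEMPLATE = (
    "<wnd app='acrord32.exe' cls='AcrobatSDIWindow' "
    "title='Acrobat Document.pdf - Adobe Reader' />"
    "<wnd cls='AVL_AVView' title='AVPageView' />"
    "<ctrl idx='{idx}' role='row' />"
)


def build_row_selector(idx: int) -> str:
    """Return the row selector with the given idx filled in."""
    return ROW_SELECTOR_TEMPLATE.format(idx=idx)


print(build_row_selector(55))
```

Calling it with 55 reproduces the static selector, and any other counter value gives the selector for a different row.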

#13

Hi Vinay,
I am trying this, but it is not working. Can you please share the .xaml?
Which field are you getting with that selector?


#14

For example, declare an Integer variable for the counter (counter) and a second variable of type String (rowSelector) to hold the selector shown above. Then use rowSelector as the value of the Selector property.

Initialize counter
Start loop
Assign rowSelector (this rebuilds rowSelector on each iteration, so you get a new selector for each row found in the PDF):

"<wnd app='acrord32.exe' cls='AcrobatSDIWindow' title='*.pdf - Adobe Reader' />
<wnd cls='AVL_AVView' title='AVPageView' />
<ctrl idx='" + counter.ToString + "' role='row' />"

Check whether the selector exists
If it exists, fetch the value using rowSelector
Increment counter
End loop
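
The steps above can be sketched roughly as follows. This is Python rather than a UiPath workflow, and it is only meant to show the control flow. `element_exists` and `get_text` are hypothetical stand-ins for whatever activities you use to check the selector and read the row, `max_rows` is a safety cap I added, and the starting idx of 1 is an assumption (start from whatever idx your first row has).

```python
from typing import Callable, List


def make_selector(idx: int) -> str:
    """Build the row selector for a given counter value."""
    return (
        "<wnd app='acrord32.exe' cls='AcrobatSDIWindow' title='*.pdf - Adobe Reader' />"
        "<wnd cls='AVL_AVView' title='AVPageView' />"
        f"<ctrl idx='{idx}' role='row' />"
    )


def collect_rows(
    element_exists: Callable[[str], bool],
    get_text: Callable[[str], str],
    start_idx: int = 1,
    max_rows: int = 10000,
) -> List[str]:
    """Read rows one by one until the selector no longer exists."""
    rows: List[str] = []
    counter = start_idx                      # Initialize counter
    while counter < start_idx + max_rows:    # Start loop (with a safety cap)
        row_selector = make_selector(counter)
        if not element_exists(row_selector):
            break                            # No such row: stop
        rows.append(get_text(row_selector))  # Fetch value using the selector
        counter += 1                         # Increment counter
    return rows


# Stand-in for the PDF viewer: three rows with idx 1..3.
fake_pdf = {make_selector(i): f"row {i}" for i in range(1, 4)}
rows = collect_rows(fake_pdf.__contains__, fake_pdf.__getitem__)
print(rows)
```

The loop stops at the first idx that does not exist, so it assumes the row idx values have no gaps.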


#15

I have the same question. I tried the advice in this thread, but it did not work for me.
How did it turn out for you? Can you share the result?
Thanks very much.

REGARDS


#16

Hi Aman,

I am trying to extract tables from a PDF, but I am not able to do so. Could you please help me with this?
I have tried both the screen scraping and the data scraping methods.


#17

Hi Vinay,

I tried your approach… the workflow runs, but no data is written to the CSV file. For a single page, it loads the data.

I am a new user, so I am not allowed to upload the .xaml file.

Kindly suggest a solution.

Regards,
Akhilesh