Extract a table from a PDF using Data Scraping

Hi there,

I am trying to extract a table from a PDF (screenshot below):

I used the Data Scraping extraction wizard. However, I am unable to extract the column names correctly. This is the output:

[Column-0,Participants,Ballots Completed,Ballots Incomplete/ Terminated,Results,Column-5
Blind

,5

,1

,4

,"34.5%, n=1

","1199 sec, n=1

"
Low Vision

,5

,2

,3

,"98.3% n=2

(97.7%, n=3)

","1716 sec, n=3

(1934 sec, n=2)

"
Dexterity

,5

,4

,1

,"98.3%, n=4

","1672.1 sec, n=4

"
Mobility

,3

,3

,0

,"95.4%, n=3

","1416 sec, n=3

"
]

Question: How can I improve the table extraction to get the correct column names?

Cheers

Below is a screenshot of the extraction wizard:

Here's what my code looks like:
image

The PDF file used for this exercise can be downloaded from https://www.w3.org/WAI/WCAG20/Techniques/working-examples/PDF20/table.pdf
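One possible workaround, if the wizard keeps returning generic names, is to tidy the table after extraction. The Python sketch below is only an illustration under assumptions: it assumes the scraped rows are available as lists of cell strings, that cells carry stray line breaks as the output above suggests, and that "Column-0", "Results" and "Column-5" stand for a category, an accuracy and a time column. The replacement names in `RENAMES` and the helper names `clean_cell` and `tidy_table` are made up for this example and are not part of any UiPath API.

```python
def clean_cell(cell):
    """Collapse line breaks and repeated spaces inside one cell."""
    return " ".join(cell.split())


def tidy_table(header, rows, renames):
    """Return a list of dicts with cleaned cells and renamed columns."""
    # Swap generic or misleading header names for the supplied ones.
    fixed_header = [renames.get(clean_cell(h), clean_cell(h)) for h in header]
    return [dict(zip(fixed_header, (clean_cell(c) for c in row))) for row in rows]


# Header exactly as the wizard returned it.
HEADER = ["Column-0", "Participants", "Ballots Completed",
          "Ballots Incomplete/ Terminated", "Results", "Column-5"]

# Assumed replacement names (hypothetical, guessed from the cell values).
RENAMES = {"Column-0": "Category", "Results": "Accuracy",
           "Column-5": "Time to complete"}

# Two rows from the output, with the stray line breaks kept in.
ROWS = [
    ["Blind\n\n", "5\n\n", "1\n\n", "4\n\n",
     "34.5%, n=1\n\n", "1199 sec, n=1\n\n"],
    ["Low Vision\n\n", "5\n\n", "2\n\n", "3\n\n",
     "98.3% n=2\n\n(97.7%, n=3)\n\n", "1716 sec, n=3\n\n(1934 sec, n=2)\n\n"],
]

table = tidy_table(HEADER, ROWS, RENAMES)
for record in table:
    print(record)
```

This does not fix the wizard itself. It only renames the columns and removes the embedded line breaks once the data is out.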

@husain.shah Are you using the Silverlight extension?

No, I am not using the Silverlight extension.

@husain.shah Is the PDF file stored on your system?

Yes. I downloaded the PDF from the source above and am working on a local copy.

@husain.shah Then have you tried the PDFtoExcel activity?

The PDFtoExcel activity uses the SautinSoft API, whose trial version converts only 3 pages of a PDF and is meant for evaluation purposes only. I am interested in a free solution.

Hi Hussain,
Were you able to find a solution?

Hey, did you find any free and viable solution for extracting a data table from a PDF?

Hello shero,
Yes, I tried the epsilon package for this. It is not the best solution, but it is definitely worth a try and it is free of cost.

Link below -

It did not work accurately for me.
I want to extract tabular data row-wise based on some regex.
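For the row-wise regex idea, here is a rough Python sketch. It assumes the PDF text has already been read out as plain text with one table row per line, which may not match what a given PDF reader actually returns. The sample text, the pattern `ROW_PATTERN` and the function `extract_rows` are hypothetical and would need adjusting to the real layout.

```python
import re

# One row: a category label, three integer counts, then an accuracy and a time.
ROW_PATTERN = re.compile(
    r"^(?P<category>[A-Za-z ]+?)\s+"
    r"(?P<participants>\d+)\s+"
    r"(?P<completed>\d+)\s+"
    r"(?P<incomplete>\d+)\s+"
    r"(?P<accuracy>\d+(?:\.\d+)?%),?\s*n=(?P<accuracy_n>\d+)\s+"
    r"(?P<seconds>\d+(?:\.\d+)?)\s*sec,?\s*n=(?P<time_n>\d+)\s*$"
)


def extract_rows(text):
    """Return one dict per line that matches the row pattern."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:  # header lines and other text are skipped
            rows.append(match.groupdict())
    return rows


# Made-up plain-text layout, using values from the output posted above.
SAMPLE_TEXT = """Category Participants Completed Incomplete Accuracy Time
Blind 5 1 4 34.5%, n=1 1199 sec, n=1
Dexterity 5 4 1 98.3%, n=4 1672.1 sec, n=4
Mobility 3 3 0 95.4%, n=3 1416 sec, n=3
"""

rows = extract_rows(SAMPLE_TEXT)
for row in rows:
    print(row["category"], row["accuracy"], row["seconds"])
```

Rows with an extra parenthesised value, such as the Low Vision row, would not match this pattern as written and would need a looser expression.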