Extract Table data from PDF

lissynikkytha · September 6, 2017, 7:05am

The input to my workflow is a folder of PDF documents that do not have a standard format. I need to extract order details, which are in tabular format in the PDF. Apart from the tabular data, the PDF also contains paragraphs and customer information. I can identify the line where the tabular data starts by splitting the PDF content on Environment.NewLine, reading it line by line, and using string functions.
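
A minimal sketch of the line-by-line approach described above, written in Python rather than as a UiPath workflow. The header keywords, the `find_table_start` helper and the sample text are assumptions for illustration only; each real template would need its own keyword list.

```python
def find_table_start(text, header_keywords, min_hits=2):
    """Return the index of the first line that looks like a table header, or -1."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        lowered = line.lower()
        # Count how many of the expected column names appear on this line.
        hits = sum(1 for keyword in header_keywords if keyword.lower() in lowered)
        if hits >= min_hits:
            return index
    return -1


# Hypothetical text, as it might come back from a PDF text extraction step.
sample_text = (
    "Customer: ACME Ltd\n"
    "Please find the order details below.\n"
    "Product Code  Description  Quantity  Unit Price\n"
    "P100  Blue widget  2  10.00\n"
)

keywords = ["product code", "quantity", "unit price"]
start = find_table_start(sample_text, keywords)
# Everything after the header line is treated as table content.
table_lines = sample_text.splitlines()[start + 1:] if start >= 0 else []
```

Requiring at least two keyword hits on the same line reduces the chance that an ordinary paragraph mentioning, say, "quantity" is mistaken for the header.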

The question is: how do we extract the tabular data? If I read it using OCR, the data gets realigned and does not retain its original position, which makes it difficult to split the fields. Since the position of the tabular data varies for each template, I need to pass the clipping region dynamically and extract structured data based on that. I would appreciate your help with a simple example.
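
One possible way to cope with a table whose position changes per template is to derive the column boundaries from the header line itself instead of hard-coding a region. The Python sketch below assumes the text was extracted in a way that preserves horizontal alignment (which, as noted above, OCR output often does not). The column names, helper names and sample row are made up.

```python
def column_offsets(header_line, column_names):
    """Find the starting character offset of each column name in the header."""
    offsets = []
    for name in column_names:
        position = header_line.find(name)
        if position < 0:
            raise ValueError(f"column {name!r} not found in header")
        offsets.append(position)
    return offsets


def slice_row(row_line, offsets):
    """Cut a row into fields using the header offsets as column boundaries."""
    boundaries = offsets + [None]  # None means "to the end of the line"
    return [row_line[start:end].strip()
            for start, end in zip(boundaries, boundaries[1:])]


# Hypothetical aligned text; format specs guarantee the columns line up.
columns = ["Code", "Description", "Qty", "Price"]
header = f"{'Code':<8}{'Description':<20}{'Qty':<6}{'Price'}"
row = f"{'P100':<8}{'Blue widget large':<20}{'2':<6}{'10.00'}"

offsets = column_offsets(header, columns)
fields = slice_row(row, offsets)
```

Because the boundaries come from the header, a description containing spaces stays in one field. The sketch would need more care if one column name is a substring of another (for example "Price" inside "Unit Price"), since `str.find` returns the first match.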

ddpadil · September 6, 2017, 8:46am

Hi,
If the PDF is native, try the Data Scraping wizard (it works for tabular data).

lissynikkytha · September 6, 2017, 9:20am

Data scraping doesn’t work in my case.

ddpadil · September 6, 2017, 9:31am

Oh, OK then…
All I can think of is either the scraper or reading the PDF text, but both return string output, and then
you need to use indexing and substring operations to get each item and then pass it to Excel (optional).

PS: How about the Generate Table activity? It generates a DataTable variable from unstructured data.
In the CE 2017 edition it is integrated with the scraper, where the user can choose the column separator (space/tab/newline) and the newline separator (space/tab/newline), and it returns the data table as the output.
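
To make the idea behind the separator options concrete, here is a rough Python analogue. It is not the UiPath activity itself; the function name `generate_table` and the sample input are hypothetical. It simply splits text into rows and cells on the chosen separators.

```python
def generate_table(text, column_separator="\t", row_separator="\n"):
    """Split raw text into a list of rows, each a list of trimmed cell strings."""
    rows = []
    for raw_row in text.split(row_separator):
        if not raw_row.strip():
            continue  # skip blank lines
        rows.append([cell.strip() for cell in raw_row.split(column_separator)])
    return rows


# Hypothetical tab-separated input.
raw = "Code\tQty\tPrice\nP100\t2\t10.00\nP200\t1\t4.50\n"
table = generate_table(raw)
```

This kind of splitting only works when the column separator never occurs inside a cell, so a plain space separator breaks down as soon as a value contains spaces.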

Ajithkumar_P · September 6, 2017, 9:56am

@ddpadil

Hi,

Can you show me how to use the Generate Table activity?

Thanks in advance.

ddpadil · September 6, 2017, 10:01am

Give it a try with CE 2017.
For reference, see the screenshots (bottom right of the screen):
[screenshot: generate datatable]
[screenshot: table]

Ajithkumar_P · September 6, 2017, 10:02am

Thanks ddpadil

lissynikkytha · September 12, 2017, 5:53am

What if the tabular content spans more than one page and the format is not standard?

ddpadil · September 12, 2017, 1:33pm

Two options:
1. Use the Read PDF activity, where you can set the PDF page number (the extraction process remains the same as in my previous comment).
2. Otherwise, use PDF shortcut keys (Ctrl+Shift+N or Page Down) with the SendHotKey activity and perform the extraction.
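
If the text of each page is read separately, the rows of a table that spans pages can be stitched together afterwards. The Python sketch below assumes the page texts are already available as a list of strings and that the header line repeats on every page that continues the table. The `header_marker` value, the `ROW_PATTERN` regex and the sample pages are illustrative assumptions, not taken from the attached samples.

```python
import re

# Assumed row shape: code, free text, integer quantity, price with two decimals.
ROW_PATTERN = re.compile(r"^\S+\s+.+\s+\d+\s+\d+\.\d{2}$")


def collect_rows(pages, header_marker="Product Code"):
    """Collect table rows from every page, skipping repeated header lines."""
    rows = []
    for page_text in pages:
        in_table = False
        for line in page_text.splitlines():
            stripped = line.strip()
            if header_marker in stripped:
                in_table = True  # the table (re)starts after the header
                continue
            if in_table and ROW_PATTERN.match(stripped):
                rows.append(stripped)
            elif in_table and stripped:
                in_table = False  # first non-row, non-blank line ends the table
    return rows


# Hypothetical page texts.
pages = [
    "Order for ACME Ltd\nProduct Code  Description  Qty  Price\nP100  Blue widget  2  10.00\n",
    "Product Code  Description  Qty  Price\nP200  Red widget  1  4.50\nTotal  14.50\n",
]
all_rows = collect_rows(pages)
```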

lissynikkytha · September 13, 2017, 3:47am

This will not work for my scenario. I am attaching the samples. The position of the tabular data varies for each format.
Sample.pdf (178.0 KB)
Sample1.pdf (181.6 KB)

Ajithkumar_P · September 14, 2017, 4:10am

Hi,

Which details do you want to extract … like the total?

Ajithkumar_P · September 14, 2017, 4:22am

Hi @lissynikkytha,

Can you try this: txt read from pdf.xaml (13.2 KB)

Then use the Replace activity to replace spaces with a comma separator and save the result in CSV format, so that the CSV file contains structured data.

jmf · February 13, 2018, 11:22pm

What fields do you want to extract from the sample templates? Please list the exact fields.

lissynikkytha · February 14, 2018, 4:23am

I want to extract the tabular data, which has product code, product description, supplier id, Csct, quantity, unit price and extended price. If I replace spaces with a comma separator, descriptions that contain spaces will be difficult to handle.
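
One way around the description problem is to avoid splitting on every space. The Python sketch below assumes that every field except the description is a single token without spaces, so the first token is the product code, the last five tokens are the remaining fields, and whatever lies in between is the description. That assumption, the `parse_order_line` helper and the sample row are hypothetical and would need checking against the real PDFs.

```python
FIELD_NAMES = ["product_code", "description", "supplier_id", "csct",
               "quantity", "unit_price", "extended_price"]


def parse_order_line(line):
    """Split a row into fields, keeping spaces inside the description."""
    tokens = line.split()
    trailing = 5  # supplier id, Csct, quantity, unit price, extended price
    if len(tokens) < trailing + 2:
        raise ValueError(f"not enough fields in line: {line!r}")
    code = tokens[0]
    # Everything between the code and the trailing fields is the description.
    description = " ".join(tokens[1:-trailing])
    values = [code, description] + tokens[-trailing:]
    return dict(zip(FIELD_NAMES, values))


# Made-up row; the real samples may differ.
record = parse_order_line("P100 Blue widget, large S77 A1 2 10.00 20.00")
```

Writing the resulting records with a CSV writer (rather than replacing spaces with commas) also keeps commas inside a description from breaking the columns.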

prathapr · June 1, 2018, 1:50pm

@ddpadil

Can you please explain in detail.

jaiswalvivek91 · October 8, 2018, 5:27am

Did you get a solution for this?
I’m trying the same thing, but while extracting the comma separated values from a table I am facing issues.

rparvat4 · October 15, 2018, 7:22am

I have a table that looks like the one below.

When I try to extract it as a table, it considers each line as a table and extracts only one line.
Any help on this, please…

Pavan_Kodali · May 6, 2019, 9:09am

Dear Team,
I am not able to see the “Generate Table” button under the “Screen Scraping” wizard; instead, I am getting “Copy to Clipboard”.
I am using “CV 2019.4.2”.

Thanks in advance.

loginerror · May 6, 2019, 2:27pm

Hi @Pavan_Kodali

This functionality has been decoupled. You should now use the Generate Data Table activity. It is possible to paste your example input in its wizard, and this is why you see an option to copy to clipboard in your Screen Scraping wizard.

Janga_Shiva_Raj · August 29, 2019, 11:43am

Hi @ddpadil

I have extracted data from the PDF, but it has printed all of the data in the first column … how do I split it into different columns?
Please help!
