Extract Table data from PDF

The input to my workflow is a folder of PDF documents that do not share a standard format. I need to extract the order details, which are in tabular format in the PDF. Apart from the tabular data, the PDF also contains paragraphs and customer information. I can identify the line where the tabular data starts by splitting the PDF content on Environment.NewLine and applying string functions to each line.
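For readers who want to see the line-splitting idea spelled out, here is a minimal sketch in Python rather than in a UiPath workflow. The function name `find_table_start`, the header keywords, and the sample text are all hypothetical; it assumes the table begins right after a header line that contains every keyword.

```python
def find_table_start(pdf_text, header_keywords):
    """Return the index of the first line containing every keyword, or -1."""
    lines = pdf_text.splitlines()
    for index, line in enumerate(lines):
        lowered = line.lower()
        if all(keyword.lower() in lowered for keyword in header_keywords):
            return index
    return -1


# Hypothetical text standing in for the string returned by a PDF text read.
sample_text = (
    "Customer: Example Ltd\n"
    "Thank you for your order.\n"
    "Product Code  Description  Quantity  Unit Price\n"
    "A100  Blue widget  2  10.00\n"
)

start = find_table_start(sample_text, ["Product Code", "Quantity"])
# Everything after the header line is treated as table rows.
table_lines = sample_text.splitlines()[start + 1:] if start >= 0 else []
```

Matching on keywords instead of a fixed line number is what lets the same check cope with templates where the table sits at different positions.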

The question is: how do we extract the tabular data? If I read it using OCR, the data gets realigned without retaining its actual position, which makes it difficult to split the fields. Since the position of the tabular data varies for each template, I need to pass the clipping region dynamically and extract structured data based on that. I would appreciate your help with a simple example.

Hi,
If the PDF is native, try the Data Scraping wizard (it works for tabular data).

Data scraping doesn’t work in my case.

Oh ok then…:roll_eyes:
All I can think of is either the scraper or Read PDF Text, but both return string output, and then
you need to use indexing and substring operations to get each item and pass it to Excel (optional).

PS: How about the Generate Table activity? It generates a DataTable variable from unstructured data.
In the CE 2017 edition it is integrated with the scraper, where the user can choose the column separator (space/tab/newline) and the newline separator (space/tab/newline), and it returns the data table as the output.
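To illustrate what choosing a column separator and a newline separator amounts to, here is a rough Python sketch. It is not the activity's actual implementation; `generate_table` and the sample string are made up for illustration, and it assumes the separators never occur inside a cell.

```python
def generate_table(text, column_separator="\t", row_separator="\n"):
    """Split unstructured text into a list of rows, each a list of cells."""
    rows = []
    for raw_row in text.split(row_separator):
        if not raw_row.strip():
            continue  # skip blank rows, e.g. after a trailing separator
        rows.append([cell.strip() for cell in raw_row.split(column_separator)])
    return rows


# Hypothetical scraped text with tab-separated columns.
sample = "Code\tQty\tPrice\nA100\t2\t10.00\nB200\t1\t5.50\n"
table = generate_table(sample)
```

This also shows why a plain space separator breaks down as soon as a cell itself contains spaces.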


@ddpadil

Hi,

Can you show me how to use the Generate Table activity?

Thanks in advance.

Give it a try with CE 2017. :slight_smile:
For reference, see the screenshots (bottom right of the screen):
[screenshot: generate datatable]
[screenshot: table]

Thanks ddpadil

What if the tabular content spans more than one page and the format is not standard?

Two options:
1. Use the Read PDF activity, where you can set the PDF page number (the extraction process remains the same as in my previous comment).
2. Otherwise, use the PDF shortcut keys (Ctrl+Shift+N or Page Down) with the Send Hotkey activity and then perform the extraction.

This will not work for my scenario. I am attaching the samples. The position of the tabular data varies for each format.
Sample.pdf (178.0 KB)
Sample1.pdf (181.6 KB)

Hi,

Which details do you want to extract … like the total?

Hi @lissynikkytha,

Can you try this: txt read from pdf.xaml (13.2 KB)

Then use the Replace activity to replace spaces with a comma separator and save the result in CSV format, so the CSV file contains structured data.

What fields do you want to extract from the sample templates? Please list the exact fields.

I want to extract the tabular data, which has product code, product description, supplier id, Csct, Quantity, unit price and extended price. If I replace spaces with a comma separator, descriptions that contain spaces will be difficult to handle.
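One possible way around the description problem, sketched in Python under loud assumptions: every row lists the seven fields in the order given above, and only the description contains spaces. Under those assumptions the first token and the last five tokens can be peeled off, and whatever remains in the middle is the description. The function name `parse_order_row` and the sample row are hypothetical and not taken from the attached samples.

```python
def parse_order_row(line):
    """Split one table row into seven fields, keeping spaces in the description."""
    # Peel the five trailing fields off the right-hand side first.
    parts = line.rsplit(None, 5)
    if len(parts) != 6:
        raise ValueError("row has too few fields: %r" % line)
    head, supplier_id, csct, quantity, unit_price, extended_price = parts

    # The first token of what is left is the product code; the rest is the description.
    code_and_description = head.split(None, 1)
    if len(code_and_description) != 2:
        raise ValueError("row has no description: %r" % line)
    product_code, description = code_and_description

    return {
        "product_code": product_code,
        "description": description.strip(),
        "supplier_id": supplier_id,
        "csct": csct,
        "quantity": quantity,
        "unit_price": unit_price,
        "extended_price": extended_price,
    }


# Hypothetical row: the description "Blue steel widget" contains spaces.
row = parse_order_row("A100 Blue steel widget S77 C1 2 10.00 20.00")
```

If any other field can also contain spaces, this breaks, and a position-based or pattern-based split would be needed instead.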

@ddpadil

Can you please explain in detail?

Did you get a solution for this?
I’m trying the same thing, but I am facing issues while extracting the comma-separated values from a table.

I have a table that looks like the one below.

When I try to extract this as a table, it treats each line as a separate table and extracts only one line.
Any help on this, please …

Dear Team,
I am not able to see the “Generate Table” button in the “Screen Scraping” wizard; instead, I am getting “Copy to Clipboard”.
I am using “CV 2019.4.2”.

Thanks in advance.

Hi @Pavan_Kodali

This functionality has been decoupled. You should now use the Generate Data Table activity. It is possible to paste your example input in its wizard and this is why you see an option to copy to clipboard in your Screen Scraping wizard :slight_smile:


Hi @ddpadil

I have extracted data from the PDF, but the whole data has been written into the first column … how can I split it into different columns?
Please help!
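A rough idea for the single-column case, shown in Python rather than as a UiPath workflow: if the extracted text keeps two or more spaces between columns while single spaces only appear inside a cell, each line can be split on those wider gaps. That spacing is an assumption about the extracted text, and `split_columns` and the sample lines are hypothetical.

```python
import re


def split_columns(lines, min_gap=2):
    """Split each non-empty line into cells on runs of min_gap or more whitespace."""
    pattern = re.compile(r"\s{%d,}" % min_gap)
    return [pattern.split(line.strip()) for line in lines if line.strip()]


# Hypothetical lines as they might appear in the first column of the output.
one_column_lines = [
    "A100   Blue widget   2",
    "",
    "B200   Red bolt   1",
]
columns = split_columns(one_column_lines)
```

If the columns are separated by a single space only, this cannot tell a column break from a space inside a cell, and a different rule is needed.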