Extract Table data from PDF

datatable
studio

#1

The input to my workflow is a set of PDF documents from a folder, and they do not share a standard format. I need to extract the order details, which are in tabular format in the PDF. Apart from the tabular data, the PDF also contains paragraphs or customer information. I can identify the line where the tabular data starts by reading the PDF content, splitting it line by line on Environment.NewLine, and using string functions.
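As a rough illustration of that line-by-line search, here is a minimal Python sketch. It assumes the PDF text is already available as a string; the function name `find_table_start`, the header keywords, and the sample text are all hypothetical and not taken from the attached samples.

```python
def find_table_start(text, header_keywords):
    """Return the index of the first line containing every keyword, or -1."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        lowered = line.lower()
        # A line counts as the table header only if all keywords appear in it.
        if all(keyword.lower() in lowered for keyword in header_keywords):
            return index
    return -1


# Hypothetical text as it might come back from a PDF text extraction step.
sample_text = (
    "Customer: ACME Ltd\n"
    "Thank you for your order.\n"
    "Product Code  Description  Quantity  Unit Price\n"
    "P100  Blue widget  2  10.00\n"
)

start = find_table_start(sample_text, ["Product Code", "Quantity"])
# Everything after the header line is treated as table rows.
table_lines = sample_text.splitlines()[start + 1:] if start >= 0 else []
```

Matching on several header keywords rather than a fixed line number is what lets the same search cope with templates where the table sits at different positions.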

The question is: how do we extract the tabular data? If I read it using OCR, the data gets realigned and does not retain its actual position, which makes it difficult to split the fields. Since the position of the tabular data varies for each template, I need to pass the clipping region dynamically and extract structured data based on that. I would appreciate your help on this with a simple example.


#2

Hi,
If the PDF is native, try the data scraping wizard (it works for tabular data).


#3

Data scraping doesn’t work in my case.


#4

Oh ok then… :roll_eyes:
All I can think of is either the scraper or Read PDF Text, but both will return string output.
You then need to use indexing and substring operations to get each item, and then pass the result to Excel (optional).
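A minimal Python sketch of the indexing and substring idea, assuming the extracted text keeps its column alignment (which may not hold for OCR output). The helper names and the sample header and row are hypothetical.

```python
def column_offsets(header_line, column_names):
    """Find the starting character index of each column name in the header."""
    return [header_line.index(name) for name in column_names]


def slice_row(row_line, offsets):
    """Cut a row into fields, using the header offsets as column boundaries."""
    bounds = offsets + [len(row_line)]
    return [row_line[bounds[i]:bounds[i + 1]].strip() for i in range(len(offsets))]


# Hypothetical fixed-width lines, built with ljust so the columns line up.
header = "Code".ljust(7) + "Description".ljust(19) + "Qty".ljust(5) + "Price"
row = "P100".ljust(7) + "Blue widget large".ljust(19) + "2".ljust(5) + "10.00"

offsets = column_offsets(header, ["Code", "Description", "Qty", "Price"])
fields = slice_row(row, offsets)
```

Because the boundaries come from the header line of each document, nothing is hard-coded per template, but the approach breaks as soon as a value starts to the left of its column header.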

PS: How about the Generate Table activity? It generates a DataTable variable from unstructured data.
In the CE 2017 edition it is integrated with the scraper: the user can choose the column separator (space/tab/newline) and the newline separator (space/tab/newline), and the activity returns a data table as the output.
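To make the separator idea concrete, here is a conceptual Python sketch of turning text into rows and cells with a chosen column separator and row separator. It is only an illustration of the concept, not how the Generate Table activity is implemented; `generate_table` and its parameters are hypothetical names.

```python
def generate_table(text, column_separator="\t", row_separator="\n"):
    """Split text into rows, then each row into cells, skipping blank rows."""
    rows = []
    for raw_row in text.split(row_separator):
        if raw_row.strip():
            rows.append([cell.strip() for cell in raw_row.split(column_separator)])
    return rows


# Hypothetical tab-separated text with one order line per row.
table = generate_table("P100\tBlue widget\t2\nP200\tRed widget\t5\n")
```

A tab or another character that never appears inside a field is a safer column separator than a single space, since a space would also split multi-word values.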


#5

@ddpadil

Hi,

Can you show me how to use the Generate Table activity?

Thanks in advance


#6

Give it a try with CE 2017. :slight_smile:
For reference.
(right down the screen)


#7

Thanks ddpadil


#8

What if the tabular content spans more than one page and the format is not standard?


#9

Two options:
1. Use the Read PDF activity, which lets you set the PDF page number. (The extraction process remains the same as mentioned in the previous comment.)
2. Otherwise, use the PDF shortcut keys (Ctrl+Shift+N or Page Down) with the SendHotKey activity and perform the extraction on each page.


#10

This will not work for my scenario. I am attaching the samples. The position of the tabular data varies for each format.
Sample.pdf (178.0 KB)
Sample1.pdf (181.6 KB)


#11

Hi,

Which details do you want to extract … like total,


#12

Hi @lissynikkytha,

Can you try this:
txt read from pdf.xaml (13.2 KB)

Then use the Replace activity to replace spaces with comma separators and save the result in CSV format, so that the CSV file contains structured data.


#13

What fields do you want to extract from the sample templates? Please list the exact fields.


#14

I want to extract the tabular data, which has product code, product description, supplier id, Csct, quantity, unit price and extended price. If I replace spaces with a comma separator, descriptions that contain spaces will be difficult to handle.
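One possible way around the spaces in the description is to split each row from both ends: take the first token as the product code, the last five tokens as the remaining fields, and join whatever is left in the middle as the description. The Python sketch below assumes that only the description contains spaces and that every row has all seven fields; the function name and the sample row are hypothetical.

```python
def parse_order_row(line, trailing_fields=5):
    """Split a row so that spaces inside the description are preserved."""
    tokens = line.split()
    # Need a product code, at least one description word, and the trailing fields.
    if len(tokens) < trailing_fields + 2:
        return None
    product_code = tokens[0]
    description = " ".join(tokens[1:-trailing_fields])
    # Trailing fields: supplier id, Csct, quantity, unit price, extended price.
    return [product_code, description] + tokens[-trailing_fields:]


# Hypothetical row; the real samples may be laid out differently.
row = parse_order_row("P100 Blue widget large S77 C1 2 10.00 20.00")
```

Lines that return `None` (for example paragraphs or customer information with too few tokens) can be skipped, although longer non-table lines would still need a separate check such as testing that the price fields are numeric.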


#15

@ddpadil

Can you please explain in detail.