How to extract a table from a PDF without using an API key?

I have invoices from which I need to extract a table. The problem is that the client doesn’t want to use an API key or an OCR engine. I tried building the data table with string manipulation, but it didn’t give the desired result.

The table looks like this

Date Date Debit Credit Balance
xx xx xxxxxxxxx xxx xxxx
xx xx xxxxxxxxxxxxxx xxxx xxxxx
xxxxxx xx
xxxxxx xx
xx xx xxxxxxxxxxxxxx xx xxxxxxx
xx xx xxxxxxxxxx xx xxxxx
xxxxxxxxxxxx xx

Since it doesn’t have values in all columns it’s hard to extract the data.

Hi @sunilkanth ,

We need to know whether the PDF is digital or scanned. If it is a digital PDF, we could use Regex/string manipulation to get the required data after reading it with the PDF activities.

But if it is a scanned PDF, we would need to use an OCR engine for the extraction.
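To give a rough idea of the Regex route for a digital PDF, here is a minimal Python sketch. It assumes each main row starts with two dates in `d.m.yy` form followed by a description, an amount and a balance, and that lines without dates are sub-rows of the previous transaction. The patterns, the field names and the sample lines are all assumptions for illustration; the real layout may need different patterns.

```python
import re
from typing import List

# Assumed main-row shape: date, date, description, amount, balance.
NUMBER = r"[\d,]+(?:\.\d+)?"
ROW = re.compile(
    r"^(?P<date>\d{1,2}\.\d{1,2}\.\d{2,4})\s+"
    r"(?P<value_date>\d{1,2}\.\d{1,2}\.\d{2,4})\s+"
    r"(?P<description>.+?)\s+"
    rf"(?P<amount>{NUMBER})\s+"
    rf"(?P<balance>{NUMBER})$"
)
# Assumed sub-row shape: a label followed by a single number.
SUB_ROW = re.compile(rf"^(?P<description>.+?)\s+(?P<amount>{NUMBER})$")


def parse_lines(lines: List[str]) -> List[dict]:
    """Group dated rows with the undated sub-rows that follow them."""
    rows: List[dict] = []
    for raw in lines:
        line = raw.strip()
        match = ROW.match(line)
        if match:
            rows.append({**match.groupdict(), "details": []})
            continue
        sub = SUB_ROW.match(line)
        if sub and rows:
            # No dates on this line: attach it to the last dated row.
            rows[-1]["details"].append(sub.groupdict())
        # Anything else (header, blank line) is skipped.
    return rows


# Made-up sample lines in the assumed layout.
sample = [
    "Date Date Debit Credit Balance",
    "29.9.19 30.9.19 Credit Card34xxxxxx789 15,000.00 18999.5",
    "Interest 30.68",
    "Installment 60.50",
    "30.9.19 30.9.19 NetBanking 200 19199.5",
]
rows = parse_lines(sample)
```

Note that plain text like this cannot tell whether an amount belongs to the Debit or the Credit column, because the empty cell leaves no trace. That information only survives if the column positions are kept.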

It is a digital PDF.

I don’t have an option to attach an image here. I could show you a sample table if I did.

@sunilkanth ,

Check whether the approach in the post below suits your case:

Hi @sunilkanth

One thing you can try is to read the PDF with Preserve Format enabled. Then count the number of characters on each line and use those counts to extract the data. You will need to spend some time studying the extracted output, but you should be able to separate it into a table.

This way you know the width of each column and where it starts, so if a column is missing on a line you can still tell by counting the characters.
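The character-counting idea could look roughly like the Python sketch below. The start offsets in `COLUMN_STARTS` and the column names are hypothetical; you would measure the real ones from your own Preserve Format output. The sample text is built with `ljust` only so that the spacing matches those assumed offsets.

```python
from typing import Dict, List

# Hypothetical column start offsets (0-based), measured by hand from the
# preserve-format text. Replace them with the offsets of your own layout.
COLUMN_STARTS: Dict[str, int] = {
    "date": 0,
    "value_date": 14,
    "description": 35,
    "amount": 58,
    "balance": 81,
}


def split_fixed_width(line: str, starts: Dict[str, int]) -> Dict[str, str]:
    """Cut one line into cells using the start offset of each column."""
    names = list(starts)
    offsets = list(starts.values()) + [None]  # None = slice to end of line
    return {
        name: line[offsets[i]:offsets[i + 1]].strip()
        for i, name in enumerate(names)
    }


def parse_table(text: str, starts: Dict[str, int]) -> List[Dict[str, str]]:
    """Parse every non-empty line; missing cells come back as empty strings."""
    return [
        split_fixed_width(line, starts)
        for line in text.splitlines()
        if line.strip()
    ]


# Made-up sample that follows the assumed offsets.
sample = "\n".join([
    "22.6.19".ljust(14) + "23.6.19".ljust(21)
    + "Nykka E-Shopping".ljust(23) + "3000".ljust(23) + "29000",
    " " * 35 + "Interest".ljust(23) + "30.68",
])
rows = parse_table(sample, COLUMN_STARTS)
```

A row whose `date` cell comes back empty can then be treated as a continuation of the row above it.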

cheers

I have more than 50,000 files in one batch. Won’t converting that many files slow down the run?

I didn’t quite understand your logic. The number of characters is not the same on every line; since it’s an invoice, it varies from customer to customer.
Example:

Date Date Debit Credit Balance
22.6.19 23.6.19 Nykka E-Shopping 3000 29000
27.6.19 27.6.19 Mobile pay 25,000.50 3999.5
29.9.19 30.9.19 Credit Card34xxxxxx789 15,000.00 18999.5
Interest 30.68
Installment 60.50
30.9.19 30.9.19 NetBanking 200 19199.5

Hi @sunilkanth

If you check Preserve Format, it will keep each column at a fixed position on the line.
Say you always have these 5 columns, but some cells are empty. With Preserve Format you will get data like this:

Date           Date                  Debit              Credit                  Balance
22.6.19       23.6.19              Nykka E-Shopping       3000                   29000
                                   Interest               30.68
                                   Installment                                   60.50
30.9.19       30.9.19              NetBanking             200                     19199.5

Then you can apply the column logic. If you paste it into Notepad, you can see the starting character position of each column you want.

You have to find the maximum and minimum number of characters each column takes, both when data is present and when it is missing, and build the logic on those numbers.
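If measuring the positions by hand for every layout is too slow, the minimum and maximum extent of each column can also be estimated from the text itself. The Python sketch below is one way to do that, under the assumption that columns are separated by at least two character positions that are blank on every line. The function names, the `min_gap` threshold and the sample lines are made up for illustration.

```python
from typing import List, Tuple


def find_column_spans(lines: List[str], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) character spans of the columns.

    A span ends where at least `min_gap` consecutive positions are blank
    on every line, so single spaces inside a cell do not split it.
    """
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A position is occupied if any line has a non-space character there.
    occupied = [any(line[i] != " " for line in padded) for i in range(width)]

    spans: List[Tuple[int, int]] = []
    start = None
    gap = 0
    for i, used in enumerate(occupied):
        if used:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        spans.append((start, width - gap))
    return spans


def split_by_spans(line: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Slice one line with the detected spans; missing cells become ''."""
    return [line[s:e].strip() for s, e in spans]


# Made-up lines laid out at fixed positions, as Preserve Format would give.
lines = [
    "22.6.19".ljust(14) + "23.6.19".ljust(21)
    + "Nykka E-Shopping".ljust(23) + "3000".ljust(23) + "29000",
    " " * 35 + "Interest".ljust(23) + "30.68",
    " " * 35 + "Installment".ljust(46) + "60.50",
]
spans = find_column_spans(lines)
table = [split_by_spans(line, spans) for line in lines]
```

This only works when the gaps between columns stay blank on all lines of a page, so it is worth checking the detected spans on a few invoices before relying on them.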

Alternatively, you can try pasting this data into Excel with Ctrl+V. That might also split the data into separate columns and eventually give you the final table you need.

cheers

Thanks for the detailed info. I will try this and update the result

If I use Preserve Format, it does not read the whole PDF. Do you know why that is happening?

Hi @sunilkanth

Try using Read PDF with OCR; that might help…

Ideally, if you get the whole data from Read PDF Text, you should get it with Preserve Format as well.

Is it a form PDF, or does it contain images or multiple colors?

As an alternative, you can also try converting the PDF to Excel or Word to get the data.

When you say it does not read, is it missing some columns or reading only a few pages?

Cheers

When I read the PDF without Preserve Format, it reads the whole PDF; when I enable Preserve Format, it reads only part of it. I am not sure why it does that.

It is an invoice, at most 5 to 6 pages long, and it does not have any images.

Hi @sunilkanth

Did you try reading the PDFs individually? Is it still truncating?

Or are you seeing partial data only when you check in the Locals panel?

Is some information missing, or is the PDF read for only a few pages?

cheers