I have invoices from which I need to extract a table. The problem is that the client doesn’t want to use an API key or an OCR engine. I tried generating the data table using string manipulation, but it didn’t give the desired result.
The table looks like this:
Date Date Debit Credit Balance
xx xx xxxxxxxxx xxx xxxx
xx xx xxxxxxxxxxxxxx xxxx xxxxx
xxxxxx xx
xxxxxx xx
xx xx xxxxxxxxxxxxxx xx xxxxxxx
xx xx xxxxxxxxxx xx xxxxx
xxxxxxxxxxxx xx
Since not every row has a value in every column, it’s hard to extract the data.
We first need to know whether the PDF is digital or scanned. If it is a digital PDF, we could probably use Regex or string manipulation to get the required data after reading it with the PDF activities.
But if it is a scanned PDF, there is no text to read, so an OCR engine would be required for the extraction.
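A quick way to tell the two cases apart is to look at what a non-OCR read returns: a scanned PDF has no text layer, so the result is empty or nearly empty. Here is a minimal Python sketch of that check. The function name `has_text_layer` and the threshold of 20 characters are assumptions made for illustration, not part of any PDF activity.

```python
def has_text_layer(extracted_text, min_chars=20):
    """Guess whether a PDF is digital from the text a non-OCR read returned.

    min_chars is an arbitrary, hypothetical threshold; tune it for your files.
    """
    # Count only letters and digits so whitespace and stray symbols are ignored.
    visible = [ch for ch in extracted_text if ch.isalnum()]
    return len(visible) >= min_chars


digital_sample = "Date Date Debit Credit Balance 22.6.19 23.6.19"
scanned_sample = "  \n\n "
print(has_text_layer(digital_sample), has_text_layer(scanned_sample))
```

If the check fails on every page, string manipulation alone will not work and OCR is the only option.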
One thing you can try is to read the PDF with preserve format enabled, then count the character positions on each line and extract the data by position. You will need to spend some time studying the extracted output, but you should be able to split it into a table.
This way you know, for each column, where it starts and how wide it is, so if a value is missing you can tell from the character count that the column is empty and skip it.
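The position-based idea above can be sketched in Python as follows. The column names and offsets in `COLUMNS` are hypothetical, and so is the choice to put 3000 under a debit column; you would measure the real offsets from your own output. The sample lines are built with padding here only to stand in for the layout-preserved text.

```python
# Hypothetical column layout: name -> (start, end) character offsets.
# Measure the real offsets from your own layout-preserved text.
COLUMNS = {
    "txn_date": (0, 10),
    "value_date": (10, 20),
    "description": (20, 40),
    "debit": (40, 50),
    "credit": (50, 60),
    "balance": (60, 70),
}


def split_fixed_width(line, columns=COLUMNS):
    """Slice one line by character position; blank slices become empty strings."""
    return {name: line[start:end].strip() for name, (start, end) in columns.items()}


# Sample lines padded to the widths above, standing in for the PDF output.
ROW_FORMAT = "{:<10}{:<10}{:<20}{:<10}{:<10}{:<10}"
sample_lines = [
    ROW_FORMAT.format("22.6.19", "23.6.19", "Nykka E-Shopping", "3000", "", "29000"),
    ROW_FORMAT.format("", "", "Interest", "30.68", "", ""),
]

rows = [split_fixed_width(line) for line in sample_lines]
for row in rows:
    print(row)
```

Because each value is taken from a fixed slice, a missing value simply comes back as an empty string instead of shifting the remaining values one column to the left.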
If you check preserve format, the text of each column comes out at a consistent position on every line.
Say you always have these 5 columns, but some of them are empty in a few rows. With preserve format you will get data like this:
Date Date Debit Credit Balance
22.6.19 23.6.19 Nykka E-Shopping 3000 29000
Interest 30.68
Installment 60.50
30.9.19 30.9.19 NetBanking 200 19199.5
Then you can build the logic around the columns. If you paste the text into Notepad, you can see the character position at which each column you want starts.
You have to find the minimum and maximum character positions each column occupies, both when the data is present and when it is missing, and build the logic with those numbers.
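Instead of reading the positions off Notepad by hand, the start of each column can also be taken from the header line. The Python sketch below assumes the values are left-aligned under their headers (right-aligned amounts would need the offsets adjusted) and that headers are single words. The "Details" column and the padding widths are hypothetical, used only to make the sample line up.

```python
import re


def column_spans(header_line):
    """Return (name, start, end) per header word; end is the next word's start."""
    matches = list(re.finditer(r"\S+", header_line))
    spans = []
    for i, m in enumerate(matches):
        # The last column has no successor, so it runs to the end of the line.
        end = matches[i + 1].start() if i + 1 < len(matches) else None
        spans.append((m.group(), m.start(), end))
    return spans


def split_by_spans(line, spans):
    """Slice a data line using the spans measured on the header."""
    return [line[start:end].strip() for _, start, end in spans]


# Hypothetical header and row, padded to stand in for layout-preserved text.
ROW_FORMAT = "{:<10}{:<10}{:<20}{:<10}{:<10}{:<10}"
header = ROW_FORMAT.format("Date", "Date", "Details", "Debit", "Credit", "Balance")
line = ROW_FORMAT.format("", "", "Installment", "60.50", "", "")

spans = column_spans(header)
values = split_by_spans(line, spans)
print(spans)
print(values)
```

Check the computed spans against a few real rows before relying on them, since a long value can spill past the position where the next header starts.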
Alternatively, you can try pasting this data into Excel with Ctrl+V. Excel may split the data into separate columns for you, which could get you close to the final table you need.
When I read the PDF without preserving the format, it reads the whole PDF, but when I preserve the format, it only reads part of it. I am not sure why it does that.
It is an invoice, at most 5 to 6 pages long, and it does not contain any images.