How to extract a table from a PDF without using an API key?

sunilkanth · November 24, 2022, 12:48pm

I have invoices from which I need to extract a table. The problem is that the client doesn’t want to use an API key or an OCR engine. I tried building the data table with string manipulation, but it didn’t give the desired result.

The table looks like this

Date Date Debit Credit Balance
xx xx xxxxxxxxx xxx xxxx
xx xx xxxxxxxxxxxxxx xxxx xxxxx
xxxxxx xx
xxxxxx xx
xx xx xxxxxxxxxxxxxx xx xxxxxxx
xx xx xxxxxxxxxx xx xxxxx
xxxxxxxxxxxx xx

Since not every column has a value in every row, it’s hard to extract the data.

supermanPunch · November 24, 2022, 12:51pm

Hi @sunilkanth ,

We need to know whether the PDF is digital or scanned. If it is a digital PDF, we could use Regex or string manipulation to get the required data after reading it with the PDF activities.

If it is a scanned PDF, an OCR engine would be required for the extraction.
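
A rough sketch of the Regex route, written in Python rather than as a UiPath workflow. It assumes a hypothetical layout in which every complete row has two dates written like 22.6.19, a free-text description, and two numeric values; the group names `amount` and `balance` are made up for illustration.

```python
import re

# Assumed row layout: date, date, description, amount, balance.
ROW = re.compile(
    r"^(?P<date1>\d{1,2}\.\d{1,2}\.\d{2})\s+"
    r"(?P<date2>\d{1,2}\.\d{1,2}\.\d{2})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>[\d,]+(?:\.\d+)?)\s+"
    r"(?P<balance>[\d,]+(?:\.\d+)?)$"
)


def parse_row(line):
    """Return the fields of a complete row, or None if the line does not fit."""
    match = ROW.match(line.strip())
    return match.groupdict() if match else None


print(parse_row("22.6.19 23.6.19 Nykka E-Shopping 3000 29000"))
print(parse_row("Interest 30.68"))  # incomplete row, not matched
```

Lines that lack some columns are not matched by this pattern and would need separate handling.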

sunilkanth · November 24, 2022, 12:55pm

It is a digital PDF.

I don’t have the option to attach an image here. I could show you a sample table if I did.

supermanPunch · November 24, 2022, 1:07pm

@sunilkanth ,

Check whether the approach in the post below suits your case:
https://forum.uipath.com/t/how-to-extract-dat-from-pdf-file-row-items-might-be-more-in-many-pages-also-i-have-to-extract-one-by-one-storing-them-in-a-excel/485321/4

Anil_G · November 24, 2022, 1:11pm

Hi @sunilkanth

One thing you can try is to read the PDF with preserve format, then count the characters in each line and extract the data by position. It takes some time to study the extracted output, but you should be able to split it into a table.

This way you know the width of each column and where it starts, so if a column is empty you can detect that by counting the characters.
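
A minimal sketch of this idea in Python rather than in a UiPath workflow, assuming the text was read with preserve format so that every column begins at a fixed character offset. The offsets in `STARTS` and the sample lines are hypothetical; the real ones have to be read off the extracted text.

```python
def split_fixed_width(line, starts):
    """Slice one line into cells at known column start offsets."""
    # The last column runs to the end of the line (slice end of None).
    bounds = list(starts) + [None]
    # Slicing past the end of a short line yields "", so missing
    # trailing columns simply come back as empty cells.
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]


# Hypothetical offsets and sample lines, for illustration only.
STARTS = [0, 10, 20, 40, 52]
full_row = f"{'22.6.19':<10}{'23.6.19':<10}{'Nykka E-Shopping':<20}{'3000':<12}29000"
partial_row = f"{'':<20}{'Interest':<20}30.68"

print(split_fixed_width(full_row, STARTS))
print(split_fixed_width(partial_row, STARTS))
```

An empty string in the result marks a column that had no value on that line.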

cheers

sunilkanth · November 25, 2022, 10:07am

I have more than 50,000 files in one batch. Won’t converting that many files slow down the run?

sunilkanth · November 25, 2022, 10:18am

I didn’t quite understand your logic. The number of characters is not the same on every line; since it’s an invoice, it varies from customer to customer.
Example:

Date Date Debit Credit Balance
22.6.19 23.6.19 Nykka E-Shopping 3000 29000
27.6.19 27.6.19 Mobile pay 25,000.50 3999.5
29.9.19 30.9.19 Credit Card34xxxxxx789 15,000.00 18999.5
Interest 30.68
Installment 60.50
30.9.19 30.9.19 NetBanking 200 19199.5

Anil_G · November 25, 2022, 10:24am

Hi @sunilkanth

If you check preserve format, the text comes out with each column at a fixed position.
Say you always have these 5 columns, but some of them are empty in a few rows. With preserve format you will get data like this:

Date           Date                  Debit              Credit                  Balance
22.6.19       23.6.19              Nykka E-Shopping       3000                   29000
                                   Interest               30.68
                                   Installment                                   60.50
30.9.19       30.9.19              NetBanking             200                     19199.5

Then you can build the logic on the column positions. If you paste the text into Notepad, you can see the character position at which each column starts.

Find the minimum and maximum number of characters each column takes up, both when data is present and when it is missing, and build the logic around those numbers.
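
One way to sketch this in Python (not UiPath) is to take the column start offsets from the header line and then place every cell in the column whose start is nearest. This assumes cells are left-aligned and that two or more consecutive spaces always separate columns; `fake_line` is a hypothetical helper that only imitates preserve-format output for the demo.

```python
import re

# Assumption: a cell may contain single spaces, and two or more
# spaces in a row mark a gap between columns.
CELL = re.compile(r"\S+(?: \S+)*")


def column_starts(header_line):
    """Return the character offset at which each header cell begins."""
    return [m.start() for m in CELL.finditer(header_line)]


def assign_cells(line, starts):
    """Place each cell in the column whose start offset is nearest."""
    row = [""] * len(starts)
    for m in CELL.finditer(line):
        nearest = min(range(len(starts)), key=lambda i: abs(starts[i] - m.start()))
        row[nearest] = m.group()
    return row


def fake_line(*cells, widths=(10, 10, 20, 12, 0)):
    """Hypothetical helper that mimics preserve-format output for the demo."""
    return "".join(cell.ljust(w) for cell, w in zip(cells, widths))


header = fake_line("Date", "Date", "Debit", "Credit", "Balance")
starts = column_starts(header)
print(assign_cells(fake_line("", "", "Installment", "", "60.50"), starts))
```

Right-aligned numbers would start at varying offsets, so a real layout may need to compare cell end positions or use a tolerance instead.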

Alternatively, you can try pasting this data into Excel with Ctrl+V; that may split the data into separate columns and give you the final table you need.

cheers

sunilkanth · November 25, 2022, 10:30am

Thanks for the detailed info. I will try this and post the result.

sunilkanth · November 28, 2022, 7:40am

If I use preserve format, it does not read the whole PDF. Do you know why that is happening?

Anil_G · November 28, 2022, 8:37am

Hi @sunilkanth

Try using Read PDF with OCR; that might help.

Ideally, if Read PDF Text returns the whole document, you should get the whole document with preserve format as well.

Is it a form PDF, or does it contain any images or multiple colors?

As an alternative, you can also try converting the PDF to Excel or Word to get the data.

When you say it does not read everything, is it missing some columns or reading only a few pages?

Cheers

sunilkanth · November 28, 2022, 2:00pm

When I read the PDF without preserving the format, it reads the whole PDF; when I preserve the format, it reads only part of it. I am not sure why.

It is an invoice, at most 5 to 6 pages long, and does not contain any images.

Anil_G · November 28, 2022, 2:05pm

Hi @sunilkanth

Did you try reading the PDFs individually? Is it truncated even then?

Or, when you check in the Locals panel, are you seeing only partial data?

Is some information missing, or is the PDF read for only a few pages?

cheers
