Any recommended solution for fetching data from PDF?

opas1216 · October 24, 2019, 1:26am

Hello, everyone

I’m sorry to ask what may be a basic question: what kind of solution would you recommend for fetching data from a PDF?

I have been looking for a good way to solve this for a month.

So far, I have tried the approaches below:

Convert PDF to Excel (third-party tool)
Result: Failed.
Reason: Not every file can be converted to Excel completely; some data goes missing.

Read PDF Text (PDF Activities)
Result: Failed, but I may keep working on this approach.
Reason: It can read all the text from the PDF, but splitting that text into CSV columns with “Generate Data Table” is a problem.
Although I asked the users to do their best to create the files in a rule-based format (the PDFs are invoices), they still make plenty of human mistakes, such as stray spaces, so the text can’t simply be split into CSV fields using a space as the separator.

OCR
Result: Failed.
Reason: It’s useful, but the output is inaccurate.

So I would like to know whether anyone has experience with this kind of task and has a good solution.
Any advice would be helpful.

Thank you!

Shriharsha_H_N · October 24, 2019, 5:14am

Would it be possible to share the PDF?

opas1216 · October 24, 2019, 6:33am

Hi @Shriharsha_H_N

Thank you for your reply.

I would share the PDF file if I could, but sorry, the content is confidential.

If there is any information you need, maybe I can describe it.

Ioana_Gligan · October 24, 2019, 9:44am

@opas1216

What data are you targeting, and how many templates do you need to process (that is, how many vendors)?

Have you tried the Machine Learning Extractor, if this is about invoices? It might fit your use case…

Ioana

wasea · October 25, 2019, 6:12pm

Hi @opas1216,

I am currently working for a customer to extract information from native PDF invoices.
They provided around 5 different invoices, and I created 5 different templates.

My approach:

Read PDF Text.
Use regex to strip out the common data you don’t need from the invoices (supplier details, client details, or any other unneeded information that sits in big blocks).
Collapse the entire result onto a single line of text, or remove all the spaces between words.
Use regex to extract the required information based on specific patterns.
Create a variable for each piece of information you want to extract from the PDF file.
Put the result in a data table and export it as an Excel file.
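
The steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not the actual workflow: the sample text, the boilerplate pattern, and the field names (`invoice_no`, `invoice_date`, `total`) are all made up, and it writes CSV instead of Excel to stay within the standard library. It assumes the PDF text has already been read into a string.

```python
import csv
import io
import re

# Hypothetical text, standing in for the output of a "Read PDF Text" step.
SAMPLE_TEXT = """ACME Supplies Ltd
Supplier details: 1 Example Road, Sample City
Invoice No:   INV-1001
Invoice Date: 2019-10-24
Total  Due:   1,250.00 EUR
"""

# Hypothetical patterns; each real template would need its own set.
BOILERPLATE = [re.compile(r"Supplier details:.*")]
FIELDS = {
    "invoice_no": re.compile(r"Invoice No:\s*([A-Z]+-\d+)"),
    "invoice_date": re.compile(r"Invoice Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total Due:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return one dict of extracted values for a single invoice text."""
    # Remove blocks that are not needed (here: the rest of one line).
    for pattern in BOILERPLATE:
        text = pattern.sub("", text)
    # Collapse every run of whitespace so the text becomes a single line
    # and stray extra spaces no longer break the patterns.
    text = re.sub(r"\s+", " ", text).strip()
    # Extract each field; a missing field stays empty for manual review.
    row = {}
    for name, pattern in FIELDS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def rows_to_csv(rows):
    """Serialize the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELDS))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv([extract_fields(SAMPLE_TEXT)]))
```

Leaving unmatched fields empty rather than raising an error fits the idea of handing the remaining cases to a person, since those rows are easy to filter out afterwards.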

Results:

I usually run it on batches of 150 invoices; the entire batch is done in around 30 seconds.
The result is an Excel file with 95% accuracy. We agreed that the other 5% will be handled manually.

Vasile.

opas1216 · November 21, 2019, 2:17am

Hi @Ioana_Gligan

Thanks for your advice, and sorry for the late reply.

I tried your solution by throwing some old data at it, and it did work!
I must say it’s a really useful feature, but unfortunately the data is highly confidential, so I can’t use the feature directly because of the concern that the data would be sent to the server.

I might try Python to see whether it can solve the problem.
I’ll report back after a while if I make any progress.

Thank you for your help.

opas1216 · November 21, 2019, 2:26am

Hi @wasea

Thanks for your help, and sorry for the late reply.

For now, I have solved my problem with your suggestion.
Fortunately, my data wasn’t too complex to arrange, so I could use regex to get the specific data I wanted.

However, a team member has the same kind of issue to overcome, and I’m still working on that one because the data is quite difficult to fetch.
The situation is this: when the PDF is read to text, the 2 “address” fields are each read correctly, but they get combined, since they are written on the left part and the right part of the same row in the PDF.

The problem I face now is that I can’t find a rule to separate the combined text of the 2 addresses.
Do you have any idea how to solve this problem?
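
One possible direction, sketched in Python: if the extracted text keeps the page layout, so that the left and right addresses on each row are separated by a run of two or more spaces, each line can be split at the first such gap. That layout assumption is a big one and may not hold for every reader or PDF; the sample text and the `split_columns` helper below are hypothetical.

```python
import re

# Hypothetical extracted text: two address blocks printed side by side,
# with the column gap preserved as a run of spaces.
SAMPLE_ROWS = """Bill To:                 Ship To:
Alice Example            Bob Sample
1 First Street           22 Second Avenue
Sample City              Example Town
"""


def split_columns(text, min_gap=2):
    """Split each line at the first run of at least min_gap spaces.

    Returns (left_lines, right_lines). A line without such a gap is
    treated as belonging to the left column only.
    """
    gap = re.compile(r" {%d,}" % min_gap)
    left, right = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = gap.split(line, maxsplit=1)
        left.append(parts[0])
        if len(parts) > 1:
            right.append(parts[1])
    return left, right


if __name__ == "__main__":
    left, right = split_columns(SAMPLE_ROWS)
    print("Left address: ", ", ".join(left))
    print("Right address:", ", ".join(right))
```

A row that has text only in the right column would be misfiled as left here, and if the reader collapses the gap to a single space this rule cannot work at all; in that case the word positions on the page would be needed instead of plain text.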

I’ll report my progress once I have any results.

Thank you for your help.

Ioana_Gligan · November 27, 2019, 6:51am

In the meantime, we are working on providing the invoice machine learning extractor for on-prem usage, so it is coming soon.

Ioana_Gligan · November 27, 2019, 6:53am

If the formats are pretty standard, how about trying to write a custom extractor? All the info you need is in the documentation, under the DocumentProcessing.Contracts package!

opas1216 · November 29, 2019, 5:53am

Hi @Ioana_Gligan
Thank you for your help. I’ll take your advice and try to write a custom extractor; I hope it can solve our problem.
I have also tried the invoice/receipt extractor feature, and it fetched all the information we need perfectly. However, since the information is highly confidential, we can’t use the feature directly, because the data would be sent to your server. We’re sorry, but we look forward to using this function once your team officially releases the new feature.

Thanks a lot!
