Any recommended solution for extracting data from a PDF?

Hello, everyone

Sorry if this is a basic question, but what kind of solution would you recommend for extracting data from a PDF?

I have been looking for a good way to solve this for a month.

So far, I have tried the following approaches:

  1. Convert PDF to Excel (3rd party)
    Result: Failed.
    Reason: Not every file can be converted to Excel completely; some data goes missing.

  2. Read PDF Text (PDF Activities)
    Result: Failed, but I may keep working on this approach.
    Reason: It can read all the text from the PDF, but separating it into CSV columns with “Generate Data Table” is a problem.
    Although I asked the users to do their best to create rule-based files (the PDF content is invoices), they still leave a lot of human mistakes, such as extra spaces, so the text can’t easily be split into CSV using a space as the separator.

  3. OCR
    Result: Failed.
    Reason: It works, but the recognized text is not accurate.

So I’d like to know whether anyone has experience with this and has a good solution.
Any advice would be helpful.

Thank you!

Could you share the PDF?

Hi @Shriharsha_H_N

Thank you for your reply.

I would share the PDF file if I could, but sorry, the content is confidential.

If there is any information you need, maybe I can describe it to you.

@opas1216

What data are you targeting, and how many templates do you need to process (that is, how many vendors)?

Have you tried the Machine Learning Extractor, if this is about invoices? It might fit your use case…

Ioana

Hi @opas1216,

I am currently working for a customer on extracting information from native PDF invoices.
They provided around 5 different invoices, and I created 5 different templates.

My approach:

  1. Read PDF Text.
  2. Use regex to remove the common data from the invoices that you don’t need (such as supplier details, client details, or any other unneeded information that comes in big blocks).
  3. Collapse the entire result into a single line of text, or remove all the spaces between words.
  4. Use regex to extract the required information based on specific patterns.
  5. Create a variable for each piece of information you want to extract from the PDF file.
  6. Put the results in a data table, and export it as an Excel file.
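
For anyone who wants to try the same idea outside of a workflow, steps 2 to 6 above could be sketched in Python roughly like this. It is only a minimal sketch under assumptions: the sample invoice text, the field names, and every regex pattern are made up for illustration, the text is assumed to have been read from the PDF already (step 1), and it writes CSV instead of Excel to stay within the standard library.

```python
import csv
import io
import re

# Hypothetical raw text, standing in for the output of a "Read PDF Text" step.
RAW_TEXT = """ACME Supplies Ltd.
12 Example Street, Sampletown
Invoice No:  INV-0042
Date:   2020-03-15
Total  Due :  1,234.50 EUR
"""

# Step 2: a big block we do not need (here, a made-up supplier header).
UNWANTED = re.compile(r"ACME Supplies Ltd\.\s*12 Example Street, Sampletown")

# Step 4: one hypothetical pattern per field we want to extract.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice No\s*:\s*(\S+)"),
    "date": re.compile(r"Date\s*:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total Due\s*:\s*([\d,]+\.\d{2})"),
}


def extract_fields(raw_text):
    """Return a dict with one entry per field (empty string if not found)."""
    # Step 2: drop the unwanted block.
    text = UNWANTED.sub(" ", raw_text)
    # Step 3: collapse all whitespace so the text is a single line.
    text = re.sub(r"\s+", " ", text).strip()
    # Steps 4 and 5: one value per field, taken from the first match.
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def rows_to_csv(rows):
    """Step 6: put the rows in a table and serialize it as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = extract_fields(RAW_TEXT)
csv_text = rows_to_csv([row])
print(csv_text)
```

Collapsing the whitespace first is what makes the patterns tolerant of stray spaces and line breaks in the source. Fields that are not found come back as empty strings, which makes the rows that need manual handling easy to spot.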

Results:

  1. I usually run it on batches of 150 invoices, and the entire batch is done in around 30 seconds. :smiley:
  2. The result is an Excel file with 95% accuracy. We agreed that the other 5% will be handled manually.

Vasile.

Hi @Ioana_Gligan

Thanks for your advice, and sorry for the late reply.

I tried your solution on some old data, and it did work!
I must say it’s a really useful feature, but unfortunately our data is highly confidential, so I can’t use the feature directly because of the concern that the data would be sent to the server.

I might try to see whether Python can solve the problem.
I’ll report back later if I make any progress.

Thank you for your help. :slight_smile:

Hi @wasea

Thanks for your help, and sorry for the late reply.

I have now solved my problem with your suggestion.
Fortunately, my data wasn’t too complex, so I could use regex to get the specific data I wanted.

However, a team member has a similar issue to overcome, so I’m still working on it, and this case is quite difficult.
When the PDF is read as text, both “address” fields are read correctly. But the two addresses get combined, because they are written on the left part and the right part of the same row in the PDF.

The problem I’m facing now is that I can’t find a rule to separate the combined text of the two addresses.
Do you have any idea how to solve this problem?
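
One idea that might be worth testing, sketched here in Python: if the extracted text still keeps a wide gap (two or more spaces) between the left and right columns, each line can be split at the first such gap. This is only a sketch under that assumption. The sample lines and the `split_columns` helper are made up for illustration, and it will not help if the reader already collapses the gap to a single space.

```python
import re

# Hypothetical layout-preserving text: two addresses printed side by side.
LINES = [
    "Bill To:                 Ship To:",
    "Example Corp             Example Warehouse",
    "1 Main Street            99 Dock Road",
    "Sampletown 12345         Harbor City 67890",
]

# Assumption: columns are separated by at least two whitespace characters,
# while words inside one column are separated by a single space.
COLUMN_GAP = re.compile(r"\s{2,}")


def split_columns(lines):
    """Split each line at its first wide gap into left and right parts."""
    left, right = [], []
    for line in lines:
        parts = COLUMN_GAP.split(line.strip(), maxsplit=1)
        left.append(parts[0])
        if len(parts) > 1:
            right.append(parts[1])
    return left, right


left, right = split_columns(LINES)
print("\n".join(left))
print("\n".join(right))
```

A known weak spot: a line that has text only in the right column would end up in the left list, because the leading spaces are stripped. In that case the split would have to rely on character positions rather than on gaps.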

I’ll report my progress once I have results.

Thank you for your help. :slight_smile:

In the meantime, we are working on providing the invoices machine learning extractor for on-prem usage, so it is coming soon :slight_smile:

If the formats are pretty standard, how about trying to write a custom extractor? All the info you need is in the Documentation, under the DocumentProcessing.Contracts package!

Hi @Ioana_Gligan
Thank you for your help. I’ll follow your advice and try to write a custom extractor; I hope it can solve our problem.
I also tried the invoice/receipt extractor feature, and it fetched all the information we need perfectly. However, since the information is highly confidential, we can’t use the feature directly because the data would be sent to your server. We’re sorry, but we look forward to using it when your team officially releases the new feature.

Thanks a lot!
