• Read any 15 fields from the PDF file (the PDF contains fields, and each field contains data, e.g. Address: XYZ, Singapore)
• Store the values in a destination table in the database
• Keep the values in a source table in the database
• After reading the PDF and storing the values in the destination table, the program compares a few fields against the source table
What I have done so far:
I am using Read PDF With OCR, and inside it I have placed the Microsoft OCR engine.
I used Write Text File to save the content of the PDF file to a text file and created one variable for it.
Then I used Read Text File to read the content and passed that variable to it.
The issues I am facing are:
The file is not being read properly row by row, and after reading the PDF some letters are converted into special characters.
In a few fields, the data does not appear in sequence in the text file after reading the PDF.
So extracting data from the text file is a problem; the data is not extracted properly in that case.
So, can anyone please suggest a way to read and extract the data from a PDF?
For your current use case, if you are using OCR you might not get 100% accuracy. The results vary, and this is due to the limitations of OCR.
1) File is not being read properly row by row and also after reading pdf it is converting some letters into special characters
You will get the entire file as a single string. You can split it on System.Environment.NewLine, store the result in an array, and then read the array line by line. Here again, because of document quality and OCR limitations, the results might not be 100% accurate, so you need to experiment with the scale and with different OCR engines, and make sure the PDF being read is of good quality.
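The split-and-iterate idea can be sketched outside the workflow as well. This is only an illustration in Python, not the UiPath implementation: the function name `split_into_lines` and the sample text are made up, and `splitlines()` stands in for splitting on System.Environment.NewLine.

```python
def split_into_lines(ocr_text: str) -> list[str]:
    """Split raw OCR output into trimmed, non-empty lines."""
    # splitlines() breaks on \r\n, \n and \r (among other line boundaries),
    # so the result does not depend on which newline style the OCR step emitted.
    return [line.strip() for line in ocr_text.splitlines() if line.strip()]


# Hypothetical OCR output; real output will be noisier.
sample = "Invoice Number: 12345\r\nDate: 01/02/2020\r\n\r\nAddress:\r\nXYZ, Singapore\r\n"

lines = split_into_lines(sample)
for number, line in enumerate(lines, start=1):
    print(number, line)
```

Dropping empty lines up front keeps the later field lookups simpler, but it does mean blank lines can no longer be used as section separators.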
2) In Few Fields data is not coming sequentially in text file after reading pdf
The data may not come sequentially, but there will be some pattern that you can identify and use to extract it. For instance, suppose you want to extract the Invoice Number, but the Invoice Number is on the second line and is followed by “Date”. In that case you first need to find the index of “Invoice Number” and then extract the data between “Invoice Number” and “Date”.
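As a rough sketch of that "find the index, then take what lies between two labels" approach, here is a small Python example. The helper `extract_between` and the sample text are hypothetical, and the characters stripped from the result (spaces, colons, line breaks) are an assumption about how the labels are usually followed.

```python
from typing import Optional


def extract_between(text: str, start_label: str, end_label: str) -> Optional[str]:
    """Return the text between start_label and the next end_label, or None."""
    start = text.find(start_label)
    if start == -1:
        return None  # start label not present in this document
    start += len(start_label)  # move past the label itself

    end = text.find(end_label, start)  # search only after the start label
    if end == -1:
        return None  # end label not present after the start label

    # Trim separators that typically surround the value (assumed here).
    return text[start:end].strip(" :\r\n\t")


# Hypothetical OCR output where the value lands on its own line.
sample = "Invoice Number\n: INV-001\nDate: 01/02/2020"
invoice_number = extract_between(sample, "Invoice Number", "Date")
print(invoice_number)
```

Returning None when a label is missing lets the caller decide whether to try another label or flag the document for manual review.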
3) So while extracting data from text file is getting problem, data is not properly extracted in that case.
Are you using the Read Text File activity, or is it via OCR?
If it is Read Text File, please give us a sample file and we will test whether there are any issues.
If it is via OCR, then point 1 applies here as well.
Thanks for all your suggestions regarding my issues.
As the data is not coming sequentially, I am following a pattern to extract the data from those fields. As you said, I first take the IndexOf that particular field, then find the starting position of the data, then find its ending position, and from those I calculate the length of the string to extract.
But my point is this: if I use any other PDF instead of this one, how will this logic work? Of course its field indexes and the starting and ending positions of the data will be completely different, so how can this pattern be applied to it?
Yes, it might not work, and hence you first need to take a sample of PDFs and find the patterns for all the fields.
Once you have identified the patterns, you can design the workflow accordingly. There will also be variations between PDFs that you need to handle inside your workflow, e.g. in one PDF the heading for the invoice number is "Invoice Num : " and in another it can be "Inv No: ", etc.
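One way to handle such heading variations is to keep a list of known labels per field and try them all. The Python sketch below is only illustrative: the label list `INVOICE_LABELS` and the function `find_labelled_value` are hypothetical names, and it assumes the value is a single token without spaces that follows the label and a colon.

```python
import re
from typing import Optional

# Hypothetical label variants collected from a sample of PDFs.
INVOICE_LABELS = ["Invoice Number", "Invoice Num", "Inv No"]


def find_labelled_value(text: str, labels: list[str]) -> Optional[str]:
    """Return the token after the first matching label and colon, or None."""
    # Escape each label so characters such as '.' are matched literally.
    alternatives = "|".join(re.escape(label) for label in labels)
    # Label, optional spaces, a colon, optional spaces, then one non-space token.
    pattern = re.compile(rf"(?:{alternatives})\s*:\s*(\S+)", re.IGNORECASE)
    match = pattern.search(text)
    return match.group(1) if match else None


print(find_labelled_value("Invoice Num : A-17\nDate: 01/02/2020", INVOICE_LABELS))
print(find_labelled_value("Inv No: B-42\nDate: 03/04/2020", INVOICE_LABELS))
```

New variants found during sampling can then be added to the list without touching the extraction logic.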
But in a few fields of the text file, half of a field's data comes first, then the second field's data is merged into it, and only after that does the remaining half appear. That is the big problem.
I want to extract only the tables from my PDF. My PDF has 4 pages: there is some text at the beginning of the first page, and after that there are only tables. How can I extract only the tables from it?
But in my scope of work I need to extract data from different PDF files.
The issue I am facing is:
For some PDF files the output is good with native scraping, while others require OCR. Is there any way I can build a bot that takes any PDF in a folder as input and decides by itself the better way of reading it (native or OCR)?