I am unable to read and extract data from a PDF file

My scope of work is:

• Read any 15 fields in the PDF file (the PDF contains fields, and each field contains data, e.g. Address: XYZ, Singapore)
• Store the values in a destination table in the database
• Keep the values in a source table in the database
• After reading the PDF and storing the values in the destination table, the program compares a few fields against the source table

What I have done so far:

  1. I am using the Read PDF with OCR activity, with the Microsoft OCR engine inside it.
  2. I added a Write Text File activity to save the content of the PDF to a text file, and created one variable for it.
  3. Then I added a Read Text File activity to read the content, and passed that variable to it.

The issues I am facing are:

  1. The file is not being read properly row by row, and after reading the PDF some letters are converted into special characters.

  2. In a few fields, the data does not come out sequentially in the text file after reading the PDF.

  • So extracting data from the text file is a problem; the data is not extracted properly in that case.

So, can anyone please suggest how to read and extract the data from a PDF?

Hi @NehaGhodki,

Firstly, welcome to the UiPath community.

As per your current use case, if you are using OCR you might not get 100% accuracy; the results vary, and this is due to the limitations of OCR.

  1. File is not being read properly row by row and also after reading pdf it is converting some letters into special characters
  • You will get the entire file as a string. You can split it on System.Environment.NewLine, store the result in an array, and then read the array line by line. Again, because of document quality and OCR limitations, the results might not be 100% accurate, so you need to experiment with the scale and with different OCR engines, and make sure the PDF being read is of good quality.

  2. In Few Fields data is not coming sequentially in text file after reading pdf
  • Data may not come sequentially, but there will be some pattern which you can identify and use to extract the data. For instance, suppose you want to extract the Invoice Number, but the value is on the second line and is followed by “Date”. In that case, first find the index of “Invoice Number” and then extract the data between “Invoice Number” and “Date”.

  3. So while extracting data from text file is getting problem, data is not properly extracted in that case.
  • Are you using the Read Text File activity, or is it via OCR?
    If Read Text File, please give us a sample file and we will test whether there are any issues.
    If via OCR, then point 1 applies here as well.
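
A minimal sketch of the two ideas above (splitting the text into lines, and extracting the value between two labels), written in Python rather than as a UiPath workflow. The sample text and the helper name `extract_between` are invented for illustration; real OCR output will be noisier.

```python
def extract_between(text, start_label, end_label):
    """Return the text found between two labels, or None if either label is missing."""
    start = text.find(start_label)
    if start == -1:
        return None
    start += len(start_label)          # move past the label itself
    end = text.find(end_label, start)  # look for the next label after the value
    if end == -1:
        return None
    # Trim separators and line breaks that surround the value.
    return text[start:end].strip(" :\r\n\t")


# Hypothetical OCR output, invented for this example.
sample_text = "Invoice Number\nINV-1001\nDate\n2017-05-01\n"

# Split on line breaks and drop empty lines, then read line by line.
lines = [line.strip() for line in sample_text.splitlines() if line.strip()]

invoice_number = extract_between(sample_text, "Invoice Number", "Date")
```

Returning `None` when a label is missing lets the caller decide whether to retry with another OCR engine or flag the document for manual review.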

Happy Designing!

Regards,
V

Hello @Vikas.Jain,

Thanks :slight_smile: for all your suggestions regarding my issues.

As the data is not coming sequentially, I am following a pattern to extract the data from those fields. As you said, I first take the IndexOf that particular field, then find the starting position of the data, then find its ending position, and from those I calculate the final length of the string.

1. PDF reading:

2. PDF data extraction from fields:

3. This is the PDF I am working on:
UiPath PDF.pdf (63.0 KB)

4. This is the image of the text file which contains the strings after reading the PDF data:

But my point is: if I use any other PDF instead of this one, how will this logic work? Of course its field indexes and the starting and ending positions of the data will be completely different, so how can this pattern be implemented for it?

Regards,
Neha

Hi @NehaGhodki,

Yes, it might not work, and hence you need to first take a sample of PDFs and find the patterns for all the fields.

Once you have identified the patterns, you can design the workflow accordingly. There will also be tweaks per PDF which you need to handle inside your workflow, e.g. in one PDF the heading for the invoice number is "Invoice Num : " and in another it can be "Inv No: ".
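
As a rough illustration of handling such heading variants, here is a sketch in Python using a regular expression. The label list and the function name `find_invoice_number` are assumptions made for this example, and the pattern assumes the value is a single run of non-space characters right after the label.

```python
import re

# Hypothetical label variants; extend this list after sampling real PDFs.
INVOICE_LABELS = ["Invoice Number", "Invoice Num", "Inv No"]

# Longest labels first, so "Invoice Number" is tried before "Invoice Num".
_alternatives = "|".join(
    re.escape(label) for label in sorted(INVOICE_LABELS, key=len, reverse=True)
)
# Label, optional colon, then the first run of non-space characters as the value.
INVOICE_PATTERN = re.compile(r"(?:" + _alternatives + r")\s*:?\s*(\S+)", re.IGNORECASE)


def find_invoice_number(text):
    """Return the value following any known invoice label, or None."""
    match = INVOICE_PATTERN.search(text)
    return match.group(1) if match else None
```

Keeping the variants in one list means a new heading found in a later PDF only needs one extra entry, not a new branch in the workflow.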

Regards,
V

Hi @Vikas.Jain,

Okay, I’ll work on the pattern.

But in a few fields of the text file, half of a field's data comes first, then the second field's data is merged into it, and then the remaining half follows. That’s the big problem.

Thanks & Regards,
Neha

@sreekanth

I have used screen scraping with PDFs to get the data. Try that.

Hi,

I want to extract only the tables from my PDF. My PDF consists of 4 pages; at the beginning there is some text on the first page, and after that there are only tables. How can I extract only the tables from it?

If I use screen scraping, how will I store the scraped data in an Excel file? Any suggestions?

Hi Vikas…
In this case, what needs to be done?
Please suggest a solution.

Thanks and regards,
Aswini

Hello @Vikas.Jain,

Your suggestions helped me, thank you.

But in my scope of work I need to extract data from different PDF files.

The issue I am facing is:

Some PDF files give good output with native scraping, and others require OCR. Is there any way I can build a bot which takes any PDF in a folder as input and decides by itself the better way of reading it (native or OCR)?
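
One possible way to make that decision, sketched in Python under assumptions: try the native read first, and fall back to OCR only when the native result looks empty or garbled. The callables `read_native` and `read_ocr` are hypothetical stand-ins for whatever native and OCR readers are used, and the thresholds are arbitrary starting points to tune on real samples.

```python
def looks_usable(text, min_chars=50, min_ratio=0.8):
    """Heuristic check: enough text, and mostly ordinary characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    # Count letters, digits, whitespace and common punctuation.
    ordinary = sum(
        ch.isalnum() or ch.isspace() or ch in ".,:;-/()@#&'\""
        for ch in stripped
    )
    return ordinary / len(stripped) >= min_ratio


def read_pdf_text(path, read_native, read_ocr):
    """Return (method, text), preferring the native read when it looks usable."""
    native_text = read_native(path)
    if looks_usable(native_text):
        return "native", native_text
    # Native output was empty or garbled (e.g. a scanned PDF), so use OCR.
    return "ocr", read_ocr(path)
```

Looping over the files in a folder and calling `read_pdf_text` on each one would then pick the method per document; logging the chosen method helps when tuning the thresholds.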

Hi,
The following points might be helpful:

  1. If the PDFs follow a similar format, you can copy the PDF content and paste the data into Excel. Then use an Excel scope to retrieve the particular data you need.

  2. You can convert the PDF to a text file using Adobe and then manipulate the text (this is a bit different from the UiPath read option). The text won't overlap :slight_smile:

Hi,
Please share some sample code, it will be useful :slight_smile:

Thanks,