I am unable to read and extract data from pdf file

NehaGhodki · April 25, 2017, 10:25am

My scope of work is:

• Read any 15 fields in the PDF file (the PDF contains fields, and each field contains data, e.g. Address: XYZ, Singapore)
• Store the values in a destination table in the database
• Keep the values in a source table in the database
• After reading the PDF and storing the values in the destination table, the program compares a few fields against the source table

What I have done so far:

I am using Read PDF With OCR with the Microsoft OCR engine.
I use Write Text File to save the content of the PDF to a text file, and I created one variable for it.
Then I use Read Text File to read the content, passing that variable to it.

The issues I am facing are:

The file is not read properly row by row, and after reading the PDF some letters are converted into special characters.
In a few fields, the data does not appear sequentially in the text file after reading the PDF.

So extracting data from the text file is a problem; the data is not extracted properly in those cases.

Can anyone please suggest how to read and extract the data from a PDF?

Vikas.Jain · April 25, 2017, 10:54am

Hi @NehaGhodki,

Firstly, Welcome to UiPath community.

For your current use case, if you are using OCR you might not get 100% accuracy. The results vary because of the limitations of OCR.

1) File is not being read properly row by row and also after reading pdf it is converting some letters into special characters

You will get the entire file as a string. You can split it on Environment.NewLine, store the result in an array, and then read the array line by line. Here again, because of document quality and OCR limitations, the output might not be 100% accurate, so you need to experiment with the scale setting and with different OCR engines, and make sure the PDF being read is of good quality.
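The split-and-iterate step can be sketched outside UiPath as follows. This is a minimal illustration in Python rather than a UiPath workflow; the `split_lines` helper and the sample text are made up for the example.

```python
def split_lines(text):
    """Split extracted text into stripped, non-empty lines."""
    # splitlines() handles \r\n, \n and \r, so the platform newline does not matter
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical OCR output, with Windows line endings and a blank line
sample = "Invoice Number: 1001\r\n\r\nDate: 25/04/2017\r\nAddress:\r\nXYZ, Singapore\r\n"

for number, line in enumerate(split_lines(sample), start=1):
    print(number, line)
```

Dropping blank lines up front keeps the later per-line logic simpler, at the cost of losing the original line numbers.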

2) In Few Fields data is not coming sequentially in text file after reading pdf
The data may not come sequentially, but there will be some pattern that you can identify and use to extract it. For instance, suppose you want to extract the Invoice Number, which is on the second line and is followed by “Date”. In that case, first find the index of “Invoice Number” and then extract the data between “Invoice Number” and “Date”.
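As a rough sketch of that index-based approach (in Python, not UiPath; the `extract_between` helper and the sample text are assumptions made for illustration):

```python
def extract_between(text, start_label, end_label):
    """Return the text between start_label and end_label, or None if either is missing."""
    start = text.find(start_label)
    if start == -1:
        return None
    start += len(start_label)          # the value begins right after the label
    end = text.find(end_label, start)  # search for the end label only after the start label
    if end == -1:
        return None
    # Trim separators and whitespace left around the value
    return text[start:end].strip(" :\r\n\t")


# Hypothetical extracted text
sample = "Customer: ABC Pte Ltd\nInvoice Number: INV-1001\nDate: 25/04/2017\n"

print(extract_between(sample, "Invoice Number", "Date"))  # INV-1001
```

Returning None when a label is missing lets the caller detect a document that does not match the expected pattern instead of silently extracting the wrong span.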

3) So while extracting data from text file is getting problem, data is not properly extracted in that case.
Are you using the Read Text File activity, or is it via OCR?
If Read Text File, please give us a sample file and we will test whether there are any issues.
If via OCR, then point 1 applies here as well.

Happy Designing!

Regards,
V

NehaGhodki · April 25, 2017, 11:49am

Hello @Vikas.Jain,

Thanks for all your suggestions regarding my issues.

Since the data is not coming sequentially, I am following a pattern to extract the data from those fields. As you said, I first take the IndexOf of that particular field, then find the starting position of its data, then find the ending position, and from those I calculate the length of the string to extract.

1. Pdf Reading :

2. Pdf Data Extraction from Fields

3. this is the pdf i am working on:
UiPath PDF.pdf (63.0 KB)

4. This is the image of the text file which contains the strings after reading the PDF data

But my point is: if I use any other PDF instead of this one, how will this logic work? Of course its field indexes and the starting and ending positions of the data will be completely different, so how can this pattern be applied to it?

Regards,
Neha

Vikas.Jain · April 25, 2017, 12:01pm

Hi @NehaGhodki,

Yes, it might not work, which is why you first need to take a sample of documents and find the patterns for all the fields.

Once you have identified the patterns, you can design the workflow accordingly. There will also be per-PDF variations that you need to handle inside your workflow, e.g. in one PDF the heading for the invoice number is "Invoice Num : " and in another it can be "Inv No: ".
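One way to handle such heading variations is to keep a list of known labels per field and try them all. The sketch below is in Python rather than UiPath, and the `FIELD_LABELS` table and `extract_field` helper are hypothetical; the label lists would come from your own sampling.

```python
import re

# Hypothetical label variants per field; extend these after sampling real documents
FIELD_LABELS = {
    "invoice_number": ["Invoice Number", "Invoice Num", "Inv No"],
    "date": ["Invoice Date", "Date"],
}


def extract_field(text, labels):
    """Return the rest of the line after the first matching label, or None."""
    # Try longer labels first so "Invoice Number" wins over "Invoice Num"
    ordered = sorted(labels, key=len, reverse=True)
    alternatives = "|".join(re.escape(label) for label in ordered)
    # Label, optional colon with surrounding spaces/tabs, then the value on the same line
    pattern = re.compile(r"(?:%s)[ \t]*:?[ \t]*(.+)" % alternatives, re.IGNORECASE)
    match = pattern.search(text)
    return match.group(1).strip() if match else None


pdf_a = "Invoice Num : A-17\nDate: 01/02/2018"
pdf_b = "Inv No: B-42\nInvoice Date: 03/02/2018"

print(extract_field(pdf_a, FIELD_LABELS["invoice_number"]))  # A-17
print(extract_field(pdf_b, FIELD_LABELS["invoice_number"]))  # B-42
```

Keeping the variants in a table means a new heading only needs a new list entry, not a change to the extraction logic.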

Regards,
V

NehaGhodki · April 25, 2017, 12:52pm

Hi @Vikas.Jain,

Okay, I’ll work on the pattern.

But for a few fields in the text file, half of the field's data comes first, then the second field's data is merged into it, and only after that does the remaining half appear. That is the big problem.

Thanks & Regards,
Neha

NehaGhodki · January 11, 2018, 1:53pm

@sreekanth

I have used screen scraping on PDFs to get the data. Try that.

aamir · January 15, 2018, 2:03pm

Hi,

I want to extract only the tables from my PDF. My PDF has 4 pages: there is some text at the beginning of the first page, and after that there are only tables. How can I extract only the tables?

SHAISTA · February 2, 2018, 10:53am

If I use screen scraping, how will I store the scraped data in an Excel file? Any suggestions?

aswini_sai · February 9, 2018, 3:32pm

Hi Vikas…
What needs to be done in this case?
Please suggest a solution.

Thanks and regards,
Aswini

ash_kettchup · February 21, 2018, 5:51am

Hello @Vikas.Jain,

Your suggestions helped me, thank you.

But in my scope of work I need to extract data from different PDF files.

The issue I am facing is:

For some PDF files the output is good with native scraping, while others require OCR. Is there any way to build a bot that takes any PDF in a folder as input and decides by itself the better way of reading it (native or OCR)?
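One possible heuristic, sketched here in Python rather than UiPath: read the PDF natively first, and fall back to OCR only when the native text is nearly empty or mostly non-alphanumeric. The `needs_ocr` helper and its thresholds (`min_chars`, `min_ratio`) are assumptions for illustration and would need tuning on real documents.

```python
def needs_ocr(native_text, min_chars=50, min_ratio=0.6):
    """Guess whether natively extracted text is too poor to use.

    min_chars and min_ratio are illustrative thresholds, not tested defaults.
    """
    stripped = "".join(native_text.split())  # remove all whitespace
    if len(stripped) < min_chars:
        return True  # scanned PDFs usually yield little or no native text
    readable = sum(ch.isalnum() for ch in stripped)
    # A low share of letters and digits suggests garbled extraction
    return readable / len(stripped) < min_ratio


good = "Invoice Number: INV-1001 Date: 25/04/2017 Address: XYZ, Singapore"
garbled = "#@!%^&*()~" * 10

print(needs_ocr(good))     # False
print(needs_ocr(garbled))  # True
print(needs_ocr(""))       # True
```

In a workflow, the same check could sit between a native read and an OCR read, so OCR only runs for the files that need it.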

arathi · March 12, 2018, 11:34am

Hi,
The following points might be helpful:

1. If the PDFs follow a similar format, you can copy the PDF content and paste it into Excel, then use Excel Application Scope to retrieve a particular value.

2. You can convert a PDF to a text file using Adobe and then manipulate the text. (This is a bit different from the UiPath read option.) The text won't overlap.

vr24 · April 20, 2018, 4:06am

hi
share some sample code, it will be useful

Thanks,
