How can I write the data I have extracted from a PDF using OCR into an Excel file?

This is the data extracted from the PDF and stored in a text file. I am not able to get the data into a separate column for each name in the Excel file.
xxxx xxxx xxxx N.
S/o. xxxxxxx,
3/109, xxxxxxxx Post,
xxxxxxxxx District.
Roll No. 2569/2005
DOE : 28-12-2005
DOB: BG:
xxxx xxxx xxxx N.
S/o. xxxx xxxx xxxx,
16/6, xxxx xxxx xxxx,
xxxx.
Ph xxxx-xxxxxx
Roll No. 218/72
DOE :15-11-1972
DOB: BG:

@Renejit_Vs

Please provide the output you want to extract.

Hi @Renejit_Vs

Provide sample input and expected output. I can help you out with regular expressions.

Regards

Input
Abdul Raheem S.
42C, Colony, West Cross Street,
Ramavarmapuram, Nagercoil.
Cell : 9489320924
Roll No. 607/78
DOE : 20-12-1978
DOB : 21-05-1954
BG:

@Renejit_Vs

Do you want to extract the whole text? Please specify.

[A-Za-z]+[\s\S]*?BG\:

You can use the above pattern in Find Matching Patterns and run a For Each loop for that and print the currentItem.

Regards
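
As a rough illustration of what the pattern above does, here is a minimal Python sketch (outside UiPath, so only the regex itself carries over). It assumes every record starts with a letter and ends at `BG:`, as in the sample from the question; `split_records` is a hypothetical helper name.

```python
import re

# Sample OCR text copied from the question: two records, back to back.
ocr_text = """xxxx xxxx xxxx N.
S/o. xxxxxxx,
3/109, xxxxxxxx Post,
xxxxxxxxx District.
Roll No. 2569/2005
DOE : 28-12-2005
DOB: BG:
xxxx xxxx xxxx N.
S/o. xxxx xxxx xxxx,
16/6, xxxx xxxx xxxx,
xxxx.
Ph xxxx-xxxxxx
Roll No. 218/72
DOE :15-11-1972
DOB: BG:
"""

# Same pattern as in the post: start at a run of letters, then lazily
# consume anything (including line breaks) up to the next "BG:".
RECORD_PATTERN = re.compile(r"[A-Za-z]+[\s\S]*?BG:")


def split_records(text):
    """Return one string per record found in the OCR text."""
    return RECORD_PATTERN.findall(text)


records = split_records(ocr_text)
for record in records:
    print(record)
    print("-" * 20)  # visual separator between records
```

Each element of `records` plays the role of `currentItem` in the For Each loop. If a record's blood group is filled in, the value after `BG:` is not captured by this pattern.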

Abdul Raheem S.
42C, Colony, West Cross Street,
Ramavarmapuram, Nagercoil.
Cell : 9489320924
Roll No. 607/78
DOE : 20-12-1978
DOB : 21-05-1954
BG:
This is the input in the Excel file. I want the values in separate cells, like Name in one cell, S/O in one cell, and so on. I have around 1000 of these records, one after the other.
The output I would like to have:
PDF.xlsx (8.5 KB)

Hi Friend,

You can use a regular expression for each value to extract the data from your PDF file. Assign each value to a separate variable and then write them to the Excel file under their respective columns.

For example, first use the Read PDF Text activity and assign its output to a variable, then use an Assign activity with the expression below.

System.Text.RegularExpressions.Regex.Match(PDF,"(?<=Roll No. ).*(?=\n)").ToString.Trim

Hope it works!
Let me know if you need more help.

Thanks.
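
The "one regex per value, one column per value" idea can be sketched in Python as follows. This is only an illustration under assumptions: the column names (`Name`, `Roll No`, `DOE`, `DOB`, `Cell`) and the helpers `parse_record` and `to_csv` are hypothetical, the name is assumed to be the first line of a record, and the result is CSV text (which Excel can open) rather than a native .xlsx file.

```python
import csv
import io
import re

# One record, copied from the sample input in the thread.
record = """Abdul Raheem S.
42C, Colony, West Cross Street,
Ramavarmapuram, Nagercoil.
Cell : 9489320924
Roll No. 607/78
DOE : 20-12-1978
DOB : 21-05-1954
BG:"""

# Assumed column names, each mapped to a pattern with one capture group.
FIELD_PATTERNS = {
    "Roll No": r"Roll No\.\s*(\d+/\d+)",
    "DOE": r"DOE\s*:\s*(\d{2}-\d{2}-\d{4})",
    "DOB": r"DOB\s*:\s*(\d{2}-\d{2}-\d{4})",
    "Cell": r"\b(?:Cell|Ph)\b\s*:?\s*([\d-]+)",
}
COLUMNS = ["Name"] + list(FIELD_PATTERNS)


def parse_record(text):
    """Turn one record into a dict with one entry per column."""
    lines = [line.strip() for line in text.strip().splitlines()]
    row = {"Name": lines[0]}  # assumption: the name is the first line
    for column, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        row[column] = match.group(1) if match else ""  # blank if missing
    return row


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = parse_record(record)
print(to_csv([row]))
```

Fields that are empty in the source, such as a blank `DOB:`, simply come out as empty cells. The address lines are not handled here because their number varies between records.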

The regex is returning the DOE along with the Roll No.

On my side it returns only the Roll No. value.


I am getting this error.

In this expression, replace PDF with your own variable, the one you created as the output of the Read PDF Text activity.
System.Text.RegularExpressions.Regex.Match(PDF,"(?<=Roll No. ).*(?=\n)").ToString.Trim

Can you show me the expression you are using? And also show me the read pdf text activity variable which you are passing in its Properties panel.


System.Text.RegularExpressions.Regex.Match(input,“(?<=Roll No. ).*(?=\n)”).ToString.Trim

Ok. Also show me the assign expression

System.Text.RegularExpressions.Regex.Match(input,“(?<=Roll No. ).*(?=\n)”).ToString.Trim

Hi @Renejit_Vs

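Retype the double quotes in the expression. Copying it from the forum can turn them into curly quotes (“ ”), which the expression editor may not accept.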

This is the output. Is there a way I can get all the Roll No. values instead of just one? I am also getting the DOE along with the Roll No. There are multiple roll numbers in this single file.

Roll No. 2569/2005 DOE : 28-12-2005

Roll No. 218/72 DOE :15-11-1972 and so on.
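
One possible way to address both points, sketched in Python under assumptions: instead of matching "everything up to the line break", the pattern captures only the `digits/digits` part after `Roll No.`, so it stops before `DOE` even when the line break is missing, and a find-all call returns every occurrence rather than the first. The helper name `all_roll_numbers` is hypothetical, and the roll number format is assumed from the samples in this thread.

```python
import re

# Output as reported above: roll number and DOE ended up on the same line.
text = """Roll No. 2569/2005 DOE : 28-12-2005
Roll No. 218/72 DOE :15-11-1972
"""

# Capture only digits, a slash, and digits after "Roll No.".
ROLL_PATTERN = re.compile(r"Roll No\.\s*(\d+/\d+)")


def all_roll_numbers(source):
    """Return every roll number in the text, in order of appearance."""
    return ROLL_PATTERN.findall(source)


print(all_roll_numbers(text))
```

In .NET, `Regex.Match` returns only the first match, while `Regex.Matches` returns all of them, so the equivalent in an Assign activity would loop over the result of `Regex.Matches`.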