I’m working on a project to extract the name, email address and other details from the following document (image). Using an OCR function I can get the details into a string, but the problem is how to separate the individual details and fill in a form.
You have to create a parser to get all the data you need, but since the string comes from OCR there will be lots of problems, because it won’t be as well structured as a native form file. Could you share your string with 2 examples?
Hello, as far as I understand your query, there shouldn’t be any problem extracting the data using string manipulation, provided the extracted text contains everything you need. You just need to go through the string manipulation functions.
1 Mr Rena Bitter Mccarthy firstname.lastname@example.org PO BOX 22 Via Cabrira en
Jacksonville FL 32267 (618) 323-3302 United States MALE Wednesday, 07/26/1950 180 175 A+ Mr Rena Bitter Mccarthy Jerusalem Boyle MS 38730 United States (493) 370-5131 YES YES NO YES NO VISsd_4srnP-17517 Wednesday, 07/26/1950 $150.00 Mr Rena Bitter Mccarthy Charge Michael Klecheski axi6D_XV8279151 Wednesday, 07/26/1950 MALE Other XANAX 5 MG 78 $0.90 $70.20
2 Amgad Ilsibai Holt a mga holt11322ers@re/maxin.so 11274 Lakiviiw Main Street i. Hiam Rd Janesville WI 53545 (558) 190-1674 United States MALE 07/13/1965
172 162 AB+ Amgad Ilsibai Holt Mike Greene Chapel Hill NC 27516 United States (191) 443-6922 NO NO NO NO N O VJSsd_4srnP-19219 07/13/1965
$250.00 Amgad Ilsibai Holt Dianne M Smith axi6D_XV8-264146 07/13/1965 MALE Other PHENTERMINE 5 MG 50 $2.75 $137.50 $20.00 $157.50
3 Clay Hamilton Myers clayers email@example.com 11930 Rt NN Patricia Ct.
Inez TX 77968 (298) 371-8810 United States FEMALE Thursday, 01/22/1959 173 182 B+ Clay Hamilton Myers Laura Martin Natchitoches
LA 71458 United States (691) 161-2965 NO NO NO NO NO
VJSsd_4srnP-28432 Thursday, 01/22/1959 $250.00 Clay Hamilton Myers Darrell Taylor axi6D_XV8-262225 Thursday, 01/22/1959 MALE Other PHENTERMINE 37.5 MG 50 $5.37
$268.50 $20.00 $288.50
4 Pasqual Duarte Wade pasquate wade firstname.lastname@example.org 12225 Criscint Street CenTer Friiway Dr. Eastland TX 76448 (232) 823-2201 United States MALE
Friday, 09/27/1946 170 170 AB+ Pasqual Duarte Wade Betty Spallen Akron IA 51001 United States (138) 057-4323 YES YES NO YES NO
VJSsd_4srnP-30487 Friday, 09/27/1946 $250.00 Pasqual Duarte Wade Gary David Fisher axi 6D_XV8-285510 Friday, 09/27/1946 MALE Master Card XANAX 2 MG
50 $4.73 $236.50 $20.00 $256.50
COL John Rucks Grant email@example.com 12581 N Skidmori St Wintworth Avi
Dallas TX 75326 (352) 267-6855 United States MALE Tuesday, 01/03/1961 175 182 AB+ COL John Rucks Grant Charles Deng East Wilton ME 4234 United States (713) 638-6107 NO NO NO YES NO VSsd_4srnP-37993 Tuesday, 01/03/1961 $300.00 COL John Rucks Grant Christopher Quinlivan axi6D_XV8-266820 Tuesday, 01/03/1961 FEMALE American Express PHENTERMINE 1 MG 50 $3.94 $197.00 $20.00 $217.00
Carolyn Dirrim Campbell carolyampbe116456@cedarfa. no 13156 Statton Main Street 6TH AViNui Mamou LA 70554 (559) 382-9744 United States FEMAEL 05/27/1956
171 170 O+ Carolyn Dirrim Campbell Charies Stahl Creedmoor NC 27522 United States (859) 625-2149 NO NO NO YES NO VJSsd_4srnP-10127 05/27/1956 $200.00 Carolyn Dirrim Campbell Patricia Register axi6D_XV8-274122 05/27/1956 MALE Discover VALIUM 37.5 MG 78 $4.07 $317.46 $20.00 $337.46
7 Miss. John E. Buhler DRIVI Ponder TX
10/16/1958 171 173
49849 United States 15897 10/16/1958 $350.00
10/16/1958 FEMALE $268.27 firstname.lastname@example.org 13711 W. 77 TH PLWINDSOR Kiip 76259 (660) 655-5383 United States FEMALE AB+ Miss. John E. Buhler Binta Kawu Ishpeming MI (794) 745-9723 NO NO NO YES NO VISsd_4srnp. Miss. John E. Buhler Usha Pitts axi6D_XV8-257237 Other VALIUM 2 MG 61 $4.07
8 Samuel Adams Rosales samueosales email@example.com 14489 Alliston North Road 67th Avi Ni
Ah Gwah Ching MN 56430 (517) 824-1725 United States FEMALE Friday, 07/27/1962 175 174 0+ Samuel Adams Rosales Dick
The problem I’m facing is how to extract details such as the name and email address from the string.
I’m getting many images like this (none of them standardized), which I have to process into the application form mentioned above.
Like I said, it won’t be easy to parse that data. You have to create some methods that extract text based on “pointers” and let you retrieve the information you need. Take the email as an example: you know it contains “@”, so you can use that to locate it in the text. Once you know where the email is, you can get the address, which is the next field, and so on. And remember to set aside extra hours to test it.
To get an email you can use something like this, where ocrText is the OCR output: strArr = ocrText.Split("@"c). Then, to get the first part: textArr = strArr(0).Split(" "c), emailName = textArr(textArr.Length - 1). To get the second part of the email: textArr = strArr(1).Split(" "c), emailDomain = textArr(0). The full address is emailName & "@" & emailDomain.
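For anyone who wants to try the split idea outside a workflow, here is a minimal sketch of the same steps. It is written in Python rather than VB.NET purely for illustration, and the function name extract_email_by_split is made up for this example. It only finds the first “@” and assumes there is no space inside the address, which OCR output does not always guarantee.

```python
from typing import Optional


def extract_email_by_split(text: str) -> Optional[str]:
    """Rebuild the first e-mail-like token by splitting the text on '@'."""
    parts = text.split("@")
    if len(parts) < 2:
        return None  # no '@' anywhere in the text

    # Words before the first '@' and words after it (up to the next '@').
    left_tokens = parts[0].split()
    right_tokens = parts[1].split()
    if not left_tokens or not right_tokens:
        return None  # nothing usable on one side of the '@'

    # Last word before '@' is the local part, first word after is the domain.
    return f"{left_tokens[-1]}@{right_tokens[0]}"


sample = "3 Clay Hamilton Myers clayers email@example.com 11930 Rt NN Patricia Ct."
print(extract_email_by_split(sample))
```

On a line where OCR has inserted spaces or stray characters into the address, this will return only a fragment, so the result still needs a sanity check.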
@charith_wickramasing, is it possible to explain what exactly those values in the output are? If the output sticks to a consistent format, we might be able to use Regex to solve it. But as @rado mentioned, this is not an easy one to solve. What do you think about using Regex, @rado?
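As a rough idea of what the Regex route could look like, here is a small sketch in Python (used only for illustration; the same patterns should carry over to the .NET Regex class with little change). The patterns are assumptions based on the sample records posted above: phone numbers shaped like (298) 371-8810, dates like 01/22/1959 and amounts like $250.00. The names find_fields, EMAIL_RE and so on are made up for this example.

```python
import re

# Patterns guessed from the sample records in this thread, not from a spec.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
MONEY_RE = re.compile(r"\$\d+\.\d{2}")


def find_fields(text: str) -> dict:
    """Collect every match of each pattern, in the order it appears."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "dates": DATE_RE.findall(text),
        "amounts": MONEY_RE.findall(text),
    }


sample = (
    "3 Clay Hamilton Myers clayers email@example.com 11930 Rt NN Patricia Ct. "
    "Inez TX 77968 (298) 371-8810 United States FEMALE Thursday, 01/22/1959 "
    "173 182 B+ $250.00"
)
for name, values in find_fields(sample).items():
    print(name, values)
```

Fields without a recognizable shape (names, street addresses) cannot be picked up this way and would still have to be located relative to the fields that can.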
By extracting data from the string I meant every possible method :D, and of course regex is in that group. It would also be nice to create some custom activities that extract the data into an easily readable format, so a custom DTO is, in my opinion, a must.
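To make the DTO idea a bit more concrete, here is a minimal sketch, again in Python for illustration only. FormRecord, parse_record and needs_review are hypothetical names, and the patterns are guesses based on the sample records in this thread. The point is that each OCR record ends up in one typed object, and records with missing fields get flagged for a manual check instead of being pushed into the form.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Patterns guessed from the sample records in this thread.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


@dataclass
class FormRecord:
    """Hypothetical DTO holding whatever could be recovered from one record."""
    raw: str
    email: Optional[str] = None
    phones: List[str] = field(default_factory=list)
    dates: List[str] = field(default_factory=list)

    def needs_review(self) -> bool:
        # Flag the record if any of the pattern-based fields is missing.
        return self.email is None or not self.phones or not self.dates


def parse_record(raw: str) -> FormRecord:
    """Fill a FormRecord from the raw OCR text of a single record."""
    email_match = EMAIL_RE.search(raw)
    return FormRecord(
        raw=raw,
        email=email_match.group(0) if email_match else None,
        phones=PHONE_RE.findall(raw),
        dates=DATE_RE.findall(raw),
    )


garbled = (
    "2 Amgad Ilsibai Holt a mga holt11322ers@re/maxin.so 11274 Lakiviiw "
    "Main Street Janesville WI 53545 (558) 190-1674 United States MALE 07/13/1965"
)
record = parse_record(garbled)
print(record.email, record.phones, record.dates, record.needs_review())
```

In record 2 from the samples the OCR has damaged the email address, so the pattern does not match and the record is flagged for review rather than filled in with a wrong value.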