Regex grouping when there are multiple patterns?

I have this text that I am getting from a PDF using UiPath PDF extraction.

Problem
Usually I get each pattern together, meaning the names are together and the review messages are together. As you can see, the text below comes from a table in the PDF, and the name “JASON DIANNA JOHN” is split across lines, unlike “JAY, BAY”. The same goes for “Review Message(s):”: instead of the text being together, like “Documentation does not support and billing error of inpatient could have been billed as outpatient”, it is broken up by other data I need. The DOB also shows up in 2 different locations.

Data I need
Example, from the first row of data:
Patient Id: 521178201
Name: JASON DIANNA JOHN
DOB: 10/5/2002
Review message: Documentation does not support medical necessity
Service From Date: 02/09/2017
Service Thru Date: 05/30/2015
Claim number: 100050010111901715802222

This is what I started with. I am getting stuck at the name, since the name is broken up:

Patient ID \/ Name:\s+(?<id>\d*) (?<name1>.*)\s+.*DOB:\s(?<dob>\d+\/\d+\/\d{4})

So how do I get the broken-up data? By capturing name1, name2, review msg1 and msg2, and then joining them? I do need this as groups rather than as individual matches.

Patient ID / Name: 521178201 JASON, Sex: Female Patient Account # :
DIANNA JOHN DOB: 10/5/2002
Service From Service Thru  Claim Number: Review Message(s): Documentation does not support
Date: 02/09/2017 Date: 100050010111901715802222 medical necessity
02/10/2020


Patient ID / Name: 310976610 JAY, BAY Sex: Female Patient Account # :
DUAA DOB: 7/2/2007
Service From Service Thru  Claim Number: Review Message(s): Documentation does not support
Date: 02/10/2013 Date: 10006004125561531161888 medical necessity
05/30/2015


Patient ID / Name: 666310555 Anie, Baby DOB: 4/15/2016 Sex: Male Patient Account # :
Service From Service Thru  Claim Number: Review Message(s): Documentation does not support
Date: 03/10/2010 Date: 100055530201666962948521 medical necessity
03/22/2014

Patient ID / Name: 222333136 Anu, Json DOB: 1/15/2012 Sex: Female Patient Account # :
Service From Service Thru  Claim Number: Review Message(s): Documentation does not support
Date: 05/04/2011 Date: 100020030201504215275522 and billing error of inpatient could have
11/22/2012 been billed as outpatient.

All data is dummy data, not real data.
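One possible way to do the name1/name2 and msg1/msg2 capture-and-join is sketched below in Python rather than as a UiPath workflow. It is only a sketch under assumptions: the pattern names (`HEADER`, `NAME2`, `DOB`, `SERVICE`) and the functions `parse_record` and `parse_records` are made up for this example, and it assumes every record follows the line layout of the dummy data above (an optional wrapped name/DOB line, then exactly three service lines). Python writes named groups as `(?P<name>...)`, where .NET uses `(?<name>...)`.

```python
import re

# Two of the dummy records from the post: one with the name and DOB wrapped
# onto a second line, one with a review message spread over three lines.
SAMPLE = """\
Patient ID / Name: 521178201 JASON, Sex: Female Patient Account # :
DIANNA JOHN DOB: 10/5/2002
Service From Service Thru  Claim Number: Review Message(s): Documentation does not support
Date: 02/09/2017 Date: 100050010111901715802222 medical necessity
02/10/2020

Patient ID / Name: 222333136 Anu, Json DOB: 1/15/2012 Sex: Female Patient Account # :
Service From Service Thru  Claim Number: Review Message(s): Documentation does not support
Date: 05/04/2011 Date: 100020030201504215275522 and billing error of inpatient could have
11/22/2012 been billed as outpatient.
"""

DATE = r"\d{1,2}/\d{1,2}/\d{4}"

# Header line: the ID, then the name up to whichever label comes first.
HEADER = re.compile(
    r"Patient ID / Name:\s+(?P<id>\d+)\s+(?P<name1>.*?)\s*(?:DOB:|Sex:)"
)
# Wrapped name: colon-free text at the start of a line, directly before "DOB:".
NAME2 = re.compile(r"\n(?P<name2>[^\n:]*?)[ \t]*DOB:")
DOB = re.compile(rf"DOB:\s*(?P<dob>{DATE})")
# Three physical lines: message start, then from-date + claim + message middle,
# then thru-date + optional message end.
SERVICE = re.compile(
    r"Review Message\(s\):[ \t]*(?P<msg1>.*)\n"
    rf"Date:\s*(?P<from_date>{DATE})\s+Date:\s*(?P<claim>\d+)[ \t]*(?P<msg2>.*)\n"
    rf"(?P<thru_date>{DATE})[ \t]*(?P<msg3>.*)"
)


def parse_record(block):
    """Return one record as a dict, or None if the block does not fit the layout."""
    header = HEADER.search(block)
    dob = DOB.search(block)
    service = SERVICE.search(block)
    if not (header and dob and service):
        return None
    wrapped = NAME2.search(block)
    name_parts = [header["name1"], wrapped["name2"] if wrapped else ""]
    msg_parts = [service["msg1"], service["msg2"], service["msg3"]]
    return {
        "id": header["id"],
        # Join the captured pieces, skipping the ones that came back empty.
        "name": " ".join(p.strip() for p in name_parts if p.strip()),
        "dob": dob["dob"],
        "review_message": " ".join(p.strip() for p in msg_parts if p.strip()),
        "service_from": service["from_date"],
        "service_thru": service["thru_date"],
        "claim_number": service["claim"],
    }


def parse_records(text):
    """Split the text at each "Patient ID / Name:" label and parse every block."""
    blocks = re.split(r"(?=Patient ID / Name:)", text)
    return [rec for rec in map(parse_record, blocks) if rec]


for record in parse_records(SAMPLE):
    print(record)
```

Splitting into one block per patient first keeps each pattern small, and the DOB can then be searched for anywhere in the block, so it does not matter which of the two locations it lands in. Commas are kept as they appear in the text, so the first name comes back as "JASON, DIANNA JOHN".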


Hey @Jay_Chacko

Sorry, I’m not getting the requirement here.

I can see above that the name is already split, but I don’t understand what you want to do with the name, my bad. Please explain.

Thanks
#nK

Hi @Jay_Chacko,

Have you checked the extraction with PreserveFormat set to True?

@Jay_Chacko Which OCR are you using? Did you try Tesseract OCR?

I just need to extract all the data via regex grouping. There is nothing wrong with the text returned from the PDF read or with its settings.

@Jay_Chacko,

We wanted to know whether you have checked both text-retrieval formats of the Read PDF Text activity: with PreserveFormat set to True and with PreserveFormat set to False.

If you haven’t checked yet, you could try it; it might return the text in a format that is better suited to regex extraction.

I see. I never did that because it introduces weird extra spaces and characters, but I just tested it and the text seems to be broken up even more: easier to read visually, but harder for regex, I am assuming.

Patient ID / Name:       222333136 Anu, Json              DOB: 1/15/2012         Sex: Female            Patient Account # :
Service From             Service Thru       Claim Number:                               Review Message(s): The documentation does not support
Date: 05/04/2011         Date:             100020030201504215275522                     medical necessity and billing error of inpatient could have
                         11/22/2012                                                     been billed as outpatient.
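For what it is worth, the extra spaces in this layout can also act as cell separators. The Python sketch below is an illustration only: `split_cells` and `review_message` are hypothetical helper names, and it assumes that cells are always separated by two or more spaces, that the review message always sits in the last cell of a line, and that the text passed in holds a single record.

```python
import re

# The layout-preserved record from the post (gap widths are approximate).
PRESERVED = """\
Patient ID / Name:       222333136 Anu, Json              DOB: 1/15/2012         Sex: Female            Patient Account # :
Service From             Service Thru       Claim Number:                               Review Message(s): The documentation does not support
Date: 05/04/2011         Date:             100020030201504215275522                     medical necessity and billing error of inpatient could have
                         11/22/2012                                                     been billed as outpatient.
"""

LABEL = "Review Message(s):"


def split_cells(line):
    """Split one layout-preserved line into cells on runs of 2+ spaces."""
    return [cell for cell in re.split(r" {2,}", line.strip()) if cell]


def review_message(record_text):
    """Join the message pieces found in the last cell of each line of one record."""
    pieces = []
    started = False
    for line in record_text.splitlines():
        cells = split_cells(line)
        if not cells:
            continue
        last = cells[-1]
        if last.startswith(LABEL):
            # First piece: drop the label itself.
            started = True
            last = last[len(LABEL):].strip()
        elif not started or not re.search(r"[A-Za-z]", last):
            # Before the label, or a cell holding only a date or claim number.
            continue
        if last:
            pieces.append(last)
    return " ".join(pieces)


for line in PRESERVED.splitlines():
    print(split_cells(line))
print(review_message(PRESERVED))
```

Whether this is easier than working on the non-preserved text depends on how consistently the columns come out of the PDF, which the single record shown here cannot confirm.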

@Jay_Chacko, do we see different formats like the above, or is all the data in the same format?