Regex issue: extracting text with named groups

I am using regex grouping to extract data from text that I got from the UiPath PDF activity. My regex skills are still developing, and I am having trouble extracting some of the data.
I am using groups because sometimes there are 10 patient records instead of 1.
Below is the string (dummy data):

FCO-LetterRef# 489900-22PLEASE FAX TO 725-248-3602
CONFIDENTIAL
REVIEW DETAIL
2/25/2020            Provider Number/Name: 139635109/DALAS CHLLLLENS HOSPITAL     Review ID:    66626
Review Type : LRG
Patient 625878043  TOPEZ, BIANNEY DOB:6/1/2009 SEX: Male Patient Account #:
ID/Name:
Service From Date: Service Thru Date: Est Overpayment
08/06/2016 08/11/2016 Claim Number: 200040040501923891740652 (Underpayment):$20,555.12
Original: Revised: Original: Revised: Original: Revised:

Data needed:
Review ID = 66626
Review Type = LRG
Patient ID = 625878043
Patient Name = TOPEZ, BIANNEY
DOB = 6/1/2009
Service From = 08/06/2016
Service Thru = 08/11/2016
Claim Number = 200040040501923891740652
Est Overpayment = $20,555.12

I was only able to extract the fields up to DOB successfully, if I got it right:

Review ID:[\s]+(?<review_id>\d*)[\s]+Review Type[\s]:[\s]+(?<review_type>.*)[\s]+Patient[\s]+(?<id>\d*) (?<name>.*)[\s]DOB:+(?<dob>.*)[\s]SEX:.*

Chaining the above with the claim number and the rest of the data is not working for me.

This data is NOT PHI; it is dummy data.

@Charbel1

A common technique is to first extract the block of interest, then extract the details from that block. This helps to reduce the complexity.

Block extraction:

Review ID[\s\S]*?\$[\d\,\.]+

You can then handle differences such as overpayment vs. underpayment later.
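The two-step idea can be sketched as follows. This is only a minimal illustration in Python, not UiPath/VB code: the block pattern is the one from this reply, while the per-field patterns, the `FIELD_RES` dictionary and the `extract_records` helper are assumptions made up for the example. Because every block is processed on its own, the same code handles 1 or 10 record sets.

```python
import re

SAMPLE = """FCO-LetterRef# 489900-22PLEASE FAX TO 725-248-3602
CONFIDENTIAL
REVIEW DETAIL
2/25/2020            Provider Number/Name: 139635109/DALAS CHLLLLENS HOSPITAL     Review ID:    66626
Review Type : LRG
Patient 625878043  TOPEZ, BIANNEY DOB:6/1/2009 SEX: Male Patient Account #:
ID/Name:
Service From Date: Service Thru Date: Est Overpayment
08/06/2016 08/11/2016 Claim Number: 200040040501923891740652 (Underpayment):$20,555.12
Original: Revised: Original: Revised: Original: Revised:
"""

# Step 1: block pattern from the reply (lazy match up to the first dollar amount).
BLOCK_RE = re.compile(r"Review ID[\s\S]*?\$[\d,.]+")

# Step 2: one small pattern per field (hypothetical patterns, one capture group each).
FIELD_RES = {
    "review_id": r"Review ID:\s*(\d+)",
    "review_type": r"Review Type\s*:\s*(\S+)",
    "patient_id": r"Patient\s+(\d+)",
    "patient_name": r"Patient\s+\d+\s+(.+?)\s+DOB:",
    "dob": r"DOB:\s*(\d{1,2}/\d{1,2}/\d{4})",
    "service_from": r"(\d{2}/\d{2}/\d{4})\s+\d{2}/\d{2}/\d{4}\s+Claim Number:",
    "service_thru": r"(\d{2}/\d{2}/\d{4})\s+Claim Number:",
    "claim_number": r"Claim Number:\s*(\d+)",
    "amount": r"(\$[\d,.]+)",
}


def extract_records(text):
    """Return one dict per 'Review ID ... $amount' block found in text."""
    records = []
    for block in BLOCK_RE.findall(text):
        record = {}
        for field, pattern in FIELD_RES.items():
            match = re.search(pattern, block)
            # A missing field becomes None instead of failing the whole record.
            record[field] = match.group(1) if match else None
        records.append(record)
    return records


if __name__ == "__main__":
    # Two copies of the sample simulate a PDF with two record sets.
    for record in extract_records(SAMPLE + SAMPLE):
        print(record)
```

A side benefit of this split is that a field that cannot be found only yields an empty value for that field, rather than making one long pattern fail as a whole.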


Sometimes I get multiple sets of data, hence the grouping. Also, when I try to get the claim number after [\s]SEX:.* , the whole pattern fails, but I don't know why.
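One likely cause, offered as an assumption since the failing pattern itself was not posted: by default `.` does not match a newline, so `SEX:.*` stops at the end of the "Patient" line, while "Claim Number:" sits three lines further down. The small Python sketch below shows the difference on a shortened copy of the sample; .NET behaves the same way unless `RegexOptions.Singleline` is set.

```python
import re

text = (
    "Patient 625878043  TOPEZ, BIANNEY DOB:6/1/2009 SEX: Male Patient Account #:\n"
    "ID/Name:\n"
    "Service From Date: Service Thru Date: Est Overpayment\n"
    "08/06/2016 08/11/2016 Claim Number: 200040040501923891740652 (Underpayment):$20,555.12\n"
)

# '.' stops at the line break, so this never reaches "Claim Number:".
same_line = re.search(r"SEX:.*Claim Number:\s*(\d+)", text)

# '[\s\S]*?' also matches line breaks, so the pattern can cross lines.
cross_line = re.search(r"SEX:[\s\S]*?Claim Number:\s*(\d+)", text)

print(same_line)            # no match object is returned here
print(cross_line.group(1))  # the claim number captured across lines
```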

let’s try to stay in sync for the discussion. Preprocessing with a block extraction is a good option to reduce such issues and sort it out before processing the details.

So would this approach be an option for you or not?

Will that work if I get multiple sets of the data? The PDF can contain 1 set of data, or 10 or more, and it is hard to know where it ends, but each set will follow the original text pattern, like this:

FCO-LetterRef# 489900-22PLEASE FAX TO 725-248-3602
CONFIDENTIAL
REVIEW DETAIL
2/25/2020            Provider Number/Name: 139635109/DALAS CHLLLLENS HOSPITAL     Review ID:    66626
Review Type : LRG
Patient 625878043  TOPEZ, BIANNEY DOB:6/1/2009 SEX: Male Patient Account #:
ID/Name:
Service From Date: Service Thru Date: Est Overpayment
08/06/2016 08/11/2016 Claim Number: 200040040501923891740652 (Underpayment):$20,555.12
Original: Revised: Original: Revised: Original: Revised:

It seems that VB regex is slightly different from most other flavors. It is about 99% the same, but in some cases it does fail. The working regex for the above is this:

Review ID:[\s]+(?<review_id>\d*)[\s]+Review Type[\s]:[\s]+(?<review_type>.*)[\s]+Patient[\s]+(?<id>\d*) (?<name>.*)[\s]DOB:+(?<dob>.*)SEX:.*\s.*\s.*Est Overpayment\s+(?<service_from>\d{2}\/\d{2}\/\d{4})\s+(?<service_thru>\d{2}\/\d{2}\/\d{4})\s+Claim Number:\s*(?<claim_number>.\d*).*:(?<underpayment>\$.*)
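For reference, here is a rough Python adaptation of the single-pattern approach that returns every record set in the text. It is a sketch, not a literal translation: Python writes named groups as `(?P<name>...)` where .NET uses `(?<name>...)`, the greedy `.*` pieces are tightened so that names and dates come back without stray spaces, and `RECORD_RE` and `parse_reviews` are names invented for the example. One known flavor difference that may be relevant: in .NET `.` matches every character except `\n`, so if the PDF text uses `\r\n` line endings, a `.*` can swallow the `\r`; using `\s+` between fields avoids depending on the line-ending style.

```python
import re

SAMPLE = """FCO-LetterRef# 489900-22PLEASE FAX TO 725-248-3602
CONFIDENTIAL
REVIEW DETAIL
2/25/2020            Provider Number/Name: 139635109/DALAS CHLLLLENS HOSPITAL     Review ID:    66626
Review Type : LRG
Patient 625878043  TOPEZ, BIANNEY DOB:6/1/2009 SEX: Male Patient Account #:
ID/Name:
Service From Date: Service Thru Date: Est Overpayment
08/06/2016 08/11/2016 Claim Number: 200040040501923891740652 (Underpayment):$20,555.12
Original: Revised: Original: Revised: Original: Revised:
"""

# Same group names as the pattern above, written in Python's named-group syntax.
RECORD_RE = re.compile(
    r"Review ID:\s+(?P<review_id>\d+)\s+"
    r"Review Type\s*:\s+(?P<review_type>\S+)\s+"
    r"Patient\s+(?P<id>\d+)\s+(?P<name>.+?)\s+DOB:\s*(?P<dob>\S+)\s+SEX:"
    r"[\s\S]*?Est Overpayment\s+"  # skip the lines between SEX: and the dates
    r"(?P<service_from>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<service_thru>\d{2}/\d{2}/\d{4})\s+"
    r"Claim Number:\s*(?P<claim_number>\d+)[^\n$]*"
    r"(?P<underpayment>\$[\d,.]+)"
)


def parse_reviews(text):
    """Return one dict of named groups per record set found in text."""
    return [match.groupdict() for match in RECORD_RE.finditer(text)]


if __name__ == "__main__":
    # Three copies of the sample simulate a PDF with three record sets.
    for row in parse_reviews(SAMPLE * 3):
        print(row)
```

In .NET, the equivalent of `finditer` would be iterating over the result of `Regex.Matches` and reading each match's named groups.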
