Help with Regex for PDF text

I have a PDF file from which I need to extract data.
Can you please help with the regex? I need the values highlighted in bold.

This is the PDF text:
Completion
Date
Preliminary Report
Child Health and Education
Final Report – Complete 05/08/2022
Tracking Screening Report Final Report – Closed
One or more items were not obtained
Child’s Identifying Information
CHILD’S NAME PREFERRED NAME DATE OF BIRTH
Isaiah William Allen Brown 06/07/2006
SEX CHILD’S PERSON ID STUDENT STATE IDENTIFICATION NUMBER (10
Male Female 3033397 DIGITS) 3818004895 N/A

CONSENT PROVIDER ONE NUMBER APPLE HEALTH CORE CONNECTIONS NUMBER
Received 102187765WA
N/A N/A N/A
DOES THE CHILD HAVE LIMITED ENGLISH PRIMARY LANGUAGE IS THE CHILD NATIVE AMERICAN
PROFICIENCY? Yes

Thanks a ton!!

Hi @chauhan.rachita30
Try
(^[a-zA-Z ]*[\d]{2}\/[\d]{2}\/[\d]{4})[\s\S]*^Male Female ([\d]{7})[\s\S]*^Received ([\w]{11})

With assumptions:

  1. The first search item is the customer name (a-z/A-Z and spaces only) followed by a date in dd/MM/yyyy format
  2. The second search item is "Male Female" followed by 7 digits
  3. The third search item is "Received" followed by 11 word characters (digits or letters, as in 102187765WA)

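Note that ^ only matches at the start of each line when the Multiline option is enabled. Of course, you may adjust the pattern to match how your PDF's text is laid out.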
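For readers who want to try the combined pattern outside the workflow, here is a minimal Python sketch. It assumes the trimmed sample text below is representative of the PDF output; the names SAMPLE, PATTERN and extract_fields are made up for illustration, and re.MULTILINE stands in for the Multiline option so that ^ matches at each line start.

```python
import re

# Trimmed copy of the sample text from the question (assumed representative).
SAMPLE = """Child’s Identifying Information
CHILD’S NAME PREFERRED NAME DATE OF BIRTH
Isaiah William Allen Brown 06/07/2006
SEX CHILD’S PERSON ID STUDENT STATE IDENTIFICATION NUMBER (10
Male Female 3033397 DIGITS) 3818004895 N/A

CONSENT PROVIDER ONE NUMBER APPLE HEALTH CORE CONNECTIONS NUMBER
Received 102187765WA
N/A N/A N/A
"""

# Same three capture groups as the pattern above; MULTILINE lets ^ match
# at the beginning of every line instead of only at the start of the text.
PATTERN = re.compile(
    r"(^[a-zA-Z ]*\d{2}/\d{2}/\d{4})"
    r"[\s\S]*^Male Female (\d{7})"
    r"[\s\S]*^Received (\w{11})",
    re.MULTILINE,
)


def extract_fields(text):
    """Return (name and date, person id, provider number), or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return match.groups()


if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

Because all three values come from a single match, a change in any one of the three lines makes the whole pattern fail, which is worth keeping in mind if the PDF layout varies.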

Hi @chauhan.rachita30

inputText = Completion
            Date
            Preliminary Report
            Child Health and Education
            Final Report – Complete 05/08/2022
            Tracking Screening Report Final Report – Closed
            One or more items were not obtained
            Child’s Identifying Information
            CHILD’S NAME PREFERRED NAME DATE OF BIRTH
            Isaiah William Allen Brown 06/07/2006
            SEX CHILD’S PERSON ID STUDENT STATE IDENTIFICATION NUMBER (10
            Male Female 3033397 DIGITS) 3818004895 N/A

            CONSENT PROVIDER ONE NUMBER APPLE HEALTH CORE CONNECTIONS NUMBER
            Received 102187765WA
            N/A N/A N/A
            DOES THE CHILD HAVE LIMITED ENGLISH PRIMARY LANGUAGE IS THE CHILD NATIVE AMERICAN
            PROFICIENCY? Yes

Assign activity -> Name = System.Text.RegularExpressions.Regex.Match(inputText,"(?<=DATE OF BIRTH\s+)[\s\S]*?(?=\s+\d+\/\d+\/\d+)").Value.Trim()

Assign activity -> str_Date = System.Text.RegularExpressions.Regex.Match(inputText,"(\d+\/\d+\/\d+)(?=\s+SEX)").Value.Trim()

Assign activity -> PersonID = System.Text.RegularExpressions.Regex.Match(inputText,"(?<=Female\s+)\d+").Value.Trim()

Assign activity -> ProviderOneNumber = System.Text.RegularExpressions.Regex.Match(inputText,"(?<=Received\s+)[A-Za-z0-9]+").Value.Trim()
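The same four extractions can be sketched in Python for quick testing outside the workflow. This is not a direct port: Python's re module only accepts fixed-width lookbehinds, so the sketch swaps the lookarounds for capture groups. The helper names first_group and extract and the trimmed SAMPLE text are assumptions for illustration.

```python
import re

# Trimmed copy of the sample text from the question (assumed representative).
SAMPLE = """CHILD’S NAME PREFERRED NAME DATE OF BIRTH
Isaiah William Allen Brown 06/07/2006
SEX CHILD’S PERSON ID STUDENT STATE IDENTIFICATION NUMBER (10
Male Female 3033397 DIGITS) 3818004895 N/A

CONSENT PROVIDER ONE NUMBER APPLE HEALTH CORE CONNECTIONS NUMBER
Received 102187765WA
N/A N/A N/A
"""


def first_group(pattern, text):
    """Return capture group 1 of the first match, stripped, or '' if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else ""


def extract(text):
    """Extract the four values, using capture groups in place of lookarounds."""
    return {
        "Name": first_group(r"DATE OF BIRTH\s+([\s\S]*?)\s+\d+/\d+/\d+", text),
        "str_Date": first_group(r"(\d+/\d+/\d+)\s+SEX", text),
        "PersonID": first_group(r"Female\s+(\d+)", text),
        "ProviderOneNumber": first_group(r"Received\s+([A-Za-z0-9]+)", text),
    }


if __name__ == "__main__":
    for key, value in extract(SAMPLE).items():
        print(f"{key}: {value}")
```

Using \s+ between the label and the value keeps each pattern working whether the extracted text uses \n or \r\n line endings.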

Regards


Hello

1st Result:
(?<=DATE OF BIRTH[\r\n]+).+

2nd Result:
(?<=Male Female )\d+

3rd Result:
(?<=Received )\d+[A-Z]+

Cheers

Steve


(?<=DATE OF BIRTH\s+)\S.* → 1st Bold data Extraction

\d+(?=\s+DIGITS) → 2nd Bold data Extraction

(?<=Received\s+)([A-Za-z0-9]+) → 3rd Bold data Extraction

The DOB is coming back blank. Can you help with that? It's the 06/07/2006 value.
Completion
Date
Preliminary Report
Child Health and Education
Final Report – Complete 05/08/2022
Tracking Screening Report Final Report – Closed
One or more items were not obtained
Child’s Identifying Information
CHILD’S NAME PREFERRED NAME DATE OF BIRTH
Isaiah William Allen Brown 06/07/2006
SEX CHILD’S PERSON ID STUDENT STATE
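One possible cause of a blank result is a line-ending mismatch: a pattern that hard-codes \n after a label fails when the extracted text uses \r\n, and the reverse. Whether that is what happened here is an assumption, since the thread does not say which pattern returned the blank. A minimal Python sketch of a date-of-birth extraction that tolerates both styles (extract_dob and the two sample strings are hypothetical names for illustration):

```python
import re

# \s+ absorbs either "\n" or "\r\n" after the label; the date is captured
# from the line that follows, after the name.
DOB_PATTERN = re.compile(r"DATE OF BIRTH\s+.*?(\d{2}/\d{2}/\d{4})")

# For contrast: this variant hard-codes "\n" directly after the label.
FRAGILE_PATTERN = re.compile(r"DATE OF BIRTH\n.*?(\d{2}/\d{2}/\d{4})")


def extract_dob(text, pattern=DOB_PATTERN):
    """Return the date of birth as text, or '' when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else ""


unix_text = (
    "CHILD’S NAME PREFERRED NAME DATE OF BIRTH\n"
    "Isaiah William Allen Brown 06/07/2006\n"
    "SEX CHILD’S PERSON ID STUDENT STATE"
)
windows_text = unix_text.replace("\n", "\r\n")

if __name__ == "__main__":
    print(repr(extract_dob(unix_text)))
    print(repr(extract_dob(windows_text)))
    print(repr(extract_dob(windows_text, FRAGILE_PATTERN)))
```

Printing the input with its escape characters visible (repr in Python) is a quick way to check which line endings the PDF reader actually produced.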
