Not able to retrieve Invoice details using Regex only

Hi,
I am trying to extract invoice details using the RegEx based extractor only. To extract the invoice number, I tried giving an expression and the literal ‘Invoice number’, but the value isn’t getting extracted in Validation Station. How can this issue be resolved?

Hello

Can you provide a sample and the expected output, and tell us about the pattern/information in the text?

Once provided, you will get a Regex pattern from the community.

Cheers

Steve

@Steven_McKeering,
This is the input document


I am trying to configure expressions and extract data from this document using the RegEx extractor. The expressions are as given:

Expected output is the accurate data extraction in Validation station:

If the expression syntax is wrong, please help me correct it.
Hope the information is clear now. Please help me solve this one.

Hello

We need a sample in text form.

How are you getting the PDF into UiPath? I would recommend using the “Read PDF” activity from the UiPath.PDF.Activities package.

Install package

Use Read PDF activity

Save Output

Then upload the file or paste the sample in a reply.
Then we can use Regex :slight_smile:

Cheers

Steve

@Steven_McKeering,
The document is not a PDF; it is in JPG format. Its path is given in the Data Extraction Scope, and the RegEx based extractor is applied to extract the required data from the document. The problem is in executing it.

Hello

Can they send it in another format?

@Steven_McKeering,
Converted it to PDF: Structured document.pdf (236.1 KB)
I think the text clarity is comparatively low in the PDF, and that may affect the accuracy of data extraction.

Hello again

So what part did you need from this text? Please bold it :slight_smile:

Ansari Nagar. New Dethl -
Entrance Examination - 2018
Candidate Profile Candidate ID: 5181103006 Registration No: 2088407
Date o’ Birth: 03 Nov 1995
Category: General
Mothers Name: KUAUM JOSHI
Mentioned Specified Disability: NA
State of Domicile:
i Year of appearan:e: NA
Address: Registration Date: 06102/2016
'Canfiate Name: KAMAL K’SHOR JOSHI
Gender: Male
Fathers Name: BHAGWATI PRASAD JOSHI
PWBO Status: No
Nationality: tNDlAN
Have you appeared at AIMS MBBS Entrance Exam earlier No
Language in which Question paper is desired: English
Contact Details
Address for Permanent:
HOUSE NO • 42. LANE • 4, NEAR NAGAR PALI". Pithoragarh.
Uttarakhand. ln&a. 262501
Mobile No: a HOUSE NO • 42. LANE -4, NEAR NAGAR PALI". Pithoragarh.
Unarakhand. 262501
Quatltleat[on DetallS
Qualityng Exam
Senlor scnoo: Certificate Exam (10+2)
Academic Detans
'Qualifying Exam Status
Appearing •Oualjfying Exam Status
Åppearing
Scoring Scheme E •Mail
'Class 10tn Roll no.
1181441136
Max Marks
0.00 *am Board Name
'Uttarakhand Vidnalaya SnlkSha Parlsnad State Name
Uttaramand
•Marks Obtained
Valid Photo Identity (To be presented in original at the Examination Center along wtth Admit Card)
ID Proof: Adhar Card
payment Details
Moae: Online No
Date: 06102/2018 Place of Issue: India
iTransacuon to: 6110300639 Issue Date: NA Percentage(%)
NA ivatid Till: NA
•Amount: 1500
Examination CiW Opted: Dehradun
UNDERTAKING/DECLARATION: hereby declare that the information fumished by me in the RegistratioNApplication Form is correct and nothing has
concealed, In case any information fumished by me is found to be faLsehncorrecVunuue than i shall be liable to civiVciminal prosecution and my claim to
admission/appointment/registrationf service in the Institute may be cancelledlteminated.
Slgnature ot Candidate Thumb of Candidate
m

@Steven_McKeering,
Date of Birth: 03 Nov 1995
Category: General
Mothers Name: KUAUM JOSHI
ID Proof: Adhar Card
Nationality: INDlAN
Candidate Name: KAMAL KISHOR JOSHI
Gender: Male
Fathers Name: BHAGWATI PRASAD JOSHI

These are the required details

Hello again

I have built a workflow to extract all the required pieces of information but as you mentioned - the OCR engine had some small trouble.

You will likely need to review the method of reading/extracting the data into UiPath.

Main.xaml (13.5 KB)

Regex101 preview links
DOB Link
Category
Mothers, Candidate and Fathers Name
ID Proof
Nationality
Gender
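
For readers without access to the attachment or the preview links, here is a rough sketch in Python of how the requested fields could be pulled from the OCR text posted above. The patterns and the `extract_fields` helper are my own illustration and not the ones in Main.xaml; the label patterns are loosened a little only to cope with the OCR errors visible in the sample.

```python
import re

# Excerpt of the OCR text posted earlier in the thread, OCR errors included.
OCR_TEXT = """\
Date o’ Birth: 03 Nov 1995
Category: General
Mothers Name: KUAUM JOSHI
'Canfiate Name: KAMAL K’SHOR JOSHI
Gender: Male
Fathers Name: BHAGWATI PRASAD JOSHI
Nationality: tNDlAN
ID Proof: Adhar Card
"""

# Hypothetical patterns (assumptions, not the thread's originals).
# Each one has a single capture group holding the field value.
FIELD_PATTERNS = {
    "date_of_birth": r"Date\s+o\S*\s+Birth:\s*(\d{1,2}\s+[A-Za-z]{3}\s+\d{4})",
    "category": r"Category:\s*(.+)",
    "mothers_name": r"Mothers?\s+Name:\s*(.+)",
    "candidate_name": r"Can\w*\s+Name:\s*(.+)",
    "gender": r"Gender:\s*(\w+)",
    "fathers_name": r"Fathers?\s+Name:\s*(.+)",
    "nationality": r"Nationality:\s*(\S+)",
    "id_proof": r"ID\s+Proof:\s*(.+)",
}


def extract_fields(text):
    """Return a dict of field name -> captured value (None when not found)."""
    results = {}
    for name, pattern in FIELD_PATTERNS.items():
        # "." does not match a newline here, so ".+" stops at the end of the line.
        match = re.search(pattern, text, flags=re.IGNORECASE)
        results[name] = match.group(1).strip() if match else None
    return results


fields = extract_fields(OCR_TEXT)
for name, value in fields.items():
    print(f"{name}: {value}")
```

Note that the values come back exactly as the OCR engine read them (for example `tNDlAN`), so the regex only locates the data and does not repair it.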

Hopefully this helps :slight_smile:

Also - if you want to Learn Regex - check out my Regex MegaPost

Cheers

Steve

@Steven_McKeering,
It’s working. Is it possible to replicate this regex in the RegEx based extractor? There I am getting only the date of birth. The RegEx based extractor can be configured for any type of document and helps to extract the required data.
Attaching the screenshot

Hello

That’s good news.

Maybe double check the OCR and play with the Regex Options field…
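
To illustrate what two common options do, here is a small Python sketch. It assumes that `re.IGNORECASE` and `re.DOTALL` behave like the ‘Ignore case’ and ‘Single line’ options (they are the usual counterparts of .NET's `RegexOptions.IgnoreCase` and `RegexOptions.Singleline`); the sample text is made up for the demo.

```python
import re

# Made-up two-line sample, not taken from the real document.
text = "NATIONALITY: Indian\nGender: Male"

# Without "ignore case" the label must match letter for letter.
strict = re.search(r"Nationality:\s*(\w+)", text)
relaxed = re.search(r"Nationality:\s*(\w+)", text, flags=re.IGNORECASE)

# Without "single line", "." stops at the end of the line.
one_line = re.search(r"Nationality:\s*(.+)", text, flags=re.IGNORECASE)

# With "single line", "." also matches newlines, so a greedy ".+"
# keeps going and swallows the following lines as well.
across_lines = re.search(
    r"Nationality:\s*(.+)", text, flags=re.IGNORECASE | re.DOTALL
)

print(strict)                        # None
print(relaxed.group(1))              # Indian
print(one_line.group(1))             # Indian
print(repr(across_lines.group(1)))   # 'Indian\nGender: Male'
```

So if a pattern relies on `.+` or `.*`, it may be worth checking whether the ‘Single line’ option is making it capture more than intended.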

Hopefully this helps :blush:

Cheers

Steve

@Steven_McKeering,
Could you expand on the statement ‘play with the Regex Options field’? It is currently set to ‘Ignore case’ and ‘Single line’.

Hello

Sorry - what I mean is:

  1. You might need to play with the Regex Patterns and options to make them more robust
  2. You might need to review the best way to capture the data. Can you get the data another way? OCR and Regex can be a bad combination as there are lots of variables. It can be too unpredictable what the OCR engine will pick up, and thus hard to use Regex.

Essentially if you are using OCR - it should only be a last resort and you need to make sure your Regex is as robust as possible.
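
As one example of what "robust" could mean here, the sketch below (Python) builds a label pattern that tolerates a few typical OCR character mix-ups such as `i`/`l`/`1` and `o`/`0`. The `CONFUSABLE` table and the `tolerant_label` helper are hypothetical names made up for this illustration, and the table is far from complete.

```python
import re

# Hypothetical table of characters that OCR engines often confuse.
# An assumption for this demo; extend it for your own documents.
CONFUSABLE = {
    "i": "[il1]",
    "l": "[il1]",
    "o": "[o0]",
    "s": "[s5]",
}


def tolerant_label(label):
    """Turn a plain label into a regex that accepts common OCR misreads."""
    parts = []
    for ch in label:
        if ch.isspace():
            parts.append(r"\s+")  # any amount of whitespace between words
        else:
            parts.append(CONFUSABLE.get(ch.lower(), re.escape(ch)))
    return "".join(parts)


# Lines taken from the OCR output posted earlier in the thread.
samples = [
    "Slgnature ot Candidate Thumb of Candidate",
    "Senlor scnoo: Certificate Exam (10+2)",
]

for label, line in zip(["Signature", "Senior"], samples):
    pattern = tolerant_label(label)
    found = re.search(pattern, line, flags=re.IGNORECASE)
    print(label, "->", pattern, "->", found.group(0) if found else None)
```

This only helps with single-character substitutions. Dropped or inserted characters (like "Canfiate" for "Candidate") would need looser patterns or some kind of fuzzy matching, which is part of why OCR plus Regex gets unpredictable.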

Hopefully this clears up what I meant.

Cheers

Steve

Hi @Steven_McKeering,
Thanks for the explanation. It’s clear now.
Attaching another PDF where the values for ‘name’ and other fields are placed below the field names rather than to the right.
visa Template 1-converted.pdf (61.9 KB)
What can be done to extract data here?