Regex Based Extractor Not Working

chintan.patel · February 16, 2021, 6:28am

Hi all,

Firstly, I am new to UiPath, so please pardon me if I am not using the right terminology.

We have just deployed a project to parse PDF invoices via the data extractor, which includes a position-based extractor and a regex-based extractor. It worked well until a few of the invoice layouts were altered, so we had to add more regex patterns. For some reason, the regex pattern used with “UiPath.IntelligentOCR.Activities.DataExtractation.RegexBasedExtractor” is not working as expected.

For example, the pattern below works fine against the test text in the regex editor, but not against the actual PDF.

A snapshot of the PDF is below.

Any help is much appreciated.

Thanks,
Chintan

NIVED_NAMBIAR · February 16, 2021, 6:44am

Hi @chintan.patel
I think the problem may be related to spacing.

Check that too.

chintan.patel · February 16, 2021, 6:48am

Hello @NIVED_NAMBIAR, yes, I have already checked the spacing, and the problem is not related to spaces. I guess it may be related to the text version of the PDF, which does not match what is displayed in the PDF.

i.e. text version of PDF

copy_writes · February 16, 2021, 6:49am

What @NIVED_NAMBIAR said is correct; it may be a spacing problem. Try this lookbehind regex: (?<= Due $)\d+

chintan.patel · February 16, 2021, 7:17am

Hi @copy_writes - If I understood you correctly, I should use a lookbehind, so I have changed my regex pattern to ((?<= Due\s*$)((\d+.\d{2}))), but it still didn’t work.

For some reason, it is picking up every value in the PDF that contains digits, i.e. the validation station displays the result below after the data extraction activity.
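
One thing worth checking in the pattern quoted above: an unescaped `$` is an end-of-line anchor rather than a literal dollar sign, and an unescaped `.` matches any character. The sketch below illustrates the difference. It is written in Python only for illustration (UiPath uses the .NET regex engine), the sample text is hypothetical since the real invoice text is not shown in this thread, and a capture group is used instead of the lookbehind because Python's `re` module does not accept variable-width lookbehinds.

```python
import re

# Hypothetical text layer of an invoice; not taken from the actual PDF.
sample = "Subtotal 100.00\nAmount Due $123.45\nPaid 0.00"

# As posted: "$" is an end-of-string anchor here and "." matches any character.
unescaped = re.compile(r"Due\s*$(\d+.\d{2})")

# Escaped: "\$" is a literal dollar sign and "\." a literal decimal point.
escaped = re.compile(r"Due\s*\$\s*(\d+\.\d{2})")


def extract_amount(pattern, text):
    """Return the first captured amount, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(extract_amount(unescaped, sample))
print(extract_amount(escaped, sample))
```

With this sample, only the escaped pattern finds the amount. Whether that is the whole story for the real invoices depends on what the text version of the PDF actually contains.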

prasath17 · February 16, 2021, 1:56pm

@chintan.patel Yes, you have to write the pattern against what is actually in the text version of the PDF, or else the regex won’t work.

But may I ask why you are choosing regex for Amount Due? This can be easily captured with the Form Extractor or the Intelligent Form Extractor.
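
To illustrate the point about matching the text version rather than the rendered page, here is a minimal Python sketch. The `ocr_text` value is a made-up example of how a text layer can differ from what is displayed (extra spaces, the value on its own line); the real text should be inspected the same way. Because the pattern is anchored on the label rather than on a page position, it does not matter which line "Amount Due" ends up on.

```python
import re

# Hypothetical text layer: double space in the label, value on the next line.
ocr_text = "Item A 10.00\nAmount  Due\n$ 1,234.50\n"

# repr() exposes newlines and repeated spaces that the rendered PDF hides.
print(repr(ocr_text))

# Strict pattern written from the rendered page: single spaces, same line.
strict = re.compile(r"Amount Due \$(\d+\.\d{2})")

# Tolerant pattern: any whitespace between tokens, optional "$", thousands commas.
tolerant = re.compile(r"Amount\s+Due\s*\$?\s*([\d,]+\.\d{2})")

strict_match = strict.search(ocr_text)
tolerant_match = tolerant.search(ocr_text)

strict_amount = strict_match.group(1) if strict_match else None
tolerant_amount = tolerant_match.group(1) if tolerant_match else None

print(strict_amount, tolerant_amount)
```

In .NET regex, `\s` also matches line breaks, so the same tolerant pattern should behave similarly there, though that is an assumption to verify in the regex editor against the real extracted text.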

chintan.patel · February 17, 2021, 1:17am

Hello @prasath17
What is the Intelligent Form Extractor? I had to choose regex because the position of the text “Amount Due” is not fixed on the form; it changes depending on the number of lines.

prasath17 · February 17, 2021, 2:17am

@chintan.patel - If the Amount Due position is not fixed, then the Regex Based Extractor and the Intelligent Form Extractor won’t work. In that case, you have to go with the ML Extractor.

chintan.patel · February 17, 2021, 5:04am

So there are no other alternatives?

prasath17 · February 17, 2021, 5:08am

@chintan.patel ML Extractor. If it’s an invoice, you can add the Invoice endpoint, which will extract the amount due.

chintan.patel · February 17, 2021, 5:14am

I had ML, but it’s not cost-effective. Also, it doesn’t scan everything I need, so I would have to invest in my own ML endpoint. Anyway, thanks for your help.
