Regex Based Extractor - Improvement Idea

AndresTarazona · May 29, 2020, 1:37pm

I have been using the Regex Based Extractor activity as an alternative to the Form Extractor, and I found some great features, such as using Explicit Capture to define the extraction groups.

However, I also found that the Document Text that comes from the Digitize Document activity can contain words out of order (i.e. the text is not always read from top to bottom and left to right), which makes it harder to extract text based on anchors or delimiter text.

Idea: Use a property of the Document Object Model variable called GetVisualTextProjection, and its property ProjectedText, as the input text for the Regex Based Extractor, so that we can extract from a projection of the document in which the text is sorted.

AndresTarazona · June 23, 2020, 3:18am

Hi @Ioana_Gligan

I would appreciate it if the Document Understanding team could consider this improvement idea for the Regex Based Extractor activity. It could be an optional feature for comparison.

Regards,

Andres

Ioana_Gligan · June 23, 2020, 7:24am

Hello @AndresTarazona,

This is already in: check the "use visual alignment" flag on the activity. At run time, it will use the top-to-bottom, left-to-right alignment of the text before applying the regex.
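
For readers wondering what a top-to-bottom, left-to-right projection means in practice, here is a minimal Python sketch. It is not the activity's actual algorithm: the `Word` class, the `project_reading_order` function, the `line_tolerance` value, and the sample coordinates are all made up for illustration. It only shows the general idea of grouping words into lines by vertical position and then sorting each line by horizontal position.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word box (hypothetical units)
    y: float  # top edge of the word box (hypothetical units)


def project_reading_order(words, line_tolerance=5.0):
    """Group words into lines by vertical position, then sort each line left to right."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        # A word joins the current line if it is vertically close to that line's first word.
        if lines and abs(word.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    )


# Words listed in an order that does not follow the page layout.
words = [
    Word("Total:", 10, 102),
    Word("Invoice", 10, 20),
    Word("125.00", 200, 100),
    Word("No.", 80, 21),
    Word("A-1001", 200, 19),
]
projected = project_reading_order(words)
```

With this sample, the unordered words come out as two lines, the invoice line first and the total line second, which is the kind of text an anchor-based regex is easier to write against.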

AndresTarazona · June 23, 2020, 2:51pm

Thank you!!!

g.ward · February 10, 2021, 1:28pm

Hi, I am missing something: we can select that option, but how do we get the projected text that the regex uses instead of the original, so that we know what to write our regex against?

AndresTarazona · February 10, 2021, 3:32pm

Hi @g.ward

You can write that text into a text file.

1. Digitize your document using the Digitize Document activity.
2. Add a Write Text File activity and set its Text property to DOM.GetVisualTextProjection.ProjectedText.
3. Use the text written to the file to build your RegEx.
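
As a rough illustration of the last step, this Python sketch tries a pattern against a piece of projected text. The sample text, the pattern, and the group names are invented for the example, and Python's `re` module stands in for the .NET regex engine. Note that Python writes named groups as `(?P<name>...)`, while .NET uses `(?<name>...)`, so the pattern needs adapting before it goes into the activity.

```python
import re

# Hypothetical sample standing in for the text written out in step 2.
projected_text = "Invoice No. A-1001\nTotal: 125.00"

# Named groups mark the pieces to extract; \s+ also matches the line break.
PATTERN = re.compile(
    r"Invoice No\.\s+(?P<invoice_number>\S+)\s+Total:\s+(?P<total>\d+\.\d{2})"
)


def extract_fields(text):
    """Return the named groups as a dict, or an empty dict if nothing matches."""
    match = PATTERN.search(text)
    return match.groupdict() if match else {}


fields = extract_fields(projected_text)
```

Testing against the projected text rather than the raw Document Text matters because the anchors ("Invoice No.", "Total:") are only guaranteed to sit next to their values once the words are in reading order.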

Hope it works for you

Regards,
Andres

naut1lus · April 22, 2021, 9:20am

Hi @AndresTarazona,

Have you found a way to use the text read from top to bottom and left to right?

1. I have the document text created with DOM.GetVisualTextProjection.ProjectedText, but if I use it to set the DocumentText property of the DataExtractionScope activity, I get the exception “The document text does not match the Document Object Model”.
2. As an alternative, I set UseVisualAlignment=True on the RegexBasedExtractor activity and used the standard outputs of the DigitizeDocument activity, but it does not seem to apply the regex to the visually aligned text.

Any further suggestions would be welcome!

Thanks

Massimiliano

AndresTarazona · April 24, 2021, 5:45pm

Hi @naut1lus

About your points:

1. If you modify the text, you can’t use it as input for the Data Extraction Scope. This includes any change (trim, replace, remove…).
2. The visual alignment property should work. Check that you are building your RegEx against the output string of DOM.GetVisualTextProjection.ProjectedText.
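
One way to picture the first point is the toy Python sketch below. It assumes, purely for illustration, that a document model stores each word together with its character offset in the original text. This is an assumption about the general mechanism, not a description of how the Data Extraction Scope validates its inputs, and `words_match_text` and `dom_words` are hypothetical names.

```python
def words_match_text(text, words):
    """Return True if every (start, word) pair points at that exact word in text."""
    return all(text[start:start + len(word)] == word for start, word in words)


original = "  Invoice No. A-1001"

# Hypothetical model: each word remembers where it starts in the original text.
dom_words = [(2, "Invoice"), (10, "No."), (14, "A-1001")]

ok_original = words_match_text(original, dom_words)       # offsets line up
ok_trimmed = words_match_text(original.strip(), dom_words)  # offsets shifted by the trim
```

Under that assumption, even a harmless-looking trim shifts every offset, so the text and the model no longer agree. The same would apply to passing a reordered projection in place of the original Document Text.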

Please let me know if you have additional questions.

Best,
Andres
