Regex Based Extractor - Improvement Idea

I have been using the Regex Based Extractor activity as an alternative to the Form Extractor, and I found some great features, such as using Explicit Capture to determine the extraction groups.
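As a rough illustration of extraction driven only by named groups, here is a minimal Python sketch. The sample text, field names and pattern are invented for this example. Python's `re` module has no Explicit Capture flag, so the sketch marks the unnamed group as non-capturing with `(?:...)` to get a similar effect; in .NET regex syntax, named groups are written `(?<name>...)` instead of `(?P<name>...)`.

```python
import re

# Hypothetical invoice line, invented for illustration.
text = "Invoice No: INV-2041 Date: 12/03/2021"

# Only the named groups capture. The (?:...) group handles the
# "No"/"Number" alternation without producing a capture of its own.
pattern = re.compile(
    r"Invoice\s+(?:No|Number):\s*(?P<invoice_number>[A-Z]+-\d+)\s+"
    r"Date:\s*(?P<invoice_date>\d{2}/\d{2}/\d{4})"
)

match = pattern.search(text)
fields = match.groupdict() if match else {}
print(fields)
```

Each named group maps to one field to extract, which keeps the pattern readable when several values sit on the same line.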


However, I also found that the Document Text produced by the Digitize Document activity is not always in reading order (i.e. the text is not always read from top to bottom and left to right). This increases the complexity of our projects when we extract text based on anchors or delimiter text.

Idea: Use a property of the Document Object Model variable called GetVisualTextProjection, and its ProjectedText property, as the input text for the Regex Based Extractor. This would let us extract from a projection of the text in which the words are sorted.
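To make the idea of a sorted projection concrete, here is a minimal Python sketch that rebuilds reading order from word positions. It is not the algorithm UiPath uses: the `Word` class, the `project_reading_order` function, the coordinates and the `line_tolerance` value are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word box
    y: float  # top edge of the word box


def project_reading_order(words, line_tolerance=5.0):
    """Group words into lines by vertical position, then sort each line left to right."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        # A word joins the current line if its top edge is close to the line's first word.
        if lines and abs(word.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines
    )


# Words in scrambled order, as an unsorted document text might list them.
words = [
    Word("Total:", 10, 102), Word("INV-2041", 120, 20),
    Word("Invoice", 10, 21), Word("98.50", 120, 100), Word("No:", 70, 19),
]
projected = project_reading_order(words)
print(projected)
```

With the words back in reading order, an anchor such as "Total:" is followed directly by its value, which is what an anchor-based regex relies on.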

Hi @Ioana_Gligan

I would appreciate it if the Document Understanding team could consider this improvement idea for the Regex Based Extractor activity. It could be an optional feature, for comparison.

Regards,

Andres

Hello @AndresTarazona,

This is already in: check the Use Visual Alignment flag on the activity. At run time, it will apply the top-to-bottom, left-to-right alignment of the text before applying the regex :slight_smile:


Thank you!!!

Hi, I am missing something. We can select that option, but how do we get the projected text that the regex uses instead of the original, so that we know what to write our regex against?

Hi @g.ward

You can write that text into a text file.

  1. Digitize your document using the Digitize Document activity
  2. Add a Write Text File activity and set its Text property to DOM.GetVisualTextProjection.ProjectedText
  3. Use your extracted text to build your RegEx
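Once the projected text is in a file, a pattern can be tried against it outside the workflow. The following minimal Python sketch shows that check; the file name, the sample content and the pattern are invented for illustration, and the file is written by the script only so the example runs on its own. Python and .NET regex syntax differ in places (for example, named groups are `(?<name>...)` in .NET), so a pattern tested this way may need small adjustments before it goes into the activity.

```python
import re
import tempfile
from pathlib import Path

# Stand-in for the file written in step 2; the content is invented for illustration.
sample = "Invoice No: INV-2041\nTotal: 98.50\n"
path = Path(tempfile.gettempdir()) / "projected_text_sample.txt"
path.write_text(sample, encoding="utf-8")

# Load the projected text exactly as it was saved.
projected_text = path.read_text(encoding="utf-8")

# Anchor-based pattern: capture the amount that follows "Total:" on the same line.
total_pattern = re.compile(r"^Total:[ \t]*(?P<total>\d+\.\d{2})$", re.MULTILINE)

match = total_pattern.search(projected_text)
total = match.group("total") if match else None
print(total)
```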

Hope it works for you

Regards,
Andres


Hi @AndresTarazona ,

have you found a way to use the text read from top to bottom and left to right?

  1. I created the document text with DOM.GetVisualTextProjection.ProjectedText, but if I use it to set the DocumentText property of the DataExtractionScope activity, I get the exception “The document text does not match the Document Object Model”.

  2. As an alternative, I set the property UseVisualAlignment=True in the RegexBasedExtractor activity and used the standard outputs of the DigitizeDocument activity, but it does not seem to apply the regex to the visually aligned DocumentText.

Any further suggestions would be welcome!

Thanks

Massimiliano

Hi @naut1lus

About your points:

  1. If you modify the text in any way (trim, replace, remove…), you can’t use it as an input for the Data Extraction Scope.

  2. The visual alignment property should work. Check that you are using the output string from DOM.GetVisualTextProjection.ProjectedText to build your RegEx.

Please let me know if you have additional questions.

Best,
Andres