I have been using the Regex Based Extractor activity as an alternative to the Form Extractor, and I found some great features, such as using Explicit Capture to define the extraction groups.
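For anyone unfamiliar with explicit capture, here is a minimal sketch of the idea in Python rather than in the activity itself. Python's re module has no ExplicitCapture option, so the helper group is written as non-capturing to get the same effect: only the named groups come back as extracted fields. The sample text and the field names (invoice_number, invoice_date) are made up for illustration.

```python
import re

# Hypothetical sample text; the labels and values are invented for this example.
text = "Invoice No: INV-1042  Date: 2023-05-17"

# Only the named groups are meant to be extracted. The (?:...) group is a
# helper that matches either label spelling without producing a capture.
pattern = re.compile(
    r"Invoice\s+(?:No|Number):\s*(?P<invoice_number>[A-Z]+-\d+)"
    r"\s+Date:\s*(?P<invoice_date>\d{4}-\d{2}-\d{2})"
)

match = pattern.search(text)
# groupdict() returns only the named groups, keyed by group name.
fields = match.groupdict() if match else {}
print(fields)
```

Each named group maps naturally to one extracted field, which is what makes this style convenient for extraction.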
However, I also found that the Document Text produced by the Digitize Document activity can contain words out of order (i.e., the text is not always read top to bottom and left to right), which makes it harder to extract text based on anchors or delimiter text.
Idea: use GetVisualTextProjection on the Document Object Model variable, and its ProjectedText property, as the input text for the Regex Based Extractor, so that we can extract from a projection of the document in which the text is sorted.
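To illustrate what a sorted projection means, here is a rough sketch in Python. It is not how GetVisualTextProjection is implemented; the Word class, the project_text function and the line_tolerance parameter are all hypothetical. It only shows the general idea of grouping words into lines by vertical position and then ordering each line left to right.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word box (hypothetical units)
    y: float  # top edge of the word box


def project_text(words, line_tolerance=5.0):
    """Group words into lines by vertical position, then order each line left to right."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        # A word joins the current line if its top edge is close to the
        # top edge of the first word on that line.
        if lines and abs(lines[-1][0].y - word.y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    )


# Made-up words, deliberately listed out of reading order.
words = [
    Word("Total:", 300, 102), Word("Name:", 10, 100),
    Word("Alice", 60, 101), Word("42.00", 360, 100),
    Word("City:", 10, 130), Word("Lima", 60, 131),
]
projected = project_text(words)
print(projected)
```

With text in this order, an anchor such as "Name:" is followed directly by its value, which is what makes anchor-based regex practical.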
I would appreciate it if the Document Understanding team could consider this improvement idea for the Regex Based Extractor activity. It could be an optional feature, for comparison.
This is already available: check the Use Visual Alignment flag on the activity. At run time, it will use the top-to-bottom, left-to-right alignment of the text before applying the regex.
Hi, I am missing something. We can select that option, but how do we get the projected text that the regex uses instead of the original, so that we know what to write our regex against?
Have you found a way to use the text read from top to bottom and left to right?
I have the document text created with DOM.GetVisualTextProjection.ProjectedText, but if I use it to set the DocumentText property of the DataExtractionScope activity, I get the exception “The document text does not match the Document Object Model”.
As an alternative, I set UseVisualAlignment=True in the RegexBasedExtractor activity and used the standard outputs of the DigitizeDocument activity, but it does not seem to apply the regex to the visually aligned DocumentText.
If you modify the text in any way (trim, replace, remove, …), you can’t use it as input for the Data Extraction Scope.
The visual alignment property should work. Check that you are building your regex against the output string of DOM.GetVisualTextProjection.ProjectedText.
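As a quick way to see why the regex has to be written against the projected string, here is a small Python sketch. Both strings are made up and are not real output from Digitize Document or ProjectedText; they only stand in for a scrambled reading order and a visually aligned one.

```python
import re

# Made-up stand-ins for the two texts; neither is real activity output.
raw_text = "Name: Total: Alice 42.00"           # reading order scrambled
projected_text = "Name: Alice    Total: 42.00"  # visually aligned

# The pattern assumes the amount follows its "Total:" anchor directly.
pattern = re.compile(r"Total:\s*(?P<total>\d+\.\d{2})")

raw_match = pattern.search(raw_text)
projected_match = pattern.search(projected_text)

print("raw:", raw_match.group("total") if raw_match else None)
print("projected:", projected_match.group("total") if projected_match else None)
```

A pattern tuned on one of the two strings can silently fail on the other, so it helps to log the projected string once and develop the regex against that.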
Please let me know if you have additional questions.