How to use Regex Based Extractor with a plain TXT file?

Patricio_Cainzos · April 12, 2023, 8:02pm

Hello to you all!

I have a process that needs to extract a lot of information from a text file. Because there are many different fields (always the same ones), I would like to know whether it is possible to use the Intelligent Document Understanding techniques, specifically the Regex Based Extractor.

I’ve already built my taxonomy for this kind of TXT file with all the required fields; there is also a table in text format…

Now, if I use the same basic activities and setup as for an image file, the “Digitize Document” activity fails when I pass my TXT file as input, because plain text files are not allowed.

The only thing I need in order to use all the other activities (classifiers, Data Extraction Scope + Regex Based Extractor) is the Document Object Model, set up as if it had been created by that activity.

Can you help me? Do you know any other way I can take advantage of this extractor? I find it very useful because it will, for example, automatically split the fields according to the taxonomy.

Thank you very much in advance.

yikwen.goo · April 13, 2023, 5:45am

Document Understanding takes documents such as PDFs or images as input. After the document is converted to text using OCR, the Regex Based Extractor can be applied to extract information from it using regular expressions.

Given that your use case starts with a .txt file, there is no need to convert it to text or to use Document Understanding functions to extract the data. You could just use the Read Text File activity to read the file into your robot, and then use the Matches activity to apply the regex for each individual data point you want to extract.
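
To illustrate the idea outside of Studio, here is a minimal sketch in Python of "read the text, then apply one regex per field". The field names, patterns, and sample text are hypothetical and only stand in for whatever your taxonomy defines; in UiPath the same logic would be built with Read Text File plus one regex match per field.

```python
import re

# Hypothetical fields and patterns; replace them with the ones from your taxonomy.
FIELD_PATTERNS = {
    "invoice_number": r"Invoice\s+No\.?:\s*(\S+)",
    "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total:\s*([\d.,]+)",
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return a dict mapping each field name to its first match, or None."""
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        # Group 1 holds the value; the label itself is not captured.
        result[name] = match.group(1) if match else None
    return result


# Made-up sample standing in for the content of the .txt file.
sample = """Invoice No: A-1023
Date: 2023-04-12
Total: 1,250.00
"""

fields = extract_fields(sample)
print(fields)
```

Keeping the patterns in one dictionary gives a rough equivalent of a taxonomy: each field has a name and a rule, and a missing field shows up as `None` instead of raising an error.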

Patricio_Cainzos · April 13, 2023, 3:33pm

Hello.

Yes, you are totally right. But given that the TXT file is somewhat complex because it can contain tables, I was hoping there would be a way to take advantage of the extractors. The only thing I would need to do is manually create the DOM, so that all the activities (except Digitize Document) can work with it.

Any idea how I can manually create the DOM?

yikwen.goo · April 13, 2023, 3:37pm

I’m not sure. Either way, even if you use Document Understanding with the Regex Based Extractor, it won’t help with the tables, because the regex extractor can only extract individual field values, not a table.

If you have a table in a .txt file, it looks like you’ll have to try various string manipulation methods to extract the data you want (e.g. splitting the text into arrays and treating the individual elements of each array as individual cells).
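
As a rough sketch of that splitting approach, the Python snippet below turns a plain-text table into a list of rows. It assumes that columns are separated by tabs or by two or more spaces, that the first non-empty line is the header, and that lines made only of dashes are separators. The function name and the sample table are made up for illustration, so the split rule will need adjusting to your actual layout.

```python
import re


def parse_text_table(lines):
    """Split each table line into cells and map them to the header names."""
    rows = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if set(stripped) <= set("-= "):
            continue  # skip separator lines such as "-----  ---"
        # Assumption: cells are separated by a tab or by 2+ spaces.
        cells = [cell.strip() for cell in re.split(r"\t|\s{2,}", stripped)]
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Made-up sample standing in for the table section of the .txt file.
sample_table = """Item        Qty   Price
----------  ---   -----
Widget A    2     10.50
Gadget B    1     99.00
"""

table = parse_text_table(sample_table.splitlines())
for row in table:
    print(row)
```

A single space inside a cell (as in "Widget A") is preserved because only runs of two or more spaces count as a column break. Fixed-width tables with empty cells would need slicing by column position instead.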

Patricio_Cainzos · April 13, 2023, 4:15pm

OK, yes, you are right. I think I will have to try a different approach then. Thank you!
