How to use the Regex Based Extractor with a plain TXT file?

Hello to you all!

I have a process that needs to extract a lot of information from a text file. Because there are many different fields (but always the same ones), I want to know whether it is possible to use the Intelligent Document Understanding techniques, specifically the Regex Based Extractor.

I’ve already built my taxonomy for this kind of txt file with all the required fields. There is also a table in text format…

Now, if I use the same basic activities and setup as for an image file, the “Digitize Document” activity fails when I pass my txt file as input, because plain text files are not allowed.

The only thing I need in order to use all the other activities (classifiers, Data Extraction Scope + Regex Based Extractor) is to have the Document Object Model set up as if it had been created by the Digitize Document activity.

Can you help me? Do you know any other way I can take advantage of this extractor? I find it very useful because it will, for example, automatically split the fields according to the taxonomy.

Thank you very much in advance.

Document Understanding takes document types such as PDFs or images as input. After the document is converted to text using OCR, the Regex Based Extractor can be applied to extract information from it using regular expressions.

Given that your use case starts with a .txt file, there is no need to convert anything to text or to extract the data with Document Understanding functions. You could just use the Read Text File activity to read the file into your robot, and then use the Match activity to apply a regex for each individual data point you want to extract.
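To illustrate the idea outside of UiPath, here is a minimal Python sketch of the same approach: read the text, then apply one regex per field. It is not the UiPath activities themselves. The sample text, the field names and the patterns are all made up for illustration, so they would need to be adapted to the real file layout.

```python
import re

# Hypothetical sample; in practice this string would come from reading the .txt file,
# e.g. Path("input.txt").read_text(encoding="utf-8").
SAMPLE_TEXT = """Invoice Number: INV-1042
Customer: Acme Corp
Total: 1250.50
"""

# Hypothetical field names, each mapped to one regex with a single capture group.
FIELD_PATTERNS = {
    "invoice_number": r"^Invoice Number:\s*(\S+)",
    "customer": r"^Customer:\s*(.+)$",
    "total": r"^Total:\s*([\d.]+)",
}


def extract_fields(text, patterns):
    """Return {field: value or None}, using the first match of each pattern."""
    result = {}
    for field, pattern in patterns.items():
        # MULTILINE lets ^ and $ anchor at the start and end of every line.
        match = re.search(pattern, text, flags=re.MULTILINE)
        result[field] = match.group(1).strip() if match else None
    return result


fields = extract_fields(SAMPLE_TEXT, FIELD_PATTERNS)
print(fields)
```

Keeping the patterns in one dictionary plays a role similar to the taxonomy: the list of fields lives in a single place, and a field that is not found simply comes back as None instead of raising an error.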

Hello.

Yes, you are totally right. But given that the txt file is somewhat complex because it can have tables in it, I was hoping there would be a way to take advantage of the extractors. The only thing I would need to do then is manually create the DOM, so that all the activities (except Digitize Document) can work with it.

Any idea how I can manually create the DOM?

I’m not sure, but either way, even if you use Document Understanding and the regex extractors, that won’t help with the tables, because the regex extractors can only extract individual field values, not a table.

If you have a table in a .txt file, it looks like you’ll have to try various string manipulation methods to extract the data that you want (e.g. splitting the text into arrays and treating the individual elements of each array as individual cells).
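As a rough sketch of that splitting approach, again in Python rather than UiPath, the snippet below treats the first non-empty line as the header and splits every following line into cells. The table content and the function name parse_table are hypothetical. It also assumes that cells never contain the delimiter themselves, which may not hold for the real file.

```python
# Hypothetical table as it might appear inside the text file.
TABLE_TEXT = """Item        Qty   Price
Widget      2     10.00
Gadget      1     25.50
"""


def parse_table(text, delimiter=None):
    """Split a text table into a list of {header: cell} dictionaries.

    With delimiter=None, each line is split on runs of whitespace, so a cell
    that contains a space would be split in two (assumption: that never happens).
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = [cell.strip() for cell in lines[0].split(delimiter)]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split(delimiter)]
        rows.append(dict(zip(header, cells)))
    return rows


rows = parse_table(TABLE_TEXT)
print(rows)
```

If the columns are separated by a fixed character such as "|" or a tab, passing it as the delimiter is usually more robust than splitting on whitespace.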

OK sir, yes, you are right. I think I will have to try a different approach then. Thank you!