I really enjoy the UX and workflow of the validation station that’s included in the intelligentOCR activities. However, I am having an issue with how the digitize document activity outputs the text it reads from tables. The text comes out in a seemingly random order, so I am unable to apply the regex extractor to group the appropriate data. The form extractor works well for a single use case, but the row sizes are different for each table on every page.
I am able to extract, format, and group the data correctly into a DataTable with string manipulation and regex, using either the native text screen scraper or the read pdf text activity.
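As a rough illustration of that kind of regex grouping (not the actual workflow), here is a minimal Python sketch. The sample text, the column layout, and the names `SAMPLE_TEXT`, `ROW_PATTERN`, and `extract_rows` are all made up for the example; a real workflow would build a DataTable rather than a list of dictionaries.

```python
import re

# Hypothetical sample of text as a PDF text reader might return it.
SAMPLE_TEXT = """\
Item        Qty   Price
Widget A    2     10.50
Widget B    10    3.25
Total             53.50
"""

# Assumed row layout: a name, an integer quantity, then a decimal price.
ROW_PATTERN = re.compile(
    r"^(?P<item>.+?)\s{2,}(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})$"
)


def extract_rows(text):
    """Return one dict per line that matches the assumed row layout."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:  # header and total lines do not match and are skipped
            rows.append(
                {
                    "item": match["item"],
                    "qty": int(match["qty"]),
                    "price": float(match["price"]),
                }
            )
    return rows
```

Because the pattern is applied line by line, it does not matter how many rows each table has, which is the part the fixed-layout extractors struggle with.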
My suggestion: would it be possible to have a very basic version of the validation station that takes a DataTable as input and outputs the validated DataTable? This would create a more seamless workflow (you could view the pdf and edit the data table the same way it is presented in the current intelligentOCR activity), as opposed to having to open the excel file next to the pdf and then make any needed changes after the workflow is complete.
This basic version would drop items such as taxonomy, DOM, confidence, and the ability to select text on the pdf (as OCR is not being used). It would just be a user-friendly way for someone to edit and validate a data table before it gets written to an excel file and used in other workflows.