I’m looking for a robot that can categorise PDFs (which may have multiple pages) by their format and then extract the data according to that format. If the format is not recognised, it should perform machine learning on the new format and save it for future scanned documents.
I’ve heard of the Intelligent OCR package and tried to use it, but it is quite difficult to use without prior knowledge.
Can someone help me? Is there a tutorial or any documentation on how to use it?
Hello @loana_Gligan and @loginerror,
I’ve seen the workflow and it’s great, but I’m wondering about something: there is a Validation Station. Is this activity used for the machine learning, or will it pop up for every scanned invoice?
I tried running the robot with invoices of the same format in my input folder, and the Validation Station popped up every time.
What I need is to perform OCR on all the different types of invoices and, if the format is unknown to the bot, have it show the Validation Station to learn the new pattern,
or save all invoices of unknown type in a separate folder so that a human can handle the machine learning later.
You will need to write your own logic around whether to show the validation station or not.
Please note that currently the machine learning extractor does not expose training capabilities for the community edition.
If you use a limited number of invoice formats for now, you can write some basic logic (are all the fields I need extracted? are any missing? etc.), test it on those invoice types, and then decide whether or not to show the validation station.
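To make the "basic logic" idea concrete, here is a minimal sketch in Python. It is not UiPath code and does not use the Intelligent OCR API; it assumes the extraction results have already been copied into a plain dictionary of field name to text, and the field names are hypothetical. In a workflow, the same check would typically sit in an If activity before the validation station.

```python
from typing import List, Mapping, Optional, Sequence

# Hypothetical field names; replace them with the fields your process needs.
REQUIRED_FIELDS = ("invoice_number", "invoice_date", "total_amount")


def missing_fields(
    extracted: Mapping[str, Optional[str]],
    required: Sequence[str] = REQUIRED_FIELDS,
) -> List[str]:
    """Return the required fields that are absent, None, or blank."""
    return [name for name in required if not (extracted.get(name) or "").strip()]


def needs_validation(
    extracted: Mapping[str, Optional[str]],
    required: Sequence[str] = REQUIRED_FIELDS,
) -> bool:
    """Show the validation station only when a required field is missing."""
    return bool(missing_fields(extracted, required))
```

For example, a result with a blank `total_amount` would be sent to validation, while a result with all three fields filled in would skip it.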
If you have unknown invoice formats, then I strongly recommend discussing with the business whether accuracy is critical.
If some errors are allowed, then I would check that certain important fields have values and, if they do, not show the validation station. This suits cases where a certain degree of error is acceptable, because it can sometimes be cheaper to correct the few cases that go wrong than to validate every case.
If no errors are allowed, then I would ALWAYS show the validation station. This is because even if the extractor returns the right value (from the right place), there might be OCR issues that you will not notice (like a zero identified as the letter O) but will want to correct.
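The two policies above can be combined into a single decision. The following is only a sketch in Python under stated assumptions: `errors_allowed` is a hypothetical flag agreed with the business, `important_fields` is a hypothetical list of field names, and the extraction result is assumed to be a plain dictionary rather than any UiPath type.

```python
from typing import Mapping, Optional, Sequence


def should_show_validation(
    extracted: Mapping[str, Optional[str]],
    important_fields: Sequence[str],
    errors_allowed: bool,
) -> bool:
    """Decide whether a document goes to the validation station.

    errors_allowed=False: always validate, because a filled-in field can
    still hide an OCR mistake (for example a zero read as the letter O).
    errors_allowed=True: validate only when an important field is empty.
    """
    if not errors_allowed:
        return True
    return any(not (extracted.get(name) or "").strip() for name in important_fields)
```

Note that the check only tests whether a value is present. It cannot tell whether the value is correct, which is why the strict policy skips the check entirely.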