I am having an issue that I believe comes from training the intelligent keyword classifiers in Document Understanding. I have created a parallel workflow that cycles through all of the files in a specific folder and sends them for validation if below a certain confidence level on extraction.
For the most part this works fine and runs as intended; however, after multiple runs it hits errors reading the intelligent keyword JSON file.
Here is one for example:
Classify Document Scope: Invalid character after parsing property name. Expected ':' but got: S. Path '[0].TermsWithScores[22]', line 1, position 32773.
For the first 20 or so runs this error was not there. Then, after a successful run, a similar error will show up when I try to run again. I cannot figure out what causes this, since it runs smoothly for so many attempts and then randomly fails. If I completely clear the keyword JSON file and retrain it, the error goes away, but it will happen again some time down the line. This also defeats the purpose of having training classifiers if they must be reset.
I assume this is due to the nature of the parallel loop, but I have no idea how I should go about fixing it. If anyone has had a similar experience or has an idea of how I could fix this error, that would be greatly appreciated.
The issue you are seeing is due to the fact that there is no realistic way for us to ensure file consistency in the case of concurrent updates from within the training activity, since the storage used for the training file can vary (it is even possible that multiple robots update a shared network file). As such, consistency is left up to the user.
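To see why unsynchronized writers produce exactly this kind of parser error, here is a small Python sketch (outside UiPath, purely illustrative) that simulates two robots writing the same learning file: the second writer's shorter JSON leaves stale bytes from the first write at the end of the file, and the result no longer parses. The `TermsWithScores` key mirrors the error path above, but the exact exception message will differ from Newtonsoft.Json's.

```python
import json
import os
import tempfile

# Writer A stores a learning file with several trained terms.
long_json = json.dumps(
    [{"TermsWithScores": [{"Term": "invoice", "Score": 0.9}] * 5}]
)
# Writer B, running concurrently, produces a shorter learning file.
short_json = json.dumps([{"TermsWithScores": []}])

fd, path = tempfile.mkstemp()
os.write(fd, long_json.encode())
os.close(fd)

# B overwrites the start of the file without truncating it (as can happen
# with interleaved or partial concurrent writes), leaving A's tail behind.
fd = os.open(path, os.O_WRONLY)
os.write(fd, short_json.encode())
os.close(fd)

with open(path) as f:
    contents = f.read()
os.unlink(path)

try:
    json.loads(contents)
    corrupt = False
except json.JSONDecodeError as e:
    corrupt = True
    print("corrupt learning file:", e.msg)

print(corrupt)  # True: stale trailing bytes make the JSON unparseable
```

This is only one of several possible interleavings; any overlap of a read-modify-write cycle on the same file can break it in a similar way.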
What you can do is read the contents of the file into a string variable before the parallel loop. Use the Intelligent Keyword Classifier Trainer with the "LearningData" string argument instead of the file path. At the end of training, write the string back to the original file location, overwriting the old content.
I see, thank you for your insight. It makes a lot of sense why this issue is happening. I am still a little confused about how your proposed solution would fully work, although I believe I have the main idea behind it.
Does this mean that for training, I run a smaller version of the document understanding workflow not in parallel, where I set the taxonomy, classify, and then train the classification for each file, before running a main parallel loop where I extract the data? If so, that seems like it could take much longer to execute, and doubling the classifications seems redundant.
If that wasn't your initial suggestion, I have a few more queries. Since I am working with both image files and PDFs, it would be more difficult than just reading a PDF natively and creating a string; would I use a normal OCR activity to read the "LearningData" into the string? I am also dealing with multiple different file classifications, so how would that work? How would the classifier know which words to match with which file types? Also, with the new introduction of the Machine Learning Extractor Trainer, will the same issues appear if I try to train that?
Again, thank you so much for helping with this. This bug has been bothering me for a while, and the reason you gave makes total sense. I will remove the classification trainer and experiment with it outside the loop.
The Intelligent Keyword Classifier Trainer activity accepts an in/out argument called LearningData. It expects that this variable contains the contents of a learning file that you would otherwise provide in the LearningFilePath argument. My suggestion was to use a shared string variable in your parallel training instead of reading and writing directly to the file. At the end of the training you can then write the contents of the variable to a file in order to store it.