Hi all,
Pretty big ramble here, so please bear with me.
We are working on a problem where we are using Document Understanding (DU) to ultimately ingest files of one type into a system; let's call these cover pages. We will be performing text extraction on these at a later stage (the performer), but for now, in the dispatcher, the goal is to extract a number of documents from a single large master PDF.
Now, I am using DU & the Intelligent Keyword Classifier to split these documents based on the cover pages.
To train the model, I manually split one of these documents and carefully extracted only the cover pages.
So the plan is: the document gets digitized and then classified, and using the classification results I call the Extract PDF Page Range activity, feeding in DocumentBounds.StartPage plus a little logic around DocumentBounds.PageCount to generate a range string of the form "2" or "2-4". This approach works nicely.
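For reference, here is roughly the range-building logic I mean, as a minimal VB.NET sketch. `result` is a placeholder name for the ClassificationResult coming out of the classify step, and I am assuming DocumentBounds.StartPage is zero-based while Extract PDF Page Range expects one-based page numbers, so double-check the +1 against your own output:

```vb
' Assumption: result is the ClassificationResult from the classify step.
' DocumentBounds.StartPage appears to be zero-based, so shift it to the
' one-based numbering that Extract PDF Page Range expects.
Dim startPage As Integer = result.DocumentBounds.StartPage + 1
Dim endPage As Integer = startPage + result.DocumentBounds.PageCount - 1

' Single-page documents become "2"; multi-page ones become "2-4".
Dim pageRange As String = If(endPage = startPage,
                             startPage.ToString(),
                             String.Format("{0}-{1}", startPage, endPage))
```

The resulting pageRange string then goes straight into the Range property of Extract PDF Page Range.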
My first question.
The example I am developing with is a 96-page PDF, and the digitization process takes a r-e-a-l-l-y long time and absolutely pins the cores on my CPU for the duration. That in itself is no big issue, but I do have concerns about when this scales up to larger volumes of these documents. Does anyone know of any optimizations I can make to the digitization step?
Another question.
Some of these cover pages actually stretch to two pages depending on the volume of content in them, so my training set contains a mix of one-page and two-page examples. Does that pose a problem? I only ask because I am getting mixed results in classification.
Ultimately, I think we are going to use the Classification Station within Action Center to provide human-in-the-loop input to help us split these initially, feeding that back into training via the Train Classifiers scope, but I had hoped for better results out of the gate without it.
I did wonder whether I should perhaps train the classifier on the worst-case first page of a cover document: that is, work out what a worst-case document looks like, see what content is on its first page, and use that for training.
I guess this is difficult because the Intelligent Keyword Classifier takes granular control of the word list it has built away from me. I would appreciate your thoughts.
Andy