Classification and Splitting

Hello,

My customer sent me 4 sample PDF files, each containing several documents that must be split. I tried the Intelligent Keyword Classifier for this, but it doesn’t work well. I think this is because it automatically selects keywords that are not relevant.
In fact, my problem should be easy because each document has a title on page 1 that identifies the document type. If you look at the attached PDF (8 pages), it includes 4 documents of 2 pages each, as follows:

  • Pages 1-2 are a “demande de versement de subvention” (grant payment request), identified by the title at the top of page 1
  • Pages 3-4 are a “liste des stagiaires” (list of interns), identified by this title at the top of page 3
  • Pages 5-6 are a “compte rendu financier” (financial report), identified by this title at the top of page 5
  • Pages 7-8 are a list of physicists, identified by the “AIDE-SOIGNANT” keyword at the top of page 7

What I would like to do is this: define a list of keywords that identify each document type without a doubt, then have the classifier search for these keywords on every page. If a keyword is found, the page is the first page of a new document of that type and a separation is required. If not, the page is treated as the next page of the most recently identified document.
Unfortunately, our classifier does not work this way. Kofax, on the other hand, works exactly like this, and we are competing with them on this deal. The customer is happy with what Kofax provides for classification and splitting, but knows we are much better at extraction.
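
The keyword rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not how any UiPath or Kofax classifier is implemented: the `KEYWORDS` table, the type names and both function names are hypothetical, and each page is assumed to be available as already-OCRed text.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical keyword table: document type -> phrases expected in the
# title of its first page (taken from the sample described above).
KEYWORDS: Dict[str, List[str]] = {
    "demande_versement_subvention": ["demande de versement de subvention"],
    "liste_stagiaires": ["liste des stagiaires"],
    "compte_rendu_financier": ["compte rendu financier"],
    "aide_soignant": ["aide-soignant"],
}


def match_type(page_text: str) -> Optional[str]:
    """Return the first document type whose keyword occurs in the page text."""
    lowered = page_text.lower()
    for doc_type, phrases in KEYWORDS.items():
        if any(phrase in lowered for phrase in phrases):
            return doc_type
    return None


def split_pages(pages: List[str]) -> List[Tuple[Optional[str], int, int]]:
    """Group pages into (doc_type, first_page, last_page), 1-based inclusive."""
    segments: List[Tuple[Optional[str], int, int]] = []
    for number, text in enumerate(pages, start=1):
        doc_type = match_type(text)
        if doc_type is not None or not segments:
            # Keyword found (or very first page): start a new document.
            segments.append((doc_type, number, number))
        else:
            # No keyword: the page continues the current document.
            prev_type, first, _ = segments[-1]
            segments[-1] = (prev_type, first, number)
    return segments
```

One caveat of this sketch: a keyword that also appears in the body of a continuation page would trigger a false split, so the keywords have to be specific to the titles.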

Did I miss something here? Do you think we can improve classification/splitting to provide the best of both worlds to our customer?
Please let me know.

Thanks,
Eric

SKM_C45823120815310.pdf (715.8 KB)

Does it have bookmarks splitting each section, like below?
[image]

No, there are no bookmarks, only a title on the first page of each document that can be detected to identify a new document type in the page flow. Not sure I am being clear, please let me know…

I do not know if this is what you need, but I can see the following solution:

  • split the PDF per page (in your example, 8 separate PDFs)
  • starting from the LAST page, perform classification:
    → if the page is not classified, it belongs to a document not yet identified; continue with the previous page (moving toward the front)
    → if the page is classified, you have found the first page of a document; merge all pages processed so far into one document
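
The steps above could look roughly like this in Python. It is a minimal sketch, assuming the per-page texts are already extracted and that some `classify` function returns a document type or `None`; `split_backward` and `classify` are hypothetical names, not part of any UiPath package. Page indices are 0-based.

```python
from typing import Callable, List, Optional, Tuple


def split_backward(
    pages: List[str],
    classify: Callable[[str], Optional[str]],
) -> List[Tuple[Optional[str], List[int]]]:
    """Walk the pages from last to first and merge them into documents."""
    documents: List[Tuple[Optional[str], List[int]]] = []
    pending: List[int] = []  # pages seen since the last classified page
    for index in range(len(pages) - 1, -1, -1):
        pending.append(index)
        doc_type = classify(pages[index])
        if doc_type is not None:
            # First page of a document found: merge everything collected so far.
            documents.append((doc_type, sorted(pending)))
            pending = []
    if pending:
        # Leading pages that come before any recognized first page.
        documents.append((None, sorted(pending)))
    documents.reverse()  # restore reading order
    return documents
```

Scanning backward means a document is only emitted once its first page is reached, so no look-ahead is needed; any pages before the first recognized title end up in a group with type `None`.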

Cheers

That’s an interesting solution, which means I would classify each page independently (8 classifications for my sample PDF here). Do you think I should use the Keyword Based Classifier for this?

This solution is more complex than having it packaged into an existing classifier but I think it should work, let me give it a try.

Thank you !!

Well, so far I have not needed to involve the UiPath classification mechanism in my workflows, so I do not have much experience with it. A plain regex search was always enough.
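
As an illustration of the regex approach, here is a minimal sketch in Python. The patterns are assumptions derived from the titles mentioned earlier in the thread, and `TITLE_PATTERNS`, `detect_title` and the `top_lines` parameter are hypothetical names. Using `\s+` between words tolerates the extra spaces or line breaks that OCR often introduces.

```python
import re
from typing import Optional

# Hypothetical patterns, one per document type.
TITLE_PATTERNS = {
    "demande_versement_subvention": re.compile(
        r"demande\s+de\s+versement\s+de\s+subvention", re.IGNORECASE
    ),
    "liste_stagiaires": re.compile(r"liste\s+des\s+stagiaires", re.IGNORECASE),
    "compte_rendu_financier": re.compile(
        r"compte\s+rendu\s+financier", re.IGNORECASE
    ),
    "aide_soignant": re.compile(r"aide\s*-?\s*soignant", re.IGNORECASE),
}


def detect_title(page_text: str, top_lines: int = 5) -> Optional[str]:
    """Look for a known title in the first few non-empty lines of a page."""
    lines = [line for line in page_text.splitlines() if line.strip()]
    head = "\n".join(lines[:top_lines])
    for doc_type, pattern in TITLE_PATTERNS.items():
        if pattern.search(head):
            return doc_type
    return None
```

Restricting the search to the top of the page reduces the risk of matching a title that is merely quoted in the body of a continuation page.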

It would be interesting to hear back about which solution worked for you.

Cheers

I did it and it works perfectly, as long as the titles that identify each document type are read correctly. The OCR is quite good, so it works fine. But I’d rather have this logic implemented in an existing classifier.
Do you know if an ML classifier can be used for this? I never used any…
If so, I can ask my customer to provide me more sample documents for a proper training.

[EDIT] I discussed this with my expert colleague Sofiene Jenzri, and he told me the ML classifier does not split documents; only the Intelligent Keyword Classifier (IKC) can.

As my custom classifier/splitter works quite well, do you think it is possible to create my own classification result (using the ClassificationResult class) in order to have the classification station working?
Has anyone already done this?

[EDIT] I did it, and it works perfectly. My custom classifier/splitter returns a ClassificationResult object that makes the classification station usable. This is nice :)