Classification and Splitting

eric.marciano · January 22, 2024, 10:20am

Hello,

My customer sent me 4 sample PDF files, each containing several documents that must be split. I tried the Intelligent Keyword Classifier for this, but it doesn’t work well, I think because it automatically selects keywords that are not relevant.
In fact, my problem should be easy because each document has a title on page 1 that identifies the document type. If you look at the attached PDF (8 pages), it includes 4 documents of 2 pages each, as follows:

Pages 1-2 are a “demande de versement de subvention” (grant payment request), identified by this title at the top of page 1
Pages 3-4 are a “liste des stagiaires” (list of trainees), identified by this title at the top of page 3
Pages 5-6 are a “compte rendu financier” (financial report), identified by this title at the top of page 5
Pages 7-8 are a list of nursing assistants, identified by the “AIDE-SOIGNANT” keyword at the top of page 7

What I would like to do is this: define a list of keywords that each identify a document type without any doubt, then have the classifier search every page for these keywords. If a keyword is found, the page is the first page of a new document of that type and a split is required. If not, the page is treated as the next page of the most recently identified document.
Unfortunately our classifier does not work this way. On the other hand, Kofax works exactly like this, and we are in competition with them on this deal. So the customer is happy with what Kofax provides in terms of classification and splitting, but knows we are much better regarding extraction.
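
The keyword-driven splitting described above could be sketched roughly as follows. This is only an illustration in plain Python, not how any UiPath or Kofax classifier is implemented: the `KEYWORDS` table, the document type names, and the functions `first_page_type` and `split_forward` are all hypothetical, and the input is assumed to be the OCR text of each page, in page order.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical keyword table: one unambiguous title per document type.
# The type names on the right are made up for this sketch.
KEYWORDS: Dict[str, str] = {
    "demande de versement de subvention": "SubsidyPaymentRequest",
    "liste des stagiaires": "TraineeList",
    "compte rendu financier": "FinancialReport",
    "aide-soignant": "NursingAssistantList",
}


def first_page_type(page_text: str) -> Optional[str]:
    """Return the document type whose keyword appears in the page text, if any."""
    lowered = page_text.lower()
    for keyword, doc_type in KEYWORDS.items():
        if keyword in lowered:
            return doc_type
    return None


def split_forward(pages: List[str]) -> List[Tuple[Optional[str], List[int]]]:
    """Group 1-based page numbers into documents.

    A keyword hit starts a new document; any other page is appended to the
    most recently started document. Pages before the first hit end up in a
    document whose type is None.
    """
    documents: List[Tuple[Optional[str], List[int]]] = []
    for number, text in enumerate(pages, start=1):
        doc_type = first_page_type(text)
        if doc_type is not None or not documents:
            documents.append((doc_type, [number]))
        else:
            documents[-1][1].append(number)
    return documents
```

One limitation of this sketch: if a title is repeated on a continuation page (for example in a running header), that page would wrongly start a new document, so the keywords have to be chosen so that they only occur on first pages.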

Did I miss something here? Do you think we can improve classification/splitting to provide the best of both worlds to our customer?
Please let me know.

Thanks,
Eric

SKM_C45823120815310.pdf (715.8 KB)

rmorgan · January 22, 2024, 10:25am

Does it have bookmarks separating each section?

eric.marciano · January 22, 2024, 10:27am

No, there are no bookmarks, but each document has a title on its first page that can be detected to identify the start of a new document type in the page flow. Not sure I am being clear, please let me know…

J0ska · January 22, 2024, 10:41am

I do not know if this is what you need, but I can see the following solution:

split the PDF per page (in your example, 8 separate PDFs)
starting from the LAST page, perform classification on each page:
→ if the page is not classified, it belongs to a document not yet identified; continue with the preceding page
→ if the page is classified, you have found the first page of a document; merge it with all pages processed since the previous merge into one document
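
A minimal sketch of this backward pass, in plain Python rather than a UiPath workflow. The function name `split_backward` and the `classify` callback are assumptions for illustration: `classify` stands for whatever per-page classification is used and is expected to return a document type for a first page and `None` otherwise.

```python
from typing import Callable, List, Optional, Tuple


def split_backward(
    pages: List[str],
    classify: Callable[[str], Optional[str]],
) -> List[Tuple[Optional[str], List[int]]]:
    """Walk the pages from last to first and merge them into documents.

    Unclassified pages are held in `pending` until a classified page
    (the first page of a document) is reached; then everything held so
    far is merged into one document of that type.
    """
    documents: List[Tuple[Optional[str], List[int]]] = []
    pending: List[int] = []  # 1-based page numbers seen since the last merge

    for number in range(len(pages), 0, -1):
        pending.append(number)
        doc_type = classify(pages[number - 1])
        if doc_type is not None:
            documents.append((doc_type, sorted(pending)))
            pending = []

    # Pages before the first classified page have no known type.
    if pending:
        documents.append((None, sorted(pending)))

    documents.reverse()  # restore reading order
    return documents
```

Walking backward means each document is complete as soon as its first page is found, so no look-ahead is needed to know where a document ends.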

Cheers

eric.marciano · January 22, 2024, 10:46am

That’s an interesting solution, which means I would classify each page independently (8 classifications for my sample PDF here). Do you think I should use the Keyword Based Classifier for this?

This solution is more complex than having it packaged into an existing classifier but I think it should work, let me give it a try.

Thank you !!

J0ska · January 22, 2024, 10:55am

Well, so far I have not needed to involve the UiPath classification mechanism in my workflows, so I do not have much experience with it. A plain regex search has always been enough.
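
For illustration, a regex-based page check could look something like the sketch below. It is not taken from an actual workflow: the patterns, the document type names, and the `header_chars` cut-off are assumptions, and it presumes the OCR text of a page comes back in reading order so that the title sits near the start.

```python
import re
import unicodedata
from typing import List, Optional, Pattern, Tuple


def normalize(text: str) -> str:
    """Lowercase the text and strip accents so OCR variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Hypothetical patterns: tolerant of extra whitespace, line breaks and
# missing hyphens, which are common in OCR output.
TITLE_PATTERNS: List[Tuple[str, Pattern[str]]] = [
    ("SubsidyPaymentRequest", re.compile(r"demande\s+de\s+versement\s+de\s+subvention")),
    ("TraineeList", re.compile(r"liste\s+des\s+stagiaires")),
    ("FinancialReport", re.compile(r"compte[\s-]+rendu\s+financier")),
    ("NursingAssistantList", re.compile(r"aide[\s-]*soignant")),
]


def classify_page(text: str, header_chars: int = 300) -> Optional[str]:
    """Return the document type whose title pattern matches the top of the page."""
    header = normalize(text[:header_chars])
    for doc_type, pattern in TITLE_PATTERNS:
        if pattern.search(header):
            return doc_type
    return None
```

Restricting the search to the first few hundred characters is a design choice meant to reduce false hits when a title is mentioned again further down a page; the right cut-off depends on the documents.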

It would be interesting to hear back which solution worked for you.

Cheers

eric.marciano · January 22, 2024, 4:03pm

I did it and it works perfectly, as long as the titles that identify each document type are read correctly. The OCR is quite good, so it works fine. But I’d rather have this logic implemented in an existing classifier.
Do you know if an ML classifier can be used for this? I have never used one…
If so, I can ask my customer to provide me more sample documents for a proper training.

[EDIT] I discussed this with my expert colleague Sofiene Jenzri and he told me the ML classifier does not do the split, only the IKC can do it.

eric.marciano · January 23, 2024, 9:50am

As my custom classifier/splitter works quite well, do you think it is possible to create my own classification result (using the ClassificationResult class) so that the classification station can be used?
Has anyone already done this?

[EDIT] I did it, and it works perfectly. My custom classifier/splitter returns a ClassificationResult object that makes the classification station usable. This is nice.

system · February 28, 2025, 9:21am

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
