DU license consumption - limiting extraction to pages that contain the information

I'm extracting information from a PDF file. In some cases the PDF has more than one page and the information to extract is not on every page, but I don't have a pattern for this. I saw that in these cases the framework processes all pages and consumes DU license units. Is it possible to indicate which pages I want to extract from? I'm using only the Machine Learning Extractor.

The Machine Learning Extractor consumes one unit per processed page, even if the information to extract is not found on that page.

You can use a keyword-based classifier to identify, by keyword, the pages or PDFs you want to extract from. You can then filter the PDFs and send only those to the ML Extractor, which might help reduce your DU license consumption.
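As a rough illustration of that filtering idea outside of UiPath Studio, here is a minimal Python sketch. It assumes you already have the text of each page (for example from OCR) as a list of strings. The function names `pages_worth_extracting` and `units_saved` and the sample keywords are made up for this example and are not part of any UiPath API. The unit estimate simply applies the one-unit-per-processed-page rule mentioned above.

```python
from typing import Iterable, List


def pages_worth_extracting(page_texts: List[str], keywords: Iterable[str]) -> List[int]:
    """Return 0-based indices of pages containing at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [
        i for i, text in enumerate(page_texts)
        if any(k in text.lower() for k in lowered)
    ]


def units_saved(total_pages: int, kept_pages: int) -> int:
    """Units avoided, assuming one unit per page sent to the extractor."""
    return total_pages - kept_pages


# Hypothetical page texts, standing in for OCR output of a 3-page PDF.
sample_pages = [
    "Invoice Number: 123\nTotal Amount: 50.00",
    "Terms and conditions ...",
    "Payment details\nIBAN: ...",
]
keep = pages_worth_extracting(sample_pages, ["invoice number", "iban"])
print(keep)  # [0, 2]
print(units_saved(len(sample_pages), len(keep)))  # 1
```

Only the pages listed in `keep` would then be sent on to the extractor. Plain substring matching is the simplest possible choice here; a real keyword classifier is more forgiving of OCR noise.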

More details on how extractions are charged can be found below:

Nice, @sharon.palawandram! I'm already using the Intelligent Keyword Classifier, but it seems to classify the document as a whole and generate a single confidence level. How can I make it work at the page level? Is that possible?

I see what you're saying. UiPath gives overall, document-level confidence percentages for extraction and classification.

If you need page-level confidence levels, you will have to split the document beforehand. The Intelligent Keyword Classifier classifies a document as you define it in the taxonomy. You do have the option to split it in the Present Classification Station, but if you need page-level metrics, you will have to send the pages individually.
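If you go the splitting route, one bookkeeping step is deciding which page ranges to cut out of the original PDF. The sketch below is a minimal Python illustration, not UiPath functionality. Given the page indices you decided to keep, it groups consecutive pages into ranges, so neighbouring pages can stay together in one split file if you prefer that to strictly one page per file. The name `group_into_ranges` is made up for this example, and the actual splitting would still be done with whatever PDF activity or library you use.

```python
from typing import List, Tuple


def group_into_ranges(page_indices: List[int]) -> List[Tuple[int, int]]:
    """Group page indices into inclusive (start, end) runs of consecutive pages."""
    ranges: List[Tuple[int, int]] = []
    for page in sorted(set(page_indices)):
        if ranges and page == ranges[-1][1] + 1:
            # Page directly follows the current run, so extend that run.
            ranges[-1] = (ranges[-1][0], page)
        else:
            # Gap (or first page): start a new run.
            ranges.append((page, page))
    return ranges


print(group_into_ranges([0, 1, 4, 5, 6, 9]))  # [(0, 1), (4, 6), (9, 9)]
```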

Very insightful! Splitting the file before the keyword classifier had occurred to me as a possibility. I really don't know how it will influence the machine learning model, but it's something I'm going to test for sure! I'll wait a little for other comments to see if someone has a different perspective.

Of course. How did you train the fields in the ML Extractor? Were they full documents or split?

I trained with full documents.

Another idea that I want to test is the Keyword Classifier Trainer. I'm going to check how this tool learns.


Awesome. If you trained the ML model with full documents, it might not extract from single-page documents unless you retrain it.
