UiPath - Document Understanding - classification

TheBOT · October 9, 2025, 11:21am

How can we identify, after the classification step in UiPath, whether a multi-page PDF has multiple occurrences of the same fields across different pages? Per the business requirement, such a document must be rejected right after classification.

yash.bidaye · October 9, 2025, 11:41am

After the classification step, use the Extract Document Data activity.

The Extraction Results object already tracks the page number for every detected field occurrence.

Simply check the results for the target field: if it has multiple occurrences, and those occurrences are on different pages, reject.
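A minimal sketch of that rule in VB.NET, assuming the occurrences have already been flattened out of the extraction results into (field name, zero-based page index) pairs. The `Occurrence` class, the `FieldSpansPages` helper, and the sample data are hypothetical; how the page index is read from your version of the extraction results object is not shown here.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq

' Hypothetical holder for one detected occurrence of a field.
Public Class Occurrence
    Public Property FieldName As String
    Public Property PageIndex As Integer ' zero-based page index (assumption)
End Class

Public Module PageSpreadCheck
    ' True when the field was found on more than one distinct page.
    Public Function FieldSpansPages(occurrences As IEnumerable(Of Occurrence), fieldName As String) As Boolean
        Dim matching = occurrences.Where(Function(o) o.FieldName = fieldName)
        Dim distinctPages = matching.Select(Function(o) o.PageIndex).Distinct()
        Return distinctPages.Count() > 1
    End Function

    Sub Main()
        Dim found As New List(Of Occurrence) From {
            New Occurrence With {.FieldName = "Invoice Number", .PageIndex = 0},
            New Occurrence With {.FieldName = "Invoice Number", .PageIndex = 2},
            New Occurrence With {.FieldName = "Total", .PageIndex = 0}
        }
        Console.WriteLine(FieldSpansPages(found, "Invoice Number")) ' True: pages 0 and 2
        Console.WriteLine(FieldSpansPages(found, "Total"))          ' False: one page only
    End Sub
End Module
```

Comparing distinct page indexes rather than raw occurrence counts means a field repeated twice on the same page is not treated as a cross-page duplicate.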

arjun.shiroya · October 9, 2025, 12:14pm

Hi @TheBOT, the Extract PDF Page Range activity is handy here. You can use it to split the PDF into pages or page ranges after classification.

Then, for each page or range, extract the fields with your extractor or OCR.

Check whether the same fields appear on multiple pages by comparing the extracted data.

If duplicates are found across pages, reject the document immediately.

Using Extract PDF Page Range helps you handle pages individually, which makes the check easier and cleaner.

TheBOT · October 9, 2025, 2:02pm

Hey, thanks Yash. But how can we get the page occurrences of the fields from the extraction results? Can you please elaborate with the steps? I have not done this before and am not sure whether there is a method for it.

TheBOT · October 9, 2025, 2:06pm

Thanks Arjun, but how can we check whether the same field appears on multiple pages? If you can explain in a bit more detail, that will be really helpful.

arjun.shiroya · October 9, 2025, 2:46pm

* Use Extract PDF Page Range to split your PDF into single pages.

* Extract the target field from each page and save each value in a list, like fieldValues.

* To check for duplicates, use this LINQ:

fieldValues.GroupBy(Function(x) x).Any(Function(g) g.Count > 1)

If true, you found a duplicate, so reject the PDF.
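One caveat with grouping raw values: the same field on two pages with different values is not grouped together, and pages where the field is missing all produce the same blank value. A hedged alternative, assuming `fieldValues` holds one entry per page (blank when the field was not found on that page), is to count the pages that returned any value at all. The `CountPagesWithValue` helper and the sample data are illustrative only.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq

Public Module PerPageValueCheck
    ' Counts the pages on which the field produced a non-blank value.
    Public Function CountPagesWithValue(fieldValues As IEnumerable(Of String)) As Integer
        Return fieldValues.Count(Function(v) Not String.IsNullOrWhiteSpace(v))
    End Function

    Sub Main()
        ' One entry per page; the second page had no value for the field.
        Dim fieldValues As New List(Of String) From {"INV-001", "", "INV-002"}

        ' The GroupBy check sees no repeated value here.
        Dim sameValueRepeated = fieldValues.GroupBy(Function(x) x).Any(Function(g) g.Count() > 1)
        Console.WriteLine(sameValueRepeated)   ' False

        ' Counting non-blank pages shows the field on two pages.
        Dim rejectDocument = CountPagesWithValue(fieldValues) > 1
        Console.WriteLine(rejectDocument)      ' True
    End Sub
End Module
```

Which check fits depends on whether the requirement is "same value repeated" or "field present on more than one page".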

Or, use a loop too…

TheBOT · October 9, 2025, 3:18pm

Okay, but if we split the PDF, each part will be treated as a separate document. We need to treat all pages (the merged pages) as one document type and perform this multiple-field-value check on it. Also, is there any way to do this right after the classification step, without extraction?

arjun.shiroya · October 9, 2025, 3:34pm

Yes, so you need to treat the whole thing as one doc.

After classification, use Read PDF Text to get all text from every page at once.

Use Regex or string parsing to find all occurrences of your target field in the combined text.

Count how many times each field shows up.

If a field appears more than once, reject the document.

This works right after classification and before you start extraction.

It also treats the whole document as one unit, so you don't lose track of how many times a value appears across the document.

For example, say the field is “Invoice Number”:

System.Text.RegularExpressions.Regex.Matches(pdfText, "Invoice Number: \d+").Count > 1

If that is true, there is a duplicate somewhere in the whole file, and you can safely reject the PDF.
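As a runnable sketch of that count in VB.NET: the sample text below stands in for the output of Read PDF Text, and the looser pattern (case-insensitive, flexible whitespace) is an assumption about how the label might vary, not something the activity guarantees. The `CountMatches` helper is a hypothetical name. For scanned PDFs the text would have to come from an OCR-based read instead, since Read PDF Text only returns native text.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Public Module FieldLabelCount
    ' Counts how many times the label pattern occurs in the combined text.
    Public Function CountMatches(pdfText As String, pattern As String) As Integer
        Return Regex.Matches(pdfText, pattern, RegexOptions.IgnoreCase).Count
    End Function

    Sub Main()
        ' Sample text standing in for the combined text of all pages.
        Dim pdfText As String = "Invoice Number: 1001" & vbLf & "Page 2" & vbLf & "invoice number:  1002"
        Dim pattern As String = "Invoice\s+Number\s*:\s*\d+"

        Dim hits = CountMatches(pdfText, pattern)
        Console.WriteLine(hits)       ' 2
        Console.WriteLine(hits > 1)   ' True, so the document would be rejected
    End Sub
End Module
```

Note that a count taken from combined text cannot tell whether the two matches sit on the same page or on different pages.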

yash.bidaye · October 10, 2025, 4:41am

Hi @TheBOT

During extraction, use FORM EXTRACTOR OCR in the Extract Document Data activity. It extracts the data from the first page of the PDF and automatically skips the other pages if the value is found there. If the value is not found on the first page, it looks on the second page, and so on until it finds the value.

CHEERS!!

TheBOT · October 10, 2025, 7:18am

Okay, thanks Yash, but the documents are scanned and can be handwritten as well, so will the form extractor work for those? Also, since you said earlier that the extraction result will have the page occurrences of the fields, can you please explain how we can get those page occurrences along with the fields?

yash.bidaye · October 10, 2025, 7:32am

Yes, the FORM EXTRACTOR OCR will work for scanned as well as handwritten documents, on the condition that the handwriting is neat and clean and the letters are not overwritten, so that the accuracy of the extraction result is good. For the page number occurrences, use the DATA EXTRACTION SCOPE and then work with its output, ExtractionResult. To get the page number, try the expression "field.Value.GetTextPositions().Select(Function(pos) pos.PageIdx + 1)".

Hope it helps you
CHEERS!!

TheBOT · October 10, 2025, 8:59am

Okay, thanks Yash, but may I know what pos is here? Is it a random variable or an output of the GetTextPositions() function? And is PageIdx a method or, again, a variable?

yash.bidaye · October 10, 2025, 9:26am

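pos is an arbitrary variable name: it is the lambda parameter that stands for each item returned by GetTextPositions(). PageIdx is a variable (a property) on that item, not a method.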
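To make the role of `pos` concrete, here is a small VB.NET sketch. The `TextPosition` class is a hypothetical stand-in for whatever item type the positions call returns in your package version; only the lambda mechanics are the point.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq

' Hypothetical stand-in for one returned position item.
Public Class TextPosition
    Public Property PageIdx As Integer ' zero-based page index (assumption)
End Class

Public Module LambdaParameterDemo
    Sub Main()
        Dim positions As New List(Of TextPosition) From {
            New TextPosition With {.PageIdx = 0},
            New TextPosition With {.PageIdx = 0},
            New TextPosition With {.PageIdx = 3}
        }

        ' "pos" is only a name for the current item; "item" works the same way.
        Dim pagesA = positions.Select(Function(pos) pos.PageIdx + 1).Distinct().ToList()
        Dim pagesB = positions.Select(Function(item) item.PageIdx + 1).Distinct().ToList()

        Console.WriteLine(String.Join(",", pagesA))       ' 1,4
        Console.WriteLine(pagesA.SequenceEqual(pagesB))   ' True
        Console.WriteLine(pagesA.Count > 1)               ' True: more than one page
    End Sub
End Module
```

The `+ 1` converts a zero-based index into a human-readable page number, and `Distinct()` leaves one entry per page so the count can drive the reject decision.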
