UiPath - Document Understanding - classification

How can we identify whether a PDF has multiple pages with multiple occurrences of the same fields across different pages in UiPath after the classification step? Such documents need to be rejected, per a business requirement, right after classification.

After the classification step, use the Extract Document Data activity.

The Extraction Results object already tracks the page number for every detected field occurrence.

Simply check the results for the target field: if it has multiple occurrences, and those occurrences are on different pages, reject.
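To make that check concrete, here is a minimal sketch of the logic in Python. It does not call any UiPath API: the `(field_name, page_number)` tuples are a hypothetical flattening of whatever occurrence and page data the Extraction Results object gives you, and `fields_on_multiple_pages` is just an illustrative name.

```python
from collections import defaultdict


def fields_on_multiple_pages(occurrences):
    """Return the names of fields that occur on more than one distinct page.

    occurrences: iterable of (field_name, page_number) tuples, a hypothetical
    stand-in for the data held in the extraction results.
    """
    pages_by_field = defaultdict(set)
    for field_name, page in occurrences:
        pages_by_field[field_name].add(page)
    # Keep only fields whose occurrences span at least two different pages.
    return sorted(name for name, pages in pages_by_field.items() if len(pages) > 1)


# Made-up sample data: "Total" repeats on the same page, so it is not flagged.
occurrences = [("Invoice Number", 1), ("Invoice Number", 3), ("Total", 2), ("Total", 2)]
reject = bool(fields_on_multiple_pages(occurrences))
```

Using a set of pages per field means several occurrences on the same page do not trigger a rejection; only occurrences on different pages do.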

Hi @TheBOT, the Extract PDF Page Range activity is handy here. You can use it to split the PDF into pages or page ranges after classification.

Then for each page or range, extract the fields with your extractor or OCR.

Check whether the same field appears on multiple pages by comparing the extracted data.

If duplicates are found across pages, reject the document immediately.

Using Extract PDF Page Range helps you handle pages individually, which makes the check easier and cleaner.

Hey, thanks Yash. But how can we get the page occurrences of the fields from the extraction results? Could you please elaborate on the steps? I have not done this before and am not sure whether there is a method for it.

Thanks Arjun, but how can we check whether the same field appears on multiple pages? If you could explain in a bit more detail, that would be really helpful.

* Use Extract PDF Page Range to split your PDF into single pages.

* Extract the target field from each page and save each value in a list, like fieldValues.

* To check for duplicates, use this LINQ:

fieldValues.GroupBy(Function(x) x).Any(Function(g) g.Count > 1)

If true, you found a duplicate—reject the PDF.

Or, use a loop too…
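For reference, the loop version of the duplicate check could look roughly like this. It is sketched in Python only to show the logic (in a workflow it would be a For Each over fieldValues); `has_duplicate_value` is an illustrative name and the sample values are made up.

```python
def has_duplicate_value(field_values):
    """Return True if any value appears more than once in the list."""
    seen = set()
    for value in field_values:
        if value in seen:
            return True  # second sighting of the same value
        seen.add(value)
    return False


# One extracted value per page (made-up sample data).
field_values = ["INV-001", "INV-002", "INV-001"]
reject = has_duplicate_value(field_values)
```

Note that this, like the GroupBy expression, only flags repeated values. If the field shows up on two pages with different values, nothing is flagged, so when the requirement is "field present on more than one page" it may be safer to count the pages that returned a non-empty value.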

Okay, but if we split the PDF, each part will be treated as a separate document. We need to treat all pages (the merged pages) as one document type and perform this multiple-field-value check. Also, is there any way to do this right after the classification step, without extraction?

Yes, so you need to treat the whole thing as one doc.

After classification, use Read PDF Text to get all text from every page at once.

Use Regex or string parsing to find all occurrences of your target field in the combined text.

Count how many times each field shows up.

If a field appears more than once, reject the document.

This works right after classification and before you start extraction.

Because the text of all pages is read together, the count covers the whole document rather than a single page, so you don’t lose track of how many times a value appears overall.

For example, say the field is “Invoice Number”:

System.Text.RegularExpressions.Regex.Matches(pdfText, "Invoice Number: \d+").Count > 1

If that’s true, there’s a duplicate somewhere in the whole file, and you can safely reject the PDF.
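The same counting idea, sketched in Python so it can be tried outside a workflow. Here `pdf_text` stands in for the combined text that Read PDF Text would return, the sample text is made up, and `count_field_occurrences` is an illustrative helper rather than anything from UiPath.

```python
import re


def count_field_occurrences(pdf_text, pattern=r"Invoice Number: \d+"):
    """Count how many times the field label plus value appears in the text."""
    return len(re.findall(pattern, pdf_text))


# Made-up text standing in for the combined text of a two-page PDF.
pdf_text = "Invoice Number: 1001\nTotal: 50\nInvoice Number: 1002\nTotal: 75\n"
reject = count_field_occurrences(pdf_text) > 1
```

The pattern assumes the label and value are laid out exactly as "Invoice Number: " followed by digits. Real documents may need a looser pattern, for example one that allows extra whitespace.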

Hi @TheBOT

During extraction, use the Form Extractor OCR in the Extract Document Data activity. It extracts the data from the first page of the PDF and automatically skips the other pages if the value is found there. If the value is not found on the first page, it looks on the second page, and so on, until it finds the value.

CHEERS!!

Okay, thanks Yash, but the documents are scanned and can be handwritten as well, so will the Form Extractor work for those? Also, since you said earlier that the extraction result will have the page occurrences of the fields, can you please explain how we can get those page occurrences along with the fields?

Yes, the Form Extractor OCR will work for scanned as well as handwritten documents, on the condition that the handwriting is neat and clean and the letters are not overwritten, so that the extraction accuracy is good. For the page number occurrences, use the Data Extraction Scope and take its output, ExtractionResult. To get the page numbers, try this expression:

field.Value.GetTextPositions().Select(Function(pos) pos.PageIdx + 1)

Hope it helps you
CHEERS!!

Okay, thanks Yash, but may I know what pos is here? Is it an arbitrary variable or an output of the GetTextPositions() function? And is PageIdx a method or also a variable?

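pos is a lambda parameter, so its name is arbitrary: it stands for each item returned by GetTextPositions() in turn. PageIdx is a value stored on that item (its page index), not a method; the + 1 suggests the index is zero-based.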
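A small Python analogue may make the roles of pos and PageIdx easier to see. The `TextPosition` class below is hypothetical and only mimics an object carrying a `PageIdx` value, assumed zero-based because of the + 1 in the expression; it is not the UiPath type.

```python
from dataclasses import dataclass


@dataclass
class TextPosition:
    PageIdx: int  # assumed zero-based page index


# Made-up positions for one field value: found on pages 1, 3 and 3.
positions = [TextPosition(0), TextPosition(2), TextPosition(2)]

# 'pos' is just the name given to each element in turn.
page_numbers = [pos.PageIdx + 1 for pos in positions]

# Renaming the loop variable changes nothing about the result.
same_result = [p.PageIdx + 1 for p in positions]

# The field spans several pages if more than one distinct page number appears.
spans_multiple_pages = len(set(page_numbers)) > 1
```

Collecting the page numbers into a set is one way to turn the expression's output into the reject decision asked about at the start of the thread.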