I have a scanned PDF that is basically several purchase orders merged into one file. I want to split it based on the PO number, which appears on the first page of each PO.
I have already gone through the existing threads where the community recommended the BalaReva activities, but they are not compatible with Windows projects.
I am currently looping through each page, looking for the PO number and splitting the PDF there. However, this takes a lot of time because I am reading every page with OCR.
Please suggest a better, more efficient approach for handling such PDFs.
There was an ML Splitter, but it’s deprecated. However, AI Center still has an ML Package called DocumentSplitter under ML Packages / Out of the box Packages / UiPath Document Understanding / DocumentSplitter (but here, I’m also afraid it’s going to split on page breaks).
UiPath is planning some splitting capabilities in DU modern projects by 24.10 - you might want to keep an eye on that
Python has PDF-handling libraries; I used pymupdf to slice pages. Since you always want to cut based on the PO number, and you can get its coordinates either from pymupdf methods or from DU’s DOM (Document Object Model), you can use those coordinates to slice one document into several images. The stability of this is questionable, but hey, it’s a trade-off.
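To make the pymupdf idea a bit more concrete, here is a minimal sketch, not a tested production workflow. It assumes the PO number sits in a fixed header area of the first page of each PO, so only that clipped area is rendered and sent to OCR instead of the whole page. `PO_PATTERN`, `header_box` and the `ocr_png` callable (any OCR engine that takes PNG bytes and returns text) are placeholders I made up; adjust them to your documents.

```python
import re
from pathlib import Path

# Assumed PO format ("PO# 450012", "PO: 450012", ...); adjust to your documents.
PO_PATTERN = re.compile(r"PO[\s#:.-]*(\d{4,})", re.IGNORECASE)


def extract_po(text):
    """Return the PO number found in OCR text, or None if there is none."""
    match = PO_PATTERN.search(text)
    return match.group(1) if match else None


def page_ranges(po_per_page):
    """Turn per-page PO numbers (None = no PO found) into (po, first, last) tuples.

    A page with a new PO number starts a new range; pages without a PO number
    (or repeating the current one) extend the current range.
    """
    ranges = []
    for idx, po in enumerate(po_per_page):
        if po is not None and (not ranges or po != ranges[-1][0]):
            ranges.append([po, idx, idx])
        elif ranges:
            ranges[-1][2] = idx
    return [tuple(r) for r in ranges]


def split_by_po(src_path, out_dir, ocr_png, header_box=(0, 0, 600, 200)):
    """Split a merged scanned PDF into one file per PO number.

    ocr_png is a hypothetical callable: PNG bytes in, recognised text out.
    header_box is an assumed (x0, y0, x1, y1) region in PDF points.
    """
    import fitz  # PyMuPDF, imported here so the helpers above work without it

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    src = fitz.open(src_path)
    clip = fitz.Rect(*header_box)

    # Render and OCR only the header region of each page, not the full page.
    found = []
    for page in src:
        pix = page.get_pixmap(clip=clip, dpi=200)
        found.append(extract_po(ocr_png(pix.tobytes("png"))))

    # Copy each page range into its own PDF.
    for po, first, last in page_ranges(found):
        part = fitz.open()
        part.insert_pdf(src, from_page=first, to_page=last)
        part.save(str(Path(out_dir) / f"PO_{po}.pdf"))
        part.close()
    src.close()
```

Calling it would look something like `split_by_po("merged.pdf", "out", ocr_png=my_ocr)`, where the file names and `my_ocr` stand in for your own. Pages before the first detected PO are dropped, and if the same PO number shows up again later in the file the earlier output is overwritten, so check the results on a sample first.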