Split PDF based on a pattern unique to the first page

Hello All,

I have a scanned PDF that is essentially a set of merged purchase orders. I want to split it based on the PO number, which appears on the first page of each PO.

I have already gone through the existing threads where the community recommended the BalaReva activities, but they are not compatible with Windows projects.

I am currently looping through each page, looking for the PO number, and splitting the PDF there. However, this process takes a long time because I am reading every page using OCR.

Please suggest a better, more efficient approach for handling such PDFs.

Thanks.

What kind of PDF is it: an image (scanned) or a digital copy?

If you are using Read PDF Text, it should be fast enough.

It is not a native PDF; it is a scanned PDF.

I am using Read PDF With OCR, which takes a lot of time.

Thanks.

@Kiran_A

Ideally, that would be the way. Otherwise, if you are using Document Understanding, you can go with the approach below; @postwick has put it together well.

Cheers

Thanks @Anil_G ,

However, we are not using Action Center in our framework; we use regex to validate the ML output.

I am looking for an efficient solution, even if it involves integrating a third-party service.

Thanks.

Hi @Kiran_A,

Here are three options for you to explore:

  • There was an ML Splitter, but it is deprecated. AI Center still has an ML Package called DocumentSplitter under ML Packages / Out of the box Packages / UiPath Document Understanding / DocumentSplitter (though I'm afraid it will also split on page breaks).
  • UiPath is planning some splitting capabilities in DU modern projects by 24.10; you might want to keep an eye on that.
  • Python has PDF-handling libraries. I used pymupdf to slice pages: since you always want to cut based on the PO number, and you can get coordinates either from pymupdf methods or from DU's DOM (Document Object Model), you can use these coordinates to slice one document into several images. Stability of this is questionable, but hey, quid pro quo :wink:
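
To make the Python option a bit more concrete, here is a rough sketch rather than a tested production solution. It assumes PyMuPDF (imported as `fitz`) and, for the OCR step, `pytesseract` with Pillow. The regex `PO_PATTERN`, the `header_fraction` crop, and all function names are hypothetical and need adjusting to your documents. The idea is to OCR only the strip of each page where the PO number is expected (at a modest DPI) instead of the full page, then cut the PDF wherever a new PO number shows up.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern: adjust it to how PO numbers look in your scans.
PO_PATTERN = re.compile(r"\bPO\s*(?:No\.?|Number|#)?\s*[:\-]?\s*(\d{5,})", re.IGNORECASE)


def find_po_number(page_text: str) -> Optional[str]:
    """Return the first PO number found in the text, or None."""
    match = PO_PATTERN.search(page_text)
    return match.group(1) if match else None


def group_pages(po_per_page: List[Optional[str]]) -> List[Tuple[str, int, int]]:
    """Turn per-page hits into (po_number, first_page, last_page) ranges.

    Page indexes are 0-based and inclusive. A page with no hit is attached
    to the preceding PO; pages before the first hit are skipped.
    """
    groups: List[list] = []
    for index, po in enumerate(po_per_page):
        if po is not None and (not groups or groups[-1][0] != po):
            groups.append([po, index, index])  # a new PO starts here
        elif groups:
            groups[-1][2] = index  # continuation of the current PO
    return [(g[0], g[1], g[2]) for g in groups]


def ocr_po_per_page(src_path: str, header_fraction: float = 0.25) -> List[Optional[str]]:
    """OCR only the top strip of each page and look for a PO number.

    header_fraction is an assumption: the PO number is taken to sit in
    the top quarter of the first page of each order.
    """
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    results: List[Optional[str]] = []
    with fitz.open(src_path) as doc:
        for page in doc:
            r = page.rect
            clip = fitz.Rect(r.x0, r.y0, r.x1, r.y0 + r.height * header_fraction)
            pix = page.get_pixmap(clip=clip, dpi=150)  # small render keeps OCR fast
            image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            results.append(find_po_number(pytesseract.image_to_string(image)))
    return results


def split_pdf(src_path: str, groups: List[Tuple[str, int, int]], out_dir: str) -> None:
    """Write one PDF per PO, named after the PO number."""
    import fitz  # PyMuPDF

    with fitz.open(src_path) as src:
        for po, first, last in groups:
            part = fitz.open()  # new empty PDF
            part.insert_pdf(src, from_page=first, to_page=last)
            part.save(f"{out_dir}/PO_{po}.pdf")
            part.close()
```

Usage would be something like `split_pdf("merged.pdf", group_pages(ocr_po_per_page("merged.pdf")), "out")`, where the file and folder names are placeholders. You could call such a script from UiPath through the Python activities or a command-line step. If the crop misses the PO number on some layouts, widen `header_fraction` or raise the DPI, at the cost of speed.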

Cheers,
Tom