Split pdf based on some pattern unique to first page

Kiran_A · October 17, 2024, 4:09am

Hello All,

I have a scanned PDF that is a merge of several purchase orders. I want to split it based on the PO number, which appears on the first page of each PO.

I have already gone through the existing threads, where the community recommended the BalaReva activities, but they are not compatible with Windows projects.

I am currently looping through each page, looking for the PO number, and splitting the PDF there. However, this takes a long time because I am reading every page with OCR.

Please suggest a better, more efficient approach to handling such PDFs.

Thanks.

Raymond_Hui · October 17, 2024, 4:11am

What kind of PDF is it?

An image (scan) or a digital copy?

If you are using Read PDF Text, it should be fast enough.

Kiran_A · October 17, 2024, 4:50am

It is not a native PDF; it is a scanned PDF.

I am using Read PDF With OCR, which is taking a lot of time.

Thanks.

Anil_G · October 17, 2024, 5:10am

@Kiran_A

Ideally, that would be the way to do it. If not, and you are using Document Understanding, you can go with the approach below; @postwick has put it together well.

Cheers

Kiran_A · October 17, 2024, 6:54am

Thanks @Anil_G ,

However, we are not using Action Center in our framework; we use regex to validate the ML output.

I am looking for an efficient solution, even if it involves integrating a third-party service.

Thanks.

tomasz.wierzbicki · October 20, 2024, 10:07am

Hi @Kiran_A,

Here are 3 options for you to explore:

1. There was an ML Splitter, but it is deprecated. However, AI Center still has an ML Package called DocumentSplitter under ML Packages / Out of the box Packages / UiPath Document Understanding / DocumentSplitter (though I'm afraid it will also split on page breaks).
2. UiPath is planning some splitting capabilities in DU modern projects by 24.10, so you might want to keep an eye on that.
3. Python has PDF-handling libraries. I used pymupdf to slice pages: since you always want to cut based on the PO number, and you can get coordinates either from pymupdf methods or from DU's DOM (Document Object Model), you can use these coordinates to slice one document into several images. The stability of this is questionable, but hey, quid pro quo.

Cheers,
Tom
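
A minimal sketch of the Python option, assuming PyMuPDF (`fitz`), `pytesseract` with a local Tesseract install, and Pillow are available, and that the PO number is printed in the top quarter of each PO's first page. The regex, the `header_fraction` default, the output file naming, and all function names are illustrative assumptions, not anything confirmed in this thread. The idea is to OCR only a header strip of each page at a modest DPI instead of the full page, then split on the pages where a new PO number shows up.

```python
import io
import re

# Hypothetical PO pattern; adjust it to the wording on the real first pages.
PO_PATTERN = re.compile(
    r"\bP\.?O\.?\s*(?:No\.?|Number|#)?\s*[:\-]?\s*(\d{5,})", re.IGNORECASE
)


def find_po(text):
    """Return the PO number found in a piece of OCR text, or None."""
    match = PO_PATTERN.search(text)
    return match.group(1) if match else None


def group_pages(po_per_page):
    """Turn per-page PO numbers (None on continuation pages) into
    (po, first_page, last_page) tuples with 0-based, inclusive page indexes.
    Pages that come before the first detected PO are skipped."""
    groups = []
    for page_no, po in enumerate(po_per_page):
        if po is not None and (not groups or groups[-1][0] != po):
            groups.append([po, page_no, page_no])  # a new PO starts here
        elif groups:
            groups[-1][2] = page_no  # continuation of the current PO
    return [tuple(g) for g in groups]


def ocr_header(page, header_fraction=0.25, dpi=150):
    """OCR only the top strip of one PyMuPDF page (assumed PO location)."""
    import fitz  # PyMuPDF, imported lazily so the helpers above stay standalone
    import pytesseract
    from PIL import Image

    r = page.rect
    clip = fitz.Rect(r.x0, r.y0, r.x1, r.y0 + r.height * header_fraction)
    pix = page.get_pixmap(clip=clip, dpi=dpi)  # render just the strip
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(image)


def split_by_po(src_path, out_dir="."):
    """Write one PDF per detected PO and return the page groups."""
    import fitz

    src = fitz.open(src_path)
    groups = group_pages([find_po(ocr_header(page)) for page in src])
    for po, first, last in groups:
        out = fitz.open()  # new empty PDF
        out.insert_pdf(src, from_page=first, to_page=last)
        out.save(f"{out_dir}/PO_{po}.pdf")
        out.close()
    src.close()
    return groups
```

Calling `split_by_po("merged_pos.pdf", "out")` (a hypothetical file name) would write files such as `out/PO_<number>.pdf`. If OCR misreads a PO number on some page, that page is simply attached to the previous PO, so the regex is worth tightening against real samples before relying on the result.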
