Split a PDF into separate PDFs based on the unique values in the PDF

Hello All,

I am trying to split a PDF into separate PDFs based on the certificate value present in the PDF.

The PDF has 8 pages containing 4 unique certificate values, so the requirement is to split it into 4 PDFs.

I need help and suggestions to implement this logic.

Thanks.

Hi @sushmitha.e

You can classify the PDF document using the certificate number and then split it.
Check the thread below.

Hope this helps!

Hi @sushmitha.e

Use Read PDF With OCR or Read PDF Text to extract the text from each page, then identify the certificate value using regex or string matching. Loop through the pages, group them by certificate value, and then split the PDF accordingly.
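To make the grouping step concrete, here is a minimal Python sketch of the idea (not UiPath code). It assumes the text of each page has already been extracted into a list of strings. The pattern `CERT_PATTERN` and the function name `group_pages_by_certificate` are hypothetical and would need adjusting to how the certificate number is actually printed in the document.

```python
import re

# Hypothetical pattern: adjust it to the wording used in your PDF.
CERT_PATTERN = re.compile(
    r"Certificate\s*(?:No|Number)\.?\s*[:#]?\s*([A-Z0-9-]+)",
    re.IGNORECASE,
)

def group_pages_by_certificate(page_texts):
    """Map each certificate value to the 1-based page numbers where it appears."""
    groups = {}
    for page_number, text in enumerate(page_texts, start=1):
        match = CERT_PATTERN.search(text)
        if match is None:
            continue  # no certificate found on this page; decide how to handle it
        # Pages sharing a certificate value end up in the same list.
        groups.setdefault(match.group(1), []).append(page_number)
    return groups
```

Each key in the result would then correspond to one output PDF, and its list gives the pages to put into it.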

Happy Automation

Hello @prashant1603765, thanks for the reply.

I have currently implemented this logic, as you can see in this XAML file.

Test.xaml (10.9 KB)

The output I am getting is 8 PDFs, since it finds 8 certificate numbers.

The PDF does contain 8 certificate numbers, but pages 1 to 2 have the same certificate number, and so on, as you can see in the PDF below.

Federal MIchigan Corey Certs.pdf (1.6 MB)

So in total there are 4 unique certificate numbers, and the result should be 4 separate PDFs.

Am I missing something in the logic? Please help me resolve it.

Thanks.

Hey @sushmitha.e, you don’t have any logic to check whether certificate values are duplicates. Right now your logic treats each regex match as a new certificate.

You need to collect the certificate numbers together with the pages on which they occur, then compare the certificate numbers. If a page has the same certificate number as the previous one, keep the original start page and just extend the end of the page range to the new page.
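As a rough illustration of that comparison, here is a small Python sketch (not UiPath code). It assumes you already have one certificate value per page, in page order. The function names are hypothetical, and treating a page with no match (`None`) as a continuation of the previous certificate is an assumption you may not want.

```python
def certificate_page_ranges(certs_by_page):
    """Collapse per-page certificate values into (certificate, first_page, last_page) runs.

    certs_by_page: list where index 0 is page 1; None means no certificate was found.
    """
    ranges = []
    for page_number, cert in enumerate(certs_by_page, start=1):
        if ranges and (cert is None or cert == ranges[-1][0]):
            # Same certificate (or, by assumption, a page without a number):
            # keep the start page and extend the end of the current range.
            ranges[-1] = (ranges[-1][0], ranges[-1][1], page_number)
        elif cert is not None:
            # A different certificate value starts a new range.
            ranges.append((cert, page_number, page_number))
        # Pages without a certificate before the first match are skipped.
    return ranges

def to_range_string(first_page, last_page):
    """Format a page range such as '1-2', or '3' for a single page."""
    return str(first_page) if first_page == last_page else f"{first_page}-{last_page}"
```

With made-up values, `["A", "A", "B", "B"]` collapses to two ranges, pages 1-2 and pages 3-4, so two output files instead of four. The range strings could then be passed to whichever activity you use to extract pages.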

You may leverage a library I shared some time ago that allows you to classify pages based on expressions with positive and negative scores:

It may be helpful if you need to match, for example, “Page 1” and other expressions to identify the first page, and you can also include negative expressions in case you see undesired results.
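To show the general idea of scoring with positive and negative expressions, here is a generic Python sketch. It is not the API of the library mentioned above; the expressions, weights, threshold, and function names are all hypothetical and only illustrate the approach.

```python
import re

# Hypothetical expressions and weights, for illustration only.
POSITIVE = [(r"Page\s+1\b", 2), (r"Certificate\s+No", 1)]
NEGATIVE = [(r"continued", 2)]

def first_page_score(text):
    """Add the weight of each positive match and subtract each negative match."""
    score = 0
    for pattern, weight in POSITIVE:
        if re.search(pattern, text, re.IGNORECASE):
            score += weight
    for pattern, weight in NEGATIVE:
        if re.search(pattern, text, re.IGNORECASE):
            score -= weight
    return score

def is_first_page(text, threshold=2):
    """Treat a page as a first page when its score reaches the (assumed) threshold."""
    return first_page_score(text) >= threshold
```

Pages flagged as first pages would mark where each new output PDF starts, and the negative expressions give you a way to suppress false positives.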

Hope it helps