Use Read PDF With OCR or Read PDF Text to extract the text from each page, then identify the certificate value using a regex or string matching. Loop through the pages, group them by certificate value, and use Split PDF.
Hey @sushmitha.e, you don’t have any logic to check whether a certificate value is a duplicate. Right now your logic treats every regex match as a new certificate.
You need to collect the certificate numbers together with the page on which each one occurs, then compare the certificate numbers. If a number has already been seen, don’t create a new certificate: just extend its page range to the new page, or retain the old page number.
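To make the duplicate check concrete, here is a minimal Python sketch of the idea, not the actual workflow. It assumes the page texts have already been extracted into a list, and `CERT_PATTERN` and `group_pages_by_certificate` are hypothetical names; the pattern in particular is a guess and must be adapted to the real certificate format.

```python
import re

# Hypothetical pattern: assumes text like "Certificate No: ABC-123".
# Adjust it to whatever the real documents contain.
CERT_PATTERN = re.compile(r"Certificate\s+No\.?\s*:?\s*([A-Z0-9-]+)")


def group_pages_by_certificate(page_texts):
    """Map each certificate number to the 1-based pages it was found on.

    A certificate number that was already seen does not create a new
    entry; the page is appended to the existing one instead.
    Pages without any match are skipped in this sketch.
    """
    groups = {}
    for page_number, text in enumerate(page_texts, start=1):
        match = CERT_PATTERN.search(text)
        if match is None:
            continue  # no certificate number on this page
        certificate = match.group(1)
        # setdefault returns the existing page list for a duplicate
        # number, or creates an empty list for a new one.
        groups.setdefault(certificate, []).append(page_number)
    return groups


pages = [
    "Certificate No: A-100\nfirst page",
    "Certificate No: A-100\nsecond page",
    "Certificate No: B-200",
]
print(group_pages_by_certificate(pages))
```

Each entry in the result holds the pages to pass to the split step for that certificate, so a repeated number ends up in one output file instead of several.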
You may leverage a library I shared some time ago that allows you to classify based on expressions with positive and negative scores:
It may be helpful if you need to detect, for example, “Page 1” and similar expressions to identify the first page. You can also include negative expressions in case you see undesired results.
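The scoring idea can be sketched roughly as follows in Python. This is not the API of the library mentioned above, only an illustration of the principle; `score_page`, `is_first_page`, the expressions, the weights, and the threshold are all assumptions made up for the example.

```python
import re


def score_page(text, positive, negative):
    """Add the weight of each positive expression found in the text
    and subtract the weight of each negative expression found."""
    score = 0
    for pattern, weight in positive:
        if re.search(pattern, text, re.IGNORECASE):
            score += weight
    for pattern, weight in negative:
        if re.search(pattern, text, re.IGNORECASE):
            score -= weight
    return score


def is_first_page(text, positive, negative, threshold=2):
    """Treat the page as a first page when its score reaches the threshold."""
    return score_page(text, positive, negative) >= threshold


# Hypothetical expressions and weights; tune them against real pages.
POSITIVE = [(r"\bPage\s+1\b", 2), (r"Certificate\s+No", 1)]
NEGATIVE = [(r"\bcontinued\b", 2)]

print(is_first_page("Certificate No: A-100  Page 1 of 3", POSITIVE, NEGATIVE))
print(is_first_page("Page 2 of 3  Certificate No: A-100", POSITIVE, NEGATIVE))
```

If a page is wrongly classified as a first page, adding a negative expression that matches it is usually easier than making the positive expressions stricter.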