How to split a PDF by voucher number when a voucher may span several pages

Description: I have a folder containing several PDF files, and each PDF file has many pages. Some pages contain a “Voucher NO” and some do not. E.g. voucher 223344 appears on the first page and continues through page 4; voucher 876876 appears on the fifth page and continues through page 10.

Requirement: I need to read each PDF page by page, get the text, and check whether it contains a “Voucher NO”. Then I need to save one PDF per “Voucher NO”, including the pages that the voucher continues onto.
E.g. 223344.pdf contains pages 1-4, and 876876.pdf contains pages 5-10.

Remark: I read several similar posts in the forum but didn’t find the answer.

Hi,

Can you share a specific sample file?

Regards,

@Yoichi I have attached it to the post, thanks.

Hi,

Thank you for sharing. It will probably be necessary to use the Document Understanding framework.

First, classify the document.

Then, extract each item from the first page of each document.

Sorry, but it’s difficult to create a sample right away. Can you check the above document for now?

Regards,

@Yoichi

I tested it and think that using “Read PDF With OCR” is OK. Can you give me some advice on the logic below? Thanks.

@Yoichi
Now I can get the text of each PDF page, and I can also get the “Voucher No”. I just don’t know the logic for working out the dynamic page range to split on.

@Lori

Please look at this for splitting a PDF based on a keyword.

Cheers

Hi,

Yes, we can use ReadPdfWithOCR. However, as OCR is not 100% accurate, there might be incorrect characters. In fact, the following sample extracts the PDF pages you expect, but there are some problems because OCR doesn’t always read “Voucher No” correctly. For now, can you try the following sample? (It takes 6 or 7 minutes in my environment.)

Sanple20230113-3L.zip (4.9 MB)
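For readers who cannot open the attachment, the grouping logic asked about above can be sketched outside UiPath. This is only an illustration in Python, not the contents of the sample workflow. It assumes the page texts have already been extracted (for example by OCR), that the label looks like "Voucher NO" followed by digits, and that each voucher's pages are contiguous. The names `find_voucher`, `group_pages` and `write_ranges` are hypothetical, and `write_ranges` assumes the third-party pypdf package is installed.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed label format: "Voucher NO" followed by digits, e.g. "Voucher NO: 223344".
VOUCHER_RE = re.compile(r"Voucher\s*NO\W*(\d+)", re.IGNORECASE)


def find_voucher(page_text: str) -> Optional[str]:
    """Return the voucher number found on a page, or None if the page has none."""
    match = VOUCHER_RE.search(page_text)
    return match.group(1) if match else None


def group_pages(page_texts: List[str]) -> Dict[str, Tuple[int, int]]:
    """Map each voucher number to (first_page, last_page), 1-based and inclusive."""
    ranges: Dict[str, Tuple[int, int]] = {}
    current: Optional[str] = None
    for page_no, text in enumerate(page_texts, start=1):
        voucher = find_voucher(text)
        if voucher is not None:
            current = voucher  # a page with a voucher number starts (or continues) a group
        if current is None:
            continue  # pages before the first voucher belong to no group
        start = ranges[current][0] if current in ranges else page_no
        ranges[current] = (start, page_no)  # extend the group to this page
    return ranges


def write_ranges(src_path: str, ranges: Dict[str, Tuple[int, int]], out_dir: str = ".") -> None:
    """Write one PDF per voucher. Assumes pypdf is installed (imported lazily)."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    for voucher, (first, last) in ranges.items():
        writer = PdfWriter()
        for index in range(first - 1, last):  # pypdf pages are 0-based
            writer.add_page(reader.pages[index])
        writer.write(f"{out_dir}/{voucher}.pdf")
```

With the example from the question, ten page texts where page 1 contains "Voucher NO: 223344" and page 5 contains "Voucher NO: 876876" give the ranges 1-4 and 5-10. A page without a voucher number is simply attached to the most recent voucher, which is what makes the range dynamic.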

If you can switch to a more accurate OCR engine, the results might improve.
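Until a better engine is available, one possible workaround is to match the label loosely and repair look-alike characters in the number. The sketch below is in Python and is an assumption-heavy illustration: the look-alike table, the minimum length of four characters and the function name `find_voucher_loose` are all hypothetical and would need tuning against real OCR output.

```python
import re
from typing import Optional

# Hypothetical table of characters that OCR may produce in place of digits.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"})

# The label part is case-insensitive and tolerates "V0ucher N0"; the number part
# accepts digits plus the look-alike letters listed above.
LOOSE_RE = re.compile(r"(?i:V[o0]ucher\s*N[o0])\W*([0-9OoIlSB]{4,})")


def find_voucher_loose(page_text: str) -> Optional[str]:
    """Return a repaired voucher number, or None if nothing plausible is found."""
    match = LOOSE_RE.search(page_text)
    if not match:
        return None
    raw = match.group(1)
    if not any(ch.isdigit() for ch in raw):
        return None  # all letters: more likely a word than a misread number
    return raw.translate(LOOKALIKES)
```

This trades precision for recall, so a repaired number may still be wrong. It seems safer to log the raw OCR text next to the repaired value and review any mismatches by hand.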

Regards,

@Yoichi

I know this is not accurate; I am just using it as a demo. If the demo passes, we may purchase Abby capture OCR. Please note, thanks.

I will try your method now, thanks.
