How to split a PDF where each voucher may spread over several pages

Lori · January 13, 2023, 1:12am

Description: I have a folder that contains several PDF files, and each PDF file contains many pages. Some pages contain “Voucher NO” and some do not. E.g., 223344 appears on the first page and continues through page 4; 876876 appears on the fifth page and continues through page 10.

Requirement: I need to read each PDF page by page, get the page text, and check whether it contains “Voucher NO”. Then I need to save one PDF per “Voucher NO”, including all the pages that voucher spreads over.
E.g., 223344.pdf contains pages 1-4 and 876876.pdf contains pages 5-10.

Remark: I read several similar topics in the forum and did not find an answer.

Yoichi · January 13, 2023, 1:16am

Hi,

Can you share a specific sample as a file?

Regards,

Lori · January 13, 2023, 1:22am

@Yoichi I have attached it to the post, thanks.

Yoichi · January 13, 2023, 1:45am

Hi,

Thank you for sharing. It is probably necessary to use the Document Understanding framework.

First, classify the document.

Then, extract each item from the top page of each document.

Sorry, but it is difficult to create a sample quickly. Can you check the above document for now?

Regards,

Lori · January 13, 2023, 1:52am

@Yoichi

I tested it and think using “Read PDF With OCR” is OK. Can you give some advice on the logic below? Thanks.

Lori · January 13, 2023, 2:00am

@Yoichi
Now I can get the text of each PDF page, and I can also get the “Voucher No”. I just don’t know the logic for working out the dynamic page range to split on.
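A minimal sketch of the range logic in question, written in Python for illustration rather than as a UiPath workflow. It assumes the page texts are already available as a list of strings (one per page) and that the voucher number is a run of digits following “Voucher NO”. The pattern and the function name `voucher_page_ranges` are hypothetical. The idea is that a page with a new voucher number starts a new range, and every page without one extends the current range.

```python
import re

# Assumed label format: "Voucher NO" followed by optional punctuation and digits.
VOUCHER_RE = re.compile(r"Voucher\s*NO\W*(\d+)", re.IGNORECASE)

def voucher_page_ranges(page_texts):
    """Return (voucher_no, first_page, last_page) tuples with 1-based pages."""
    ranges = []  # each entry is [voucher_no, first_page, last_page]
    for page_no, text in enumerate(page_texts, start=1):
        match = VOUCHER_RE.search(text)
        if match and (not ranges or ranges[-1][0] != match.group(1)):
            # A new voucher number starts a new range on this page.
            ranges.append([match.group(1), page_no, page_no])
        elif ranges:
            # No voucher number (or the same one again): extend the current range.
            ranges[-1][2] = page_no
        # Pages before the first voucher number are skipped.
    return [tuple(r) for r in ranges]

pages = ["Voucher NO: 223344", "", "", "", "Voucher NO: 876876", "", "", "", "", ""]
for voucher_no, first, last in voucher_page_ranges(pages):
    print(f"{voucher_no}.pdf <- pages {first}-{last}")
```

Each resulting `first-last` pair can then be passed to whatever page-range extraction step is used to write the per-voucher PDF.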

Anil_G · January 13, 2023, 2:26am

@Lori

Please look at this for splitting a PDF based on a keyword.

Cheers

Yoichi · January 13, 2023, 3:27am

Hi,

Yes, we can use ReadPdfWithOCR. However, since OCR is not 100% accurate, there may be incorrect characters. In fact, the following sample extracts the PDF pages you expect, but there are some problems because the OCR does not read “Voucher No” correctly. For now, can you try the following sample? (It takes 6 or 7 minutes in my environment.)

Sanple20230113-3L.zip (4.9 MB)

If you can switch to a more accurate OCR engine, the result might improve.
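Until a better engine is available, the matching can also be made more tolerant of typical OCR mistakes. The following Python sketch is only an illustration, not what the attached sample does. It assumes the engine confuses `O` with `0` and `I`/`l` with `1`; the pattern, the substitution table, and the name `extract_voucher_no` are all hypothetical and would need adjusting to the errors actually seen.

```python
import re

# Assumed confusions: "o" read as "0" in the label, letters O/I/L read for digits.
LABEL_RE = re.compile(r"v[o0]ucher\s*n[o0]\W*([0-9OIL]+)\b", re.IGNORECASE)

# Map look-alike letters back to the digits they most likely stand for.
DIGIT_FIXES = str.maketrans("OoIiLl", "001111")

def extract_voucher_no(text):
    """Return the voucher number as a digit string, or None if no label is found."""
    match = LABEL_RE.search(text)
    if not match:
        return None
    return match.group(1).translate(DIGIT_FIXES)

print(extract_voucher_no("V0ucher N0 8768l6"))
```

The trailing `\b` keeps the pattern from treating the first letter of an ordinary word after the label as a misread digit.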

Regards,

Lori · January 13, 2023, 5:01am

@Yoichi

I know this is not accurate; I am just using it as a demo. If the demo passes, we may purchase ABBYY capture OCR. Thanks.

I will try your method now, thanks.

