How to split a PDF file based on text

upendra_koneru · May 21, 2020, 3:55am

Hi,
How can I split a PDF file based on text? For example, in a 30-page file the text appears on certain pages; I want pages 2-4 as one PDF file and pages 5-9 as another PDF file, based on where the text appears.

Thanks,

GreenTea · May 21, 2020, 5:41am

Hi @upendra_koneru,

Please help me understand better what you need.

Do you want to split 1 PDF file into multiple PDF files?
Is the splitting based on some text? That is, if the text on a range of pages matches your criteria, you want to extract those pages into 1 PDF file?

upendra_koneru · May 21, 2020, 6:42am

@GreenTea
please find the answer below

Yes (I have a single PDF file with multiple pages, and I want to convert it into multiple PDF files).
I am searching for a text in the PDF. For example, the text is on pages 2, 7 and 10.
I want to create PDF files like this: pages 2 to 6 as one PDF file, 7 to 9 as another PDF file, and 10 to the last page as another PDF file.

Thanks,

GreenTea · May 21, 2020, 1:49pm

Hi @upendra_koneru,

This use case is tricky. The example provided is dependent on how well the RegEx pattern is crafted by you. Since I do not have the pdf file, you have to provide anchors and ensure the correct page boundaries are identified - not just the key words.

The idea:

1. Read the PDF page by page with the Read PDF Text activity.
2. Search for the text string with the IsMatch activity.
3. If a match (Boolean) is found, add a datarow containing the search text and the starting page number.
4. Increment the page number.
5. Repeat from step 2.
6. When a later match is found, update the previous datarow's ending page number.
7. When the last page is read, update the last datarow's ending page number.
8. Finally, use Extract PDF Page Range to extract the pages.
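
For readers who want to check the page-range logic outside UiPath, here is a minimal Python sketch of the same idea. It is not the attached workflow: `split_ranges` and the sample `pages` list are hypothetical, and the page texts are assumed to have been read already (one string per page).

```python
import re

def split_ranges(page_texts, pattern):
    """Return 1-based (start, end) page ranges, each starting on a matching page."""
    regex = re.compile(pattern)
    # Pages on which the search text is found (1-based page numbers).
    starts = [n for n, text in enumerate(page_texts, start=1) if regex.search(text)]
    # Each range ends just before the next match; the last one ends on the last page.
    ends = [s - 1 for s in starts[1:]] + [len(page_texts)]
    return list(zip(starts, ends))

# Hypothetical 12-page document with the text on pages 2, 7 and 10.
pages = ["cover", "AER A", "x", "x", "x", "x", "AER B", "x", "x", "AER C", "x", "x"]
ranges = split_ranges(pages, r"AER")
print(ranges)
```

Each tuple can then be turned into a range string such as "2-6" for the page-extraction step.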

Note: activity Assign Regex Pattern is to replace a space with \s for regular expression to work correctly. You will need to change it accordingly for the text you are searching
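
As a rough illustration of that note, the Python sketch below builds a pattern from a fixed search text by escaping each word and joining the words with `\s+`. This is an assumption-laden variant, not the Assign Regex Pattern activity itself: the helper name `whitespace_tolerant` is hypothetical, and it uses `\s+` (one or more whitespace characters) rather than a single `\s`.

```python
import re

def whitespace_tolerant(text):
    """Escape regex metacharacters in each word and allow any whitespace between words."""
    parts = [re.escape(word) for word in text.split()]
    return r"\s+".join(parts)

pattern = whitespace_tolerant("**AER = Adverse Reaction Report")
# Matches even if PDF extraction inserted extra spaces or a line break.
found = re.search(pattern, "x **AER  =\nAdverse Reaction   Report y") is not None
print(pattern, found)
```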

The example contains a sample pdf which you can test to verify the workings…
PDFExtract.zip (102.5 KB)

upendra_koneru · May 22, 2020, 4:08am

Hi @GreenTea
Thanks for your response
Can I have the regex pattern for the value "**AER = Adverse Reaction Report"?

Note: Text value is constant

Thanks,

GreenTea · May 22, 2020, 5:15am

Hi @upendra_koneru,

Please try this

(?=**AER\p{Zs}=)([aA-zZ\p{Zs}\p{Po}\p{S}]+)\p{Zs}(?!Report)

upendra_koneru · May 22, 2020, 8:38am

Hi @GreenTea
The above code is not working. For reference, I am sending a screenshot.

Thanks,

GreenTea · May 22, 2020, 5:12pm

Hi @upendra_koneru,

Sorry, I see the issue. I have to enter two backslashes (\\) in the forum for a single backslash (\) to appear. Use the value shown in the image:

(?=\*\*AER\p{Zs}=)([aA-zZ\p{Zs}\p{Po}\p{S}]+)\p{Zs}(?!Report)
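
Why the missing backslashes matter can be checked with a small Python sketch. It only tests the leading lookahead: Python's built-in `re` module does not support the `\p{Zs}` classes used in the .NET pattern above, so a literal space stands in for them here, and the helper `compiles` is hypothetical.

```python
import re

def compiles(pattern):
    """Return True if the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

unescaped = r"(?=**AER =)"    # '*' is read as a quantifier with nothing to repeat
escaped = r"(?=\*\*AER =)"    # '\*' matches a literal asterisk

print(compiles(unescaped), compiles(escaped))
matched = re.search(escaped, "**AER = Adverse Reaction Report") is not None
```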

amittiwari · May 22, 2020, 6:39pm

Please use the `UiPath.PDF.Activities.PDF.ExtractPDFPageRange` activity for splitting PDF files out of a bigger file.

Extracts a specified range of pages from a PDF document.

Properties

DisplayName - The display name of the activity.
FileName - The path of the PDF file you want to extract a range of pages from. This field supports only strings and String variables.
OutputFileName - The name you want to use for the file that is generated from the extracted range of pages. This field supports only strings and String variables.
Password - The password of the PDF file, if necessary. This field supports only strings and String variables.

Input

Range - The range of pages that you want to retrieve. You can specify a single page (e.g. “7”), a range of pages (e.g. “7-12”), or a complex range, (e.g. “2-5, 7, 15-End” or “All”). Only string variables and strings are supported. By default, this field is cleared.
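
To make the range syntax concrete, here is a small Python sketch that expands such a string into page numbers. It is only one reading of the description above, not how the activity parses the value internally; `expand_range` and `page_count` are hypothetical names.

```python
def expand_range(spec, page_count):
    """Expand a range string such as "2-5, 7, 15-End" into 1-based page numbers."""
    if spec.strip().lower() == "all":
        return list(range(1, page_count + 1))
    pages = []
    for part in spec.split(","):
        first, _, last = part.strip().partition("-")
        start = int(first)
        if not last:
            end = start                    # single page, e.g. "7"
        elif last.strip().lower() == "end":
            end = page_count               # open-ended, e.g. "15-End"
        else:
            end = int(last)                # closed range, e.g. "2-5"
        pages.extend(range(start, end + 1))
    return pages

print(expand_range("2-5, 7, 15-End", 16))
```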

kartik_m · May 25, 2021, 4:08am

This was helpful, but I have a slightly different use case: I need to split PDFs out of a main PDF where each range lies between two occurrences of a fixed string. For example, the 1st page has the word “Page1” and the 5th page has the word “Page1” again, so I need to split out pages 1-4. This has to be done for a 300-page PDF, and the range is not constant. Please help.

prasath17 · May 25, 2021, 4:20am

Hi @kartik_m - I have developed a workflow that does exactly that…

This program looks for the text and, when there is a match, adds to a range string that I keep building as shown below… The string is finally passed to PDF Splitter (Bala Reva), or you can use the Extract PDF Page Range activity instead…

So this will split the pages in one shot as shown below…

Let me know if you are looking for something like this…

GreenTea · May 25, 2021, 4:25am

Hi @kartik_m

Since @prasath17 has a solution already, give it a try.

kartik_m · May 25, 2021, 8:53am

May I know how you are building the string? If you could share the XML of it, that would be a great help.

kartik_m · May 25, 2021, 9:18am

@GreenTea, is there a way to achieve this using Matches or regex?

prasath17 · May 25, 2021, 10:57am

@kartik_m - Please find the attached workflow
Split_PDF_Match.zip (761.6 KB)

If you run this as is, you will see 3 files created under the splitted Files folder.

Hope this helps…

kartik_m · May 25, 2021, 11:37am

Yes @prasath17, it almost helped, but I am stuck on one piece of logic:
Page1
Page2- has word “Last page”
Document 1
Document 2
Page1- has the word “Last page”
Document1
Page1- has the word “Last page”
Document1
This is my main PDF format, and I need to split it into the first 4 pages, the next 2 and the last 2. So I am using the words “Last page” as the reference. But when page 2 has the reference words, I am stuck: I miss page 1 in the split PDF, because our logic only considers the 2nd page.

prasath17 · May 25, 2021, 12:13pm

@kartik_m - I got it. Let me see if I can tweak the code…

kartik_m · May 25, 2021, 12:18pm

Yes @prasath17, please. I can't take “Page1” as the reference, because even “Document1, Document2” contain the word Page1. So to identify a slot we need to go with the words “Last page”.
Also, please suggest how to name the split PDFs. I need to name each PDF after data within it: for example, if page 1 has an invoice number in it, the split PDF should be named “InvoiceNo.pdf”.

prasath17 · May 25, 2021, 12:26pm

@kartik_m - I guess it is fixed… Now the text I am looking for is found on page 2 and page 6, so below are my page splits…

Updated xaml:
Main.xaml (38.2 KB)

Please let me know if this is working for you…

kartik_m · May 25, 2021, 1:04pm

@prasath17, it is working for the first slot, but not for the in-between slots.
Page1,- Getting this included
Page2,
Doc,
Doc,
Page1,
Doc,
Page1, - missing this again and making entire range disturbed
Page2,
Doc
Doc,
Can we do something like: if the word ‘Page2’ is found, then use (initialRange - 1)?
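
The offset idea in this last post could look roughly like the Python sketch below. It is not the attached workflow and rests on assumptions: every page containing “Last page” is assumed to also carry its own “PageN” label, and a slot is assumed to start N - 1 pages before that page. The names `slot_ranges` and `pages` are hypothetical.

```python
import re

def slot_ranges(page_texts, marker="Last page"):
    """Return 1-based (start, end) ranges, one per slot identified by the marker."""
    starts = []
    for page_no, text in enumerate(page_texts, start=1):
        if marker.lower() not in text.lower():
            continue
        # "PageN" on the marker page tells how many pages back the slot began.
        label = re.search(r"Page\s*(\d+)", text, re.IGNORECASE)
        offset = int(label.group(1)) - 1 if label else 0
        starts.append(page_no - offset)
    # Each slot ends just before the next one starts; the last ends on the final page.
    ends = [s - 1 for s in starts[1:]] + [len(page_texts)]
    return list(zip(starts, ends))

# Layout from the post above: Page1, Page2, Doc, Doc, Page1, Doc, Page1, Page2, Doc, Doc
pages = ["Page1", "Page2 Last page", "Doc", "Doc", "Page1 Last page",
         "Doc", "Page1", "Page2 Last page", "Doc", "Doc"]
print(slot_ranges(pages))
```

With this layout, the slot starts move back to pages 1 and 7 instead of pages 2 and 8, so no “Page1” page is dropped from a split.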
