How to split a PDF file based on text

Hi,
How can I split a PDF file based on text? For example, in a 30-page file where the text appears on certain pages, I want pages 2-4 in one PDF file and pages 5-9 in another, depending on where the text is found.

Thanks,

Hi @upendra_koneru,

Please help me understand better what you need.

  1. Do you want to split one PDF file into multiple PDF files?
  2. Is the split based on text, that is, if the text on a range of pages matches your criteria, you want to extract that range into one PDF file?

@GreenTea
Please find the answers below.

  1. Yes. I have a single PDF file with multiple pages, and I want to convert it into multiple PDF files.
  2. I am searching for text in the PDF. For example, the text appears on pages 2, 7 and 10.
    I want to create PDF files like this: pages 2 to 6 in one file, 7 to 9 in another, and 10 to the last page in a third.

Thanks,

Hi @upendra_koneru,

This use case is tricky. The example I have provided depends on how well you craft the RegEx pattern. Since I do not have the PDF file, you have to provide the anchors and make sure the correct page boundaries are identified - not just the keywords.

The idea:

  1. Read the PDF page by page with the Read PDF Text activity.
  2. Search the page text for the string with the IsMatch activity.
  3. If a match (Boolean) is found, add a datarow containing the search text and the starting page number.
  4. Increment the page number.
  5. Repeat step 2 for the next page.
  6. When the next match is found, update the ending page number of the previous datarow.
  7. When the last page is read, update the ending page number of the last datarow.
  8. Finally, use Extract PDF Page Range to extract the pages.
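For anyone who wants to check the boundary logic outside of Studio, here is a rough Python sketch of the steps above. It is not the attached workflow. It assumes the text of every page has already been read into a list, and the function name `find_split_ranges` and the sample marker `HEADER` are made up for illustration.

```python
import re
from typing import List, Tuple


def find_split_ranges(page_texts: List[str], pattern: str) -> List[Tuple[int, int]]:
    """Return (start_page, end_page) pairs, 1-indexed and inclusive.

    A range starts on every page whose text matches the pattern and ends
    on the page before the next match (or on the last page of the file).
    """
    regex = re.compile(pattern)
    # Steps 1-5: walk the pages and remember where a match starts.
    starts = [n for n, text in enumerate(page_texts, start=1) if regex.search(text)]
    total = len(page_texts)
    ranges = []
    for i, start in enumerate(starts):
        # Steps 6-7: the end page is known once the next match (or the end) is reached.
        end = starts[i + 1] - 1 if i + 1 < len(starts) else total
        ranges.append((start, end))
    return ranges


# Hypothetical 12-page file with the marker on pages 2, 7 and 10.
pages = ["cover"] + ["HEADER first" if n in (2, 7, 10) else "body" for n in range(2, 13)]
print(find_split_ranges(pages, r"HEADER"))  # [(2, 6), (7, 9), (10, 12)]
```

Pages before the first match (page 1 here) are not part of any range, which matches the 2-6, 7-9, 10-to-last example earlier in the thread.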

Note: the Assign Regex Pattern activity replaces each space with \s so that the regular expression works correctly. You will need to change it to suit the text you are searching for.
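To illustrate what that replacement does, here is a small Python sketch of the same idea. It is an assumption-laden stand-in, not the Assign activity itself: `build_pattern` is a hypothetical helper, the sample text is invented, and it uses `\s+` rather than a single `\s` so that line breaks and repeated spaces in the extracted text are tolerated as well.

```python
import re


def build_pattern(search_text: str) -> str:
    """Turn literal search text into a regex that tolerates any whitespace.

    Each word is escaped so characters such as * or ( are matched literally,
    and the gaps between words become \\s+.
    """
    tokens = search_text.split()
    return r"\s+".join(re.escape(token) for token in tokens)


# Hypothetical search text; replace with the text you are looking for.
pattern = build_pattern("Total Amount (USD)")
print(pattern)
print(bool(re.search(pattern, "Total   Amount\n(USD) 120.00")))
```

Escaping each word separately means the special characters in the text never need to be escaped by hand.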

The example contains a sample pdf which you can test to verify the workings…
PDFExtract.zip (102.5 KB)

Hi @GreenTea
Thanks for your response
Can I have the regex pattern for the value " **AER = Adverse Reaction Report "?

Note: the text value is constant.

Thanks,

Hi @upendra_koneru,

Please try this

(?=**AER\p{Zs}=)([aA-zZ\p{Zs}\p{Po}\p{S}]+)\p{Zs}(?!Report)

Hi @GreenTea
The pattern above is not working. I am attaching a screenshot for reference.

image
Thanks,

Hi @upendra_koneru,

Sorry, I see the issue. I have to type two backslashes \\ in the forum to make a single backslash \ appear. Use this value:

(?=\*\*AER\p{Zs}=)([aA-zZ\p{Zs}\p{Po}\p{S}]+)\p{Zs}(?!Report)
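The difference the backslashes make can be reproduced in Python, shown here only as an illustration. Python's `re` module does not support `\p{Zs}`, so this sketch swaps in `\s` and a simplified literal pattern for the constant text; it is not the pattern above. The helper name `compiles` is made up.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Without the backslashes, * is a quantifier with nothing in front of it.
unescaped = r"**AER\s=\sAdverse\sReaction\sReport"
# With the backslashes, \*\* matches two literal asterisks.
escaped = r"\*\*AER\s=\sAdverse\sReaction\sReport"

print(compiles(unescaped))  # False
print(compiles(escaped))    # True
print(bool(re.search(escaped, "Form 12 **AER = Adverse Reaction Report page 1")))  # True
```

Since the text value is constant, a literal pattern like `escaped` may be enough for IsMatch, but that is a suggestion to test against your own file.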

Please use the `UiPath.PDF.Activities.PDF.ExtractPDFPageRange` activity to split PDF files out of a bigger file.

Extracts a specified range of pages from a PDF document.

Properties

  • DisplayName - The display name of the activity.
  • FileName - The path of the PDF file you want to extract a range of pages from. This field supports only strings and String variables.
  • OutputFileName - The name you want to use for the file that is generated from the extracted range of pages. This field supports only strings and String variables.
  • Password - The password of the PDF file, if necessary. This field supports only strings and String variables.

Input

  • Range - The range of pages that you want to retrieve. You can specify a single page (e.g. “7”), a range of pages (e.g. “7-12”), or a complex range, (e.g. “2-5, 7, 15-End” or “All”). Only string variables and strings are supported. By default, this field is cleared.
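To make the Range syntax concrete, here is a Python sketch of how a string such as “2-5, 7, 15-End” could be expanded into page numbers. This is only my reading of the description above, not how the activity is implemented; `expand_range` and its `total_pages` parameter are hypothetical.

```python
from typing import List


def expand_range(range_spec: str, total_pages: int) -> List[int]:
    """Expand a range string like "2-5, 7, 15-End" into 1-indexed page numbers."""
    spec = range_spec.strip()
    if spec.lower() == "all":
        return list(range(1, total_pages + 1))
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (piece.strip() for piece in part.split("-", 1))
            start = int(first)
            # "End" stands for the last page of the document.
            end = total_pages if last.lower() == "end" else int(last)
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


print(expand_range("2-5, 7, 15-End", total_pages=17))  # [2, 3, 4, 5, 7, 15, 16, 17]
```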

This was helpful, but I have a slightly different use case: I need to split a PDF out of a main PDF where the range lies between two occurrences of a fixed string. For example, the 1st page has the word “Page1” and the 5th page has the word “Page1”, so I need to split out pages 1-4. This has to be done for a 300-page PDF, and the range is not constant. Please help.

Hi @kartik_m - I have developed a workflow that does exactly this…

It looks for the text on each page and, whenever there is a match, it adds to a range string that I keep building as shown below… The final string is then passed to PDF Splitter (Bala Reva), or you can use the Extract PDF Page Range activity instead…

image

This splits the pages in one shot, as shown below…

image

Let me know if you are looking for something like this…

Hi @kartik_m

Since @prasath17 has a solution already, give it a try.

May I know how you are building the string? If you could share the XAML for it, that would be a great help.

@GreenTea is there a way to achieve using matches or regex?

@kartik_m - Please find the attached workflow
Split_PDF_Match.zip (761.6 KB)

If you run this as is, you will see 3 files created under the splitted Files folder.

Hope this helps…

Yes @prasath17, it almost helped, but I am stuck on one piece of logic:
Page1
Page2- has word “Last page”
Document 1
Document 2
Page1- has the word “Last page”
Document1
Page1- has the word “Last page”
Document1
This is the format of my main PDF, and I need to split it into the first 4 pages, the next 2 and the last 2. I am using the words “Last page” as the reference, but when page 2 carries the reference words I get stuck: page 1 is missing from the split PDF, because the logic only picks up from the 2nd page.

@kartik_m - I got it. Let me see if I can tweak the code…

image

Yes @prasath17, please. I can't use “Page1” as the reference, because even “Document1” and “Document2” contain the word Page1. So to identify each slot we need to go with the words “Last page”.
Also, please suggest how to name the split PDFs. I need to name each PDF after data inside it: for example, if page 1 contains an invoice number, the split PDF should be named “InvoiceNo.pdf”.
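On the naming question, one possible approach is to run a second regex over the first page of each slot and use the captured value as the file name. The Python sketch below only illustrates that idea: the label “Invoice No”, the number format and the helper `output_name` are all assumptions, since the actual invoice layout is not shown in this thread.

```python
import re

# Assumed label and number format; adjust to the real invoice text.
INVOICE_PATTERN = re.compile(r"Invoice\s*No\b\.?\s*[:#]?\s*([A-Za-z0-9-]+)", re.IGNORECASE)


def output_name(first_page_text: str, fallback: str) -> str:
    """Build an output file name from the invoice number found on a page.

    Falls back to the given name when no invoice number is found.
    """
    match = INVOICE_PATTERN.search(first_page_text)
    stem = match.group(1) if match else fallback
    # Replace characters that are not allowed in Windows file names.
    stem = re.sub(r'[\\/:*?"<>|]', "_", stem)
    return stem + ".pdf"


print(output_name("Invoice No: INV-1042\nDate: 01/02/2020", fallback="split_1"))  # INV-1042.pdf
print(output_name("no number on this page", fallback="split_1"))  # split_1.pdf
```

The resulting string would then go into the OutputFileName property of Extract PDF Page Range.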

@kartik_m - I think it is fixed… The text I am looking for is now found on page 2 and page 6, so my page splits are as below…

image

Updated xaml:
Main.xaml (38.2 KB)

Please let me know if this is working for you…

@prasath17, it's working for the first slot but not for the in-between slots.
Page1, - this is now included
Page2,
Doc,
Doc,
Page1,
Doc,
Page1, - this is missed again, which throws off the entire range
Page2,
Doc
Doc,
Can we do something like: if the word ‘Page2’ is found, then use (initialRange-1)?
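That (initialRange-1) idea can be tried out in a few lines of Python before changing the workflow. This is an untested-against-your-file sketch, not the Main.xaml logic. It assumes that each slot has exactly one page containing “Last page”, that this page also contains the word “Page2” whenever the slot has a two-page form, and that the page texts are already in a list. The function name `split_ranges` is made up.

```python
from typing import List


def split_ranges(page_texts: List[str], marker: str = "Last page",
                 second_page_word: str = "Page2") -> List[str]:
    """Build range strings for slots identified by a "Last page" marker."""
    starts: List[int] = []
    for number, text in enumerate(page_texts, start=1):
        if marker.lower() not in text.lower():
            continue
        # If the marker sits on Page2, the slot really began one page earlier.
        if second_page_word in text and number > 1:
            starts.append(number - 1)
        else:
            starts.append(number)
    ranges: List[str] = []
    for i, start in enumerate(starts):
        if i + 1 < len(starts):
            end = starts[i + 1] - 1
            ranges.append(str(start) if end == start else f"{start}-{end}")
        else:
            ranges.append(f"{start}-End")
    return ranges


# The 10-page layout described above, with made-up page text.
pages = [
    "Page1", "Page2 Last page", "Doc Page1", "Doc Page1",
    "Page1 Last page", "Doc Page1",
    "Page1", "Page2 Last page", "Doc Page1", "Doc Page1",
]
print(split_ranges(pages))  # ['1-4', '5-6', '7-End']
```

Because “Page2” is only checked on pages that also contain the marker, a “Page1” or “Page2” inside the Doc pages does not disturb the ranges.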