Scenario: A single PDF contains multiple PDFs. I want to split out the inner PDFs based only on a unique header, not on page numbers, because the pages are not numbered.
For example, an 80-page PDF would be split like this:
1st to 10th page = 1st PDF (unique text present on 1st page)
11th to 25th page = 2nd PDF (unique text present on 2nd page)
26th to 30th page = 3rd PDF (unique text present on 3rd page)
31st to 50th page = 4th PDF (unique text present on 4th page)
The split needs to be automated, and every split PDF should be saved as a separate file in one folder.
This is how I want to split the PDF. Also, please let me know if the PDF can be split some other way.
The classifier doesn’t split the pages itself, but its output tells you which pages belong to each document. You then use the Extract PDF Range activity to separate the file into multiple documents.
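To picture the header-based idea from the question outside of UiPath, here is a minimal Python sketch. It assumes the text of each page has already been extracted into a list of strings (for example by a PDF text-reading or OCR step), and the header strings are hypothetical placeholders. It only computes the page ranges; the actual extraction would still be done by something like Extract PDF Range.

```python
from typing import List, Tuple

def split_ranges_by_header(page_texts: List[str], headers: List[str]) -> List[Tuple[str, int, int]]:
    """Return (header, first_page, last_page) per sub-document, pages 1-based.

    A page starts a new sub-document when its text contains one of the headers.
    Pages before the first matching header are not assigned to any document.
    """
    starts = []  # (header, 1-based page number) for every page that opens a document
    for page_number, text in enumerate(page_texts, start=1):
        for header in headers:
            if header in text:
                starts.append((header, page_number))
                break  # one matching header per page is enough

    ranges = []
    for index, (header, first_page) in enumerate(starts):
        # A document runs until the page before the next start, or to the end of the file.
        if index + 1 < len(starts):
            last_page = starts[index + 1][1] - 1
        else:
            last_page = len(page_texts)
        ranges.append((header, first_page, last_page))
    return ranges

# Hypothetical 6-page file with three sub-documents.
pages = ["HEADER A intro", "body", "body", "HEADER B intro", "body", "HEADER C intro"]
print(split_ranges_by_header(pages, ["HEADER A", "HEADER B", "HEADER C"]))
# [('HEADER A', 1, 3), ('HEADER B', 4, 5), ('HEADER C', 6, 6)]
```

This only works if the header text appears on the first page of each sub-document and nowhere else, which is why a classification step with human review can be the safer choice.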
Here I am looping through the classification action’s output:
TypeArgument is UiPath.DocumentProcessing.Contracts.Results.ClassificationResult
Then inside the loop you use Extract PDF Range and provide it the Range based on the classification output:
This comes from the Wait for Document Classification Action (Action Center)
It’s the filename of the original PDF file that you need to split up.
This is the filename of the document (only certain pages) that is extracted from the original PDF. We are using taxonomy information from the classification steps but you can formulate your own new filename any way you want.
The expression in Extract PDF Range is: Path.Combine(loanPackageFolder,itemFilename)
This is how itemFilename is built: Now.ToString("yyyy-MM-dd") + " " + FormTaskObj.GetDataJsonObject("Application").ToString + " " + FormTaskObj.GetDataJsonObject("loanNumber").ToString + " " + itemFilenameAbbreviation + Path.GetExtension(sourceFileInfo.FullName)
FormTaskObj comes from the Wait for Form Task and Resume activity:
Create Form Action in Action Center - a user then enters information like Customer Name, Customer Number, Loan Number, etc. from the document
Create Document Classification Action in Action Center - here the user splits up the original document and classifies each sub-document using the taxonomy
Split the original document and save as individual documents per the output of the classification action
The expression in Extract PDF Range is: Path.Combine(loanPackageFolder,itemFilename). I don’t understand loanPackageFolder. Does loanPackageFolder mean my PDF?
I didn’t understand the following point. Can you please elaborate on it?
It’s just a variable containing a path to where we want to save the new PDF files.
This just gives the current date in yyyy-MM-dd format.
FormTaskObj is the output object from the Form Action where the user types in things like Customer Name, Loan Number, Application Number, etc. We use these things to name the new files.
We have a spreadsheet that contains abbreviations for the taxonomy Document Names, because the actual data from the taxonomy is too long for filenames.
This gets the extension of the original file. I get sourceFileInfo with the Get File Info activity.
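Putting those pieces together, the filename logic can be sketched in Python as follows. This is only an illustration of what the VB expression does, not UiPath code: the function names are hypothetical, and the values passed in stand in for the Form Task fields, the abbreviation, and the source file path.

```python
import os
from datetime import date

def build_item_filename(today: date, application: str, loan_number: str,
                        abbreviation: str, source_path: str) -> str:
    """Compose '<date> <application> <loan number> <abbreviation><ext>'."""
    # Reuse the original file's extension, like Path.GetExtension(sourceFileInfo.FullName).
    extension = os.path.splitext(source_path)[1]
    # Date formatted as yyyy-MM-dd, like Now.ToString("yyyy-MM-dd").
    return f"{today:%Y-%m-%d} {application} {loan_number} {abbreviation}{extension}"

def build_output_path(folder: str, filename: str) -> str:
    """Join the target folder and the new filename, like Path.Combine."""
    return os.path.join(folder, filename)

# Hypothetical values standing in for the Form Task data.
name = build_item_filename(date(2024, 3, 5), "APP123", "LN456", "W2", "package.pdf")
print(name)  # 2024-03-05 APP123 LN456 W2.pdf
```

The folder is only where the new files are written; it is independent of where the original PDF lives.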
Correct, the Form Task is just our step to have them enter general information about the documents in the PDF, information that applies to all the individual documents.
Abbreviations - @postwick, I actually have no idea about this. Can you please explain how it works?
Do I need to change the range below, or can I use it as it is? Range: (item.DocumentBounds.StartPage + 1).ToString + "-" + (item.DocumentBounds.StartPage + item.DocumentBounds.PageCount).ToString
Please help me with these 2 points. It would be a big help.
I use Read Range to load that spreadsheet into a datatable, and then, based on the classification result, use Lookup Data Table to get the Filename Abbreviation to use for the file.
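As a rough picture of that lookup, here is a small Python sketch. The column names DocumentName and Abbreviation, the sample rows, and the fallback value are all assumptions made for illustration; the real spreadsheet layout is not shown in this thread.

```python
import csv
import io
from typing import Dict

def load_abbreviations(csv_text: str) -> Dict[str, str]:
    """Read a two-column table (DocumentName, Abbreviation) into a dictionary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["DocumentName"]: row["Abbreviation"] for row in reader}

def abbreviation_for(document_name: str, abbreviations: Dict[str, str],
                     default: str = "DOC") -> str:
    """Return the short name for a taxonomy document name, or a fallback."""
    return abbreviations.get(document_name, default)

# Hypothetical table contents standing in for the spreadsheet.
table = (
    "DocumentName,Abbreviation\n"
    "Uniform Residential Loan Application,URLA\n"
    "Wage and Tax Statement,W2\n"
)
lookup = load_abbreviations(table)
print(abbreviation_for("Wage and Tax Statement", lookup))  # W2
```

Having a fallback value keeps an unexpected document type from producing a filename with a blank part.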
Do I need to change the range below, or can I use it as it is? Range: (item.DocumentBounds.StartPage + 1).ToString + "-" + (item.DocumentBounds.StartPage + item.DocumentBounds.PageCount).ToString
It should work as-is, as long as your For Each element is named item. If it’s currentItem or something else, just change item to currentItem or whatever you’ve used in your For Each.
Aside from that, the properties of the classification object will be the same, i.e. DocumentBounds.StartPage and DocumentBounds.PageCount.
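To see what the range expression produces, here is the same arithmetic as a small Python sketch. It assumes StartPage is zero-based, which is what the + 1 in the expression suggests; the function name is just for illustration.

```python
def page_range(start_page: int, page_count: int) -> str:
    """Build a 'first-last' page range string from a zero-based start and a page count."""
    first = start_page + 1          # convert zero-based StartPage to a one-based page number
    last = start_page + page_count  # one-based number of the document's last page
    return f"{first}-{last}"

# The four documents from the example in the question.
print(page_range(0, 10))   # 1-10
print(page_range(10, 15))  # 11-25
print(page_range(25, 5))   # 26-30
print(page_range(30, 20))  # 31-50
```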