Is it possible to split a document using an ML classifier? - Document Understanding

Smitesh_Aher1 · August 22, 2023, 11:28am

Hi Team,

I want to split a PDF through automation.

Scenario: One PDF contains multiple documents. I want to split out these inner documents based only on a unique header, not on page numbers, because the pages are not numbered.

For example, an 80-page PDF would be split like this:

Pages 1 to 10 = 1st PDF (unique text present on its first page)
Pages 11 to 25 = 2nd PDF (unique text present on its first page)
Pages 26 to 30 = 3rd PDF (unique text present on its first page)
Pages 31 to 50 = 4th PDF (unique text present on its first page)

The PDFs need to be split by automation, and each split PDF should be saved separately in one folder.

This is how I want to split the PDF. Also, please let me know if the PDF can be split in some other way.

Thanks & Regards,
Smitesh.

Smitesh_Aher1 · August 22, 2023, 11:40am

Sorry @Praveen_Mudhiraj … It’s a confidential document.

The PDF looks like this:

Header: Feedback

PDF ‘A’ (one big PDF) contains pages 1, 2, 3, 4, 5, 6, 7, … 80.

1 (header is present on the 1st page), 2, 3, 4: first PDF
5 (header is present on the 5th page), 6, 7, 8, …: second PDF

I want to split the PDF this way.

We can split this PDF manually with the Extract PDF Range activity, but I want to do it through automation.
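The header-based grouping described above can be sketched in Python to make the logic concrete. This is only an illustration, not a UiPath workflow: it assumes the text of each page has already been extracted into a list, and `split_ranges_by_header` and the sample page texts are hypothetical names and data.

```python
from typing import List, Tuple


def split_ranges_by_header(page_texts: List[str], header: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges, one per sub-document."""
    # Every page whose text contains the header starts a new sub-document.
    starts = [i + 1 for i, text in enumerate(page_texts) if header in text]
    ranges = []
    for idx, start in enumerate(starts):
        # A sub-document ends on the page before the next header,
        # or on the last page of the PDF if no further header follows.
        end = starts[idx + 1] - 1 if idx + 1 < len(starts) else len(page_texts)
        ranges.append((start, end))
    # Note: pages that come before the first header are not assigned to any range.
    return ranges


# Hypothetical page texts: the header appears on pages 1 and 5.
pages = ["Feedback\nJohn", "p2", "p3", "p4", "Feedback\nJane", "p6", "p7", "p8"]
print(split_ranges_by_header(pages, "Feedback"))  # [(1, 4), (5, 8)]
```

Each resulting pair could then be turned into a range string such as "1-4" for a page-range extraction step.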

Praveen_Mudhiraj · August 22, 2023, 11:45am

Can you try this XAML?

PDF.zip (1.8 KB)

@Smitesh_Aher1

Smitesh_Aher1 · August 22, 2023, 12:10pm

I already tried this, and it is not working.

postwick · August 22, 2023, 12:37pm

The classifier doesn’t split the pages, but it gives you output that tells you which pages belong to each document. You then use the Extract PDF Range activity to separate the file into multiple documents.

Here I am looping through the classification action’s output:

TypeArgument is UiPath.DocumentProcessing.Contracts.Results.ClassificationResult

Then inside the loop you use Extract PDF Range and provide it the Range based on the classification output:

Range property: (item.DocumentBounds.StartPage + 1).ToString + "-" + (item.DocumentBounds.StartPage + item.DocumentBounds.PageCount).ToString
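To show the arithmetic behind that Range expression, here is a small Python sketch. It assumes, as the "+ 1" in the expression suggests, that StartPage is zero-based while the range string is one-based; `page_range` is a hypothetical helper name, not a UiPath API.

```python
def page_range(start_page: int, page_count: int) -> str:
    """Mirror of the Range expression: zero-based start plus page count -> "first-last"."""
    first = start_page + 1          # convert the zero-based start to a one-based page number
    last = start_page + page_count  # last page of the document, one-based and inclusive
    return f"{first}-{last}"


print(page_range(0, 10))   # 1-10
print(page_range(10, 15))  # 11-25
```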

Smitesh_Aher1 · August 22, 2023, 2:06pm

Thanks @postwick for providing solution.

If possible, could you please share the workflow so I can understand it better?

Thanks & Regards,
Smitesh

postwick · August 22, 2023, 2:09pm

I cannot send you our workflow due to compliance issues.

What more do you need to know?

Smitesh_Aher1 · August 22, 2023, 2:13pm

Okay @postwick, no problem.

I don't understand these:
1) validatedClassificationResults

2) File Name: is this the PDF path?

3) Output file name

Please let me know about these.
I am a beginner in UiPath.

postwick · August 22, 2023, 2:17pm

validatedClassificationResults comes from the Wait for Document Classification Action (Action Center).

File Name is the filename of the original PDF file that you need to split up.

Output file name is the filename of the document (only certain pages) that is extracted from the original PDF. We are using taxonomy information from the classification steps, but you can formulate your own new filename any way you want.

The expression in Extract PDF Range is:
Path.Combine(loanPackageFolder,itemFilename)

This is how itemFilename is built:
Now.ToString("yyyy-MM-dd") + " " + FormTaskObj.GetDataJsonObject("Application").ToString + " " + FormTaskObj.GetDataJsonObject("loanNumber").ToString + " " + itemFilenameAbbreviation + Path.GetExtension(sourceFileInfo.FullName)

FormTaskObj comes from the Wait for Form Task and Resume activity.
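For readers who find the itemFilename expression above hard to parse, the same concatenation can be sketched in Python. This is an illustration only: `build_item_filename` is a hypothetical helper, and the sample values stand in for the form fields and the abbreviation used in the thread.

```python
from datetime import date
from pathlib import Path


def build_item_filename(today: date, application: str, loan_number: str,
                        abbreviation: str, source_path: str) -> str:
    """Join date, form values and abbreviation with spaces, then append the source extension."""
    ext = Path(source_path).suffix  # e.g. ".pdf", like Path.GetExtension in the expression
    return f"{today:%Y-%m-%d} {application} {loan_number} {abbreviation}{ext}"


# Hypothetical sample values.
print(build_item_filename(date(2023, 8, 22), "APP123", "LN456", "FBK", "package.pdf"))
# 2023-08-22 APP123 LN456 FBK.pdf
```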

postwick · August 22, 2023, 2:22pm

A high-level overview of our process:

1) Collect the PDF file from a folder
2) Create a Form Action in Action Center - a user then enters information like Customer Name, Customer Number, Loan Number, etc. from the document
3) Create a Document Classification Action in Action Center - here the user splits up the original document and classifies each sub-document using the taxonomy
4) Split the original document and save it as individual documents per the output of the classification action

Smitesh_Aher1 · August 22, 2023, 2:23pm

Thank You @postwick .

I’ll work on this solution and will update you if it’s working successfully.

Thanks & Regards,
Smitesh.

Smitesh_Aher1 · August 23, 2023, 5:01am

Good Morning @postwick

You said the expression in Extract PDF Range is:
Path.Combine(loanPackageFolder,itemFilename)
I don't understand loanPackageFolder. Does it mean the path to my PDF?

I also didn't understand the following parts. Could you please elaborate on them?

Now.ToString("yyyy-MM-dd")

FormTaskObj.GetDataJsonObject("Application").ToString

FormTaskObj.GetDataJsonObject("loanNumber").ToString

itemFilenameAbbreviation

Path.GetExtension(sourceFileInfo.FullName)

postwick · August 23, 2023, 12:32pm

loanPackageFolder is just a variable containing the path to where we want to save the new PDF files.

Now.ToString("yyyy-MM-dd") just gives the current date in yyyy-MM-dd format.

FormTaskObj is the output object from the Form Action where the user types in things like Customer Name, Loan Number, Application Number, etc. We use these things to name the new files.

itemFilenameAbbreviation comes from a spreadsheet that contains abbreviations for the taxonomy Document Names, because the actual data from the taxonomy is too long for filenames.

Path.GetExtension(sourceFileInfo.FullName) gets the extension of the original file. I get sourceFileInfo with the Get File Info activity.

Smitesh_Aher1 · August 23, 2023, 12:53pm

Thanks @postwick

Now I get it. But I think the ‘Wait for Form Task and Resume’ activity is not needed for splitting the PDF.

All I want is to split the PDF and store the resulting files in a folder one by one. Is the activity needed for this?

What is your suggestion?

postwick · August 23, 2023, 12:57pm

Correct, the Form Task is just our step to have them enter general information about the documents in the PDF, information that applies to all the individual documents.

Smitesh_Aher1 · August 23, 2023, 1:10pm

Cheers @postwick .

So itemFilename will look like this:

itemFilename = Now.ToString("yyyy-MM-dd") + itemFilenameAbbreviation + Path.GetExtension(sourceFileInfo.FullName)

I think the abbreviations are not needed for splitting.
Am I right, @postwick?

postwick · August 23, 2023, 1:54pm

Correct, but of course you need to create a unique filename for each split PDF.

Smitesh_Aher1 · August 23, 2023, 5:28pm

Yes, right, I need a unique name for each file.

Abbreviations: @postwick, I actually have no idea about this. Can you please explain how it works?

Do I need to change the range below, or can I use it as it is?
Range: (item.DocumentBounds.StartPage + 1).ToString + "-" + (item.DocumentBounds.StartPage + item.DocumentBounds.PageCount).ToString

Please help me with these two points. It would be a big help for me.

Thanks & Regards,
Smitesh

postwick · August 23, 2023, 5:39pm

We have a spreadsheet where we define the abbreviation for each taxonomy item.

I use Read Range to load that into a DataTable, and then, based on the classification result, use Lookup Data Table to get the Filename Abbreviation to use for the file.
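The lookup idea can be pictured with a short Python sketch. It only illustrates the table-then-lookup pattern and is not the Read Range or Lookup Data Table activities themselves; the column names and sample rows are hypothetical.

```python
import csv
import io

# Hypothetical spreadsheet contents: one abbreviation per taxonomy document name.
SHEET = """Document Name,Filename Abbreviation
Customer Feedback Form,FBK
Loan Application,APP
"""


def load_abbreviations(csv_text: str) -> dict:
    """Read the sheet into a mapping of document name to filename abbreviation."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Document Name"]: row["Filename Abbreviation"] for row in reader}


table = load_abbreviations(SHEET)
# Look up the abbreviation for the document type reported by the classification result.
print(table["Loan Application"])  # APP
```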

Do I need to change the range below, or can I use it as it is?
Range: (item.DocumentBounds.StartPage + 1).ToString + "-" + (item.DocumentBounds.StartPage + item.DocumentBounds.PageCount).ToString

It should work as-is, as long as your For Each element is item. If it’s currentItem or something else, just change item to currentItem or whatever you’ve used in your For Each.

Aside from that, the properties of the classification object will be the same, i.e. DocumentBounds.StartPage and DocumentBounds.PageCount.

Smitesh_Aher1 · August 23, 2023, 6:02pm

But @postwick, my case is different. I am not using any spreadsheet for the PDF file names.

I want to save each PDF with the person's name that appears below the header of the PDF. Is there any way to do this?

Thanks & Regards,
Smitesh
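The thread does not include an answer to this last question. One possible approach, sketched in Python purely as an illustration, is to take the line that follows the header on the first page of each split document and turn it into a unique, filesystem-safe filename. It assumes the header sits on its own line with the name on the next non-empty line; `name_below_header` and `safe_filename` are hypothetical helpers.

```python
import re
from typing import Optional, Set


def name_below_header(first_page_text: str, header: str) -> Optional[str]:
    """Return the first non-empty line after the header line, or None if not found."""
    lines = [ln.strip() for ln in first_page_text.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        if line == header and i + 1 < len(lines):
            return lines[i + 1]
    return None


def safe_filename(name: str, used: Set[str], ext: str = ".pdf") -> str:
    """Replace characters that are invalid in Windows filenames and avoid duplicates."""
    base = re.sub(r'[\\/:*?"<>|]', "_", name).strip()
    candidate = base + ext
    n = 2
    # If the same name was already used, append a counter: "Name (2).pdf", "Name (3).pdf", ...
    while candidate.lower() in used:
        candidate = f"{base} ({n}){ext}"
        n += 1
    used.add(candidate.lower())
    return candidate


used_names: Set[str] = set()
person = name_below_header("Feedback\nJohn Smith\nSome text", "Feedback")
print(safe_filename(person, used_names))  # John Smith.pdf
print(safe_filename(person, used_names))  # John Smith (2).pdf
```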
