Is it possible to split a document using the ML classifier? - Document Understanding

Hi Team,

I want to split a PDF through automation, as follows.

Scenario: One PDF contains multiple documents. I want to split out the inner documents based only on a unique header, not on page numbers, because the pages are not numbered.

For example, an 80-page PDF would be split like this:

1st to 10th page = 1st PDF (unique text present on its 1st page)
11th to 25th page = 2nd PDF (unique text present on its 1st page)
26th to 30th page = 3rd PDF (unique text present on its 1st page)
31st to 50th page = 4th PDF (unique text present on its 1st page)

The splitting needs to be automated, and each split PDF should be saved separately in one folder.

This is how I want to split the PDF. Also, please let me know if the PDF can be split in some other way.

Thanks & Regards,
Smitesh.

Sorry @Praveen_Mudhiraj, it’s a confidential document.

The PDF looks like this:

Header: Feedback

‘A’ PDF (this is one big PDF): pages 1, 2, 3, 4, 5, 6, 7, … 80 are present.

1 (header is present on the 1st page), 2, 3, 4 ------ one PDF
5 (header is present on the 5th page), 6, 7, 8, … ------ second PDF

I want to split the PDF this way.

We can split this PDF manually using the ‘Extract PDF Range’ activity, but I want to do it through automation.
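
To make the header-based rule concrete, here is a minimal Python sketch of the page-range logic only. It assumes the text of each page has already been extracted into a list (for example by OCR or a PDF text reader, which is not shown), and the function name `split_ranges_by_header` and the sample page texts are hypothetical. It is an illustration of the idea, not a UiPath workflow.

```python
from typing import List, Tuple


def split_ranges_by_header(page_texts: List[str], header: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges, one per sub-document.

    A new sub-document starts on every page whose text contains `header`.
    Pages before the first header are ignored in this sketch.
    """
    # Zero-based indexes of the pages that carry the header.
    starts = [i for i, text in enumerate(page_texts) if header in text]
    ranges = []
    for pos, start in enumerate(starts):
        # A sub-document ends on the page before the next header, or on the last page.
        if pos + 1 < len(starts):
            end = starts[pos + 1] - 1
        else:
            end = len(page_texts) - 1
        ranges.append((start + 1, end + 1))
    return ranges


# Hypothetical 8-page input: the header appears on pages 1 and 5.
pages = ["Feedback\nPerson A", "p2", "p3", "p4", "Feedback\nPerson B", "p6", "p7", "p8"]
ranges = split_ranges_by_header(pages, "Feedback")
print(ranges)  # [(1, 4), (5, 8)]
```

Each resulting pair could then be turned into a range string such as "1-4" and passed to whatever activity or library performs the actual page extraction.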

Can you try this XAML?

PDF.zip (1.8 KB)

@Smitesh_Aher1

I already tried this; it’s not working.

The classifier doesn’t split the pages itself, but its output tells you which pages belong to each document. You then use the Extract PDF Range activity to separate them into multiple documents.

Here I am looping through the classification action’s output:

TypeArgument is UiPath.DocumentProcessing.Contracts.Results.ClassificationResult

Then inside the loop you use Extract PDF Range and provide it the Range based on the classification output:

Range property: (item.DocumentBounds.StartPage + 1).ToString + "-" + (item.DocumentBounds.StartPage + item.DocumentBounds.PageCount).ToString
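
The arithmetic in that expression can be checked with a small Python sketch. It assumes, as the `+ 1` suggests, that StartPage is zero-based while the range string is one-based; the `DocumentBounds` class here is a hypothetical stand-in for the real classification result object, not the UiPath type.

```python
from dataclasses import dataclass


@dataclass
class DocumentBounds:
    """Hypothetical stand-in for the bounds of one classified sub-document."""
    start_page: int   # assumed zero-based index of the first page
    page_count: int   # number of pages in the sub-document


def to_range(bounds: DocumentBounds) -> str:
    """Build a one-based, inclusive "first-last" page range string."""
    first = bounds.start_page + 1
    last = bounds.start_page + bounds.page_count
    return f"{first}-{last}"


print(to_range(DocumentBounds(start_page=0, page_count=4)))  # 1-4
print(to_range(DocumentBounds(start_page=4, page_count=4)))  # 5-8
```

A single-page sub-document yields a range like "11-11" under this formula.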

Thanks @postwick for providing the solution.

If possible, could you please share the workflow with me for better understanding?

Thanks & Regards,
Smitesh

I cannot send you our workflow due to compliance issues.

What more do you need to know?

Okay @postwick NP.

I don’t understand these:
1) validatedClassificationResults

2) File Name –> I don’t understand this. (Is it the PDF path?)

3) Output file name –> I don’t understand this.

Please let me know about these.
I am a beginner in UiPath.

validatedClassificationResults comes from the Wait for Document Classification Action (Action Center).
File Name is the filename of the original PDF file that you need to split up.

Output file name is the filename of the document (only certain pages) that is extracted from the original PDF. We are using taxonomy information from the classification steps, but you can formulate your own new filename any way you want.

The expression in Extract PDF Range is:
Path.Combine(loanPackageFolder,itemFilename)

This is how itemFilename is built:
Now.ToString("yyyy-MM-dd") + " " + FormTaskObj.GetDataJsonObject("Application").ToString + " " + FormTaskObj.GetDataJsonObject("loanNumber").ToString + " " + itemFilenameAbbreviation + Path.GetExtension(sourceFileInfo.FullName)

FormTaskObj comes from the Wait for Form Task and Resume activity.

A high level overview of our process:

  • Collect PDF file from a folder
  • Create Form Action in Action Center - a user then enters information like Customer Name, Customer Number, Loan Number, etc. from the document
  • Create Document Classification Action in Action Center - here the user splits up the original document and classifies each sub-document using the taxonomy
  • Split the original document and save as individual documents per the output of the classification action

Thank You @postwick .

I’ll work on this solution and will update you if it’s working successfully.

Thanks & Regards,
Smitesh.

Good Morning @postwick

The expression in Extract PDF Range is:
Path.Combine(loanPackageFolder,itemFilename) ------> I don’t understand loanPackageFolder. (Does loanPackageFolder mean my PDF?)

I didn’t understand the following points. Can you please elaborate on them?

Now.ToString("yyyy-MM-dd")

FormTaskObj.GetDataJsonObject("Application").ToString

FormTaskObj.GetDataJsonObject("loanNumber").ToString

itemFilenameAbbreviation

Path.GetExtension(sourceFileInfo.FullName)

loanPackageFolder is just a variable containing the path to where we want to save the new PDF files.

Now.ToString("yyyy-MM-dd") just gives the current date in yyyy-MM-dd format.

FormTaskObj is the output object from the Form Action where the user types in things like Customer Name, Loan Number, Application Number, etc. We use these things to name the new files.

itemFilenameAbbreviation: we have a spreadsheet that contains abbreviations for the taxonomy Document Names, because the actual data from the taxonomy is too long for filenames.

Path.GetExtension(sourceFileInfo.FullName) gets the extension of the original file. I get sourceFileInfo with the Get File Info activity.

Thanks @postwick

Now I get it. But I think the ‘Wait for Form Task and Resume’ activity isn’t needed for splitting the PDF.

All I want is to split the PDF and store the parts in a folder one by one. So is it needed for this?

What is your suggestion?

Correct, the Form Task is just our step to have them enter general information about the documents in the PDF, information that applies to all the individual documents.

Cheers @postwick .

So itemFilename will look like this:

itemFilename = Now.ToString("yyyy-MM-dd") + " " + itemFilenameAbbreviation + Path.GetExtension(sourceFileInfo.FullName)

I think the ‘abbreviations’ aren’t needed for splitting.
Am I right, @postwick?

Correct, but of course you need to create a unique filename for each split PDF.
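
As a rough illustration of one way to keep the names unique without a spreadsheet, the Python sketch below combines the date, a running index, and the page range. The naming pattern and the function name `split_filename` are assumptions made for this example, not the scheme used in the workflow described above.

```python
from datetime import date
from pathlib import Path


def split_filename(source_pdf: str, index: int, first_page: int,
                   last_page: int, today: date) -> str:
    """Build a filename that differs for every sub-document of one source PDF."""
    # Reuse the extension of the original file, e.g. ".pdf".
    ext = Path(source_pdf).suffix
    # The index and the page range both change per split, so names cannot collide.
    return f"{today:%Y-%m-%d} part{index:02d} p{first_page}-{last_page}{ext}"


name = split_filename("A.pdf", 1, 1, 4, date(2024, 1, 15))
print(name)  # 2024-01-15 part01 p1-4.pdf
```

In a workflow, the index would simply be the loop counter of the For Each over the classification results.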

Yes, right, I need a unique name for each file.

Abbreviations: @postwick, I actually have no idea about this. Can you please explain how it works?

Does the range below need to change, or can I use it as is?
Range: (item.DocumentBounds.StartPage + 1).ToString + "-" + (item.DocumentBounds.StartPage + item.DocumentBounds.PageCount).ToString

Please help me with these 2 points. It would be a big help for me.

Thanks & Regards,
Smitesh

We have a spreadsheet where we define the abbreviation for each taxonomy item.

I use Read Range to load that into a DataTable, and then, based on the classification result, use Lookup Data Table to get the Filename Abbreviation to use for the file.
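
The lookup step amounts to finding the row whose document name matches the classification result and reading the abbreviation from it. A minimal Python sketch follows, with hypothetical column names and sample rows standing in for the spreadsheet:

```python
from typing import Dict, List


def lookup_abbreviation(rows: List[Dict[str, str]], document_name: str,
                        key_column: str = "Document Name",
                        value_column: str = "Filename Abbreviation",
                        default: str = "") -> str:
    """Return the abbreviation for `document_name`, or `default` if no row matches."""
    for row in rows:
        if row.get(key_column) == document_name:
            return row.get(value_column, default)
    return default


# Hypothetical rows, as they might look after reading the spreadsheet.
rows = [
    {"Document Name": "Uniform Residential Loan Application", "Filename Abbreviation": "URLA"},
    {"Document Name": "Credit Report", "Filename Abbreviation": "CR"},
]
print(lookup_abbreviation(rows, "Credit Report"))  # CR
```

Returning a default for unknown document names avoids a failure when the classifier reports a type that is missing from the spreadsheet.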

Does the range below need to change, or can I use it as is?
Range: (item.DocumentBounds.StartPage + 1).ToString + "-" + (item.DocumentBounds.StartPage + item.DocumentBounds.PageCount).ToString

It should work as-is, as long as your For Each element is item. If it’s currentItem or something else, just change item to currentItem or whatever you’ve used in your For Each.

But aside from that, the properties of the classification object will be the same, i.e. DocumentBounds.StartPage and DocumentBounds.PageCount.

But @postwick, my case is different: I am not using any spreadsheet for the PDF file names.

I want to save each PDF under the person’s name that appears below the header of the PDF. Is there any way to do this?

Thanks & Regards,
Smitesh