Join/Split PDF based on Size

@Anshuman_Malik sorry, it was making my head hurt and I didn’t get around to it. I thought there might be a fast way to use GroupBy(), but I couldn’t wrap my head around it. So instead I took an easier approach: loop over the files, add them to a group while their combined size stays within the limit, then remove those files from the initial array and repeat.

pseudocode:

Assign files array from directory
Initialize List of files array to add each group to

While files.Count > 0
  Initialize fileGroup
  Initialize sizeSum

  For each f In files array
    If sizeSum + New FileInfo(f).Length <= sizeLimit
      Assign sizeSum = sizeSum + New FileInfo(f).Length
      Assign fileGroup = fileGroup.Concat({f})

  Assign files = files.Where(Function(f) Not fileGroup.Contains(f) ).ToArray
  Add fileGroup to Collection //adds the group to the fileGroups List
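
For anyone who wants to try the grouping logic outside of Studio, here is a minimal sketch of the same loop in Python. It is not the attached workflow: the function name `group_by_size`, the `(name, size)` pairs, and the sample sizes are made up for illustration, and file names are assumed to be unique. Files larger than the limit are dropped up front, since otherwise the loop would never finish.

```python
def group_by_size(files, size_limit):
    """Greedily pack (name, size) pairs into groups whose total size stays within size_limit."""
    # Files bigger than the limit can never fit in any group, so filter them out first.
    remaining = [(name, size) for name, size in files if size <= size_limit]
    groups = []
    while remaining:
        group, size_sum = [], 0
        # Walk the remaining files in order and take each one that still fits.
        for name, size in remaining:
            if size_sum + size <= size_limit:
                size_sum += size
                group.append(name)
        # Remove the grouped files from the working list, then repeat.
        taken = set(group)
        remaining = [(name, size) for name, size in remaining if name not in taken]
        groups.append(group)
    return groups


# Hypothetical sample data: sizes in MB, limit of 10 MB per group.
files = [("a.pdf", 4), ("b.pdf", 5), ("c.pdf", 3), ("d.pdf", 6), ("e.pdf", 2)]
groups = group_by_size(files, 10)
print(groups)
```

Each pass fills one group from whatever is left, so the result depends on the order of the input list.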

FileGroupsBySize.xaml (13.8 KB)
Here you go as an example. You can also turn this into a reusable workflow if you want by changing the List variable to an ‘out’ argument, and the directory, pattern, and sizeLimit to ‘in’ arguments.

I placed a Write Line to output the results of the fileGroups List, so you can see how it joined the files together. You can remove the Write Line, as it’s only temporary.

With the List (named fileGroups in this example), you can use it in a ForEach loop to upload into a system, add as attachments, or zip or join the PDFs and then upload… so, it’s up to you.

Regards.


Hey @ClaytonM, you are amazing. Thank you so much. I just have one more question. Your script picks files by file name, not by file size, when grouping them.

1st image is the output I get running your script (the grouping follows filename order):
image

2nd image is how I wanted it. Basically, one of the goals is to reduce the number of files and optimize the files.

If this is not doable or too much of an ask I can totally understand. I almost banged my head on the wall at this logic since I am still very new to coding and UiPath.

Yeah, it is pulling in the filenames by alphabetical order. If you want to perform it by filesize order, then you just need to adjust that Assign activity at the beginning that stores the files into an array.

Instead of this:

files = Directory.GetFiles(dir, filePattern).Where(Function(f) New FileInfo(f).Length <= sizeLimit).ToArray

Use this:

files = Directory.GetFiles(dir, filePattern).Where(Function(f) New FileInfo(f).Length <= sizeLimit).OrderBy(Function(f) New FileInfo(f).Length ).ToArray
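
To make the difference concrete, here is a small Python sketch of the same idea: keep only the files that fit within the limit, then order them by size before grouping. The function name `eligible_sorted_by_size` and the sample sizes are hypothetical; in the workflow the sizes come from FileInfo(f).Length.

```python
def eligible_sorted_by_size(files, size_limit):
    """Keep (name, size) pairs that fit within size_limit, ordered from smallest to largest."""
    eligible = [(name, size) for name, size in files if size <= size_limit]
    # sorted() is stable, so files of equal size keep their original relative order.
    return sorted(eligible, key=lambda item: item[1])


# Hypothetical sample data: sizes in MB, limit of 10 MB.
files = [("a.pdf", 7), ("b.pdf", 12), ("c.pdf", 2), ("d.pdf", 5)]
ordered = eligible_sorted_by_size(files, 10)
print(ordered)
```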

Hope that works.

Regards

Hi @ClaytonM, thank you, that is working.

However, when I use the Join PDF activity it gives me errors. See the attached workflow.
FileGroupsBySize.xaml (18.9 KB)

Hi.

The ForEach needs to be of type String[], since each item in fileGroups is an array of strings. You can technically use Object, as long as you convert each item as you use it, but it is simpler to change the TypeArgument from Object to Array(Of String).

Secondly, you need to use the item in the ForEach within the Join PDF activity.

So just put item in the FileList property, which represents each string array in fileGroups.

And, that goes for both Join PDF activities.

Hope that helps.

Regards.

@ClaytonM: Thank you, it works for the first group, where I have multiple files to join (highlighted).

It fails on the 2nd and 3rd ones, since the join activity needs at least 2 files. I guess my question is: how can I count the files inside each “item” in the list, so I can use it as a condition while running my script?

just put an IF right before the join activity and check item.Length > 1…


I’m not sure item.Length will work. I normally use .Count for arrays.

IF: item.Count > 1

You might also want to rename it when there’s only one file, which I think you can use the Move File activity for.

length will be fine when it is a one-dimensional array…
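
As a rough illustration of that condition (in Python rather than a Studio workflow), the sketch below only marks a group for joining when it holds more than one file and marks single-file groups for a move/rename instead. The function name `plan_actions` and the "join"/"move" labels are made up for this example; they stand in for the Join PDF and Move File activities.

```python
def plan_actions(file_groups):
    """Pair each group with "join" if it has several files, or "move" if it has only one."""
    actions = []
    for group in file_groups:
        if len(group) > 1:
            # Two or more files: safe to hand over to a join step.
            actions.append(("join", group))
        else:
            # A single file cannot be joined, so it is just moved or renamed.
            actions.append(("move", group))
    return actions


# Hypothetical groups, shaped like the fileGroups list from the workflow.
file_groups = [["a.pdf", "b.pdf"], ["c.pdf"], ["d.pdf"]]
actions = plan_actions(file_groups)
print(actions)
```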


@ClaytonM @bcorrea Thank you both.

Both give me the same result. Thank you so much both. @ClaytonM if you are ever in Toronto drinks are on me :slight_smile:


Hey @Anshuman_Malik,

Even if you found a solution by now, as I started working on mine earlier, I will leave it here for you as another variant. :slight_smile:
PDFsMergeToSizeLimit.zip (24.1 KB)
You will need to add the pdfs in PDF\ folder prior to running.

This uses the subset sum approach -

  1. computes each subset of the array of pdf file sizes
    example: for an array = {1,2,3,4} => subsets: {1},{2},{3},{4},{1,2},{1,3},{1,4},{2,3},{2,4},{3,4},{1,2,3},{1,2,4},{1,3,4},{2,3,4},{1,2,3,4}

  2. computes the sum of each subset and chooses the subsets that come closest to the set limit without exceeding it. Each element of the array appears in exactly one of the selected subsets.

example: for array = {1,2,3,4,5,6} and limit = 7 =>

  • {1,2,4} sum: 7
  • {6} sum: 6
  • {5} sum: 5
  • {3} sum: 3

In this way, joining the pdfs is done in the most efficient way, getting closest to the limit of 10 MB for each output file and minimizing the number of joins. This might come in handy, especially if the number of pdfs that you have to process is quite large.
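
Here is a minimal Python sketch of a subset sum style grouping, for readers who want to experiment with the idea. It is not the code from the attached zip: the function names `best_subset` and `group_by_subset_sum` are made up, it works on plain sizes instead of files, and ties between equally good subsets go to the first one found (an assumption). It checks every subset by brute force, so the work grows exponentially with the number of files, and repeating the best pick is greedy, so the minimum number of groups is not guaranteed.

```python
from itertools import combinations


def best_subset(sizes, limit):
    """Return the indices of the subset whose total is closest to limit without exceeding it."""
    best, best_sum = (), -1
    indices = range(len(sizes))
    for r in range(1, len(sizes) + 1):
        for combo in combinations(indices, r):
            total = sum(sizes[i] for i in combo)
            # Strict ">" keeps the first subset found when several reach the same total.
            if best_sum < total <= limit:
                best, best_sum = combo, total
    return best


def group_by_subset_sum(sizes, limit):
    """Repeatedly pull out the best-fitting subset until no sizes are left."""
    # Sizes above the limit can never be placed, so they are skipped.
    remaining = [s for s in sizes if s <= limit]
    groups = []
    while remaining:
        chosen = set(best_subset(remaining, limit))
        groups.append([s for i, s in enumerate(remaining) if i in chosen])
        remaining = [s for i, s in enumerate(remaining) if i not in chosen]
    return groups


groups = group_by_subset_sum([1, 2, 3, 4, 5, 6], 7)
print(groups)
```

With this tie-breaking rule, smaller subsets win when several hit the limit exactly, which here leaves more room to pair up the remaining sizes.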

Give it a try ~!


Thanks for posting that, cause I was interested in the “best” sum algorithm too. You also have a good example to store both file and size into a dictionary, which I liked. :man_cartwheeling:


@ClaytonM
files = Directory.GetFiles(dir, filePattern).Where(Function(f) New FileInfo(f).Length <= sizeLimit).OrderBy(Function(f) New FileInfo(f).Length ).ToArray

You have assigned filePattern as “.pdf”.
How can I modify this so it picks up anything except pdf?

Using the pattern you can’t use negation logic, but since you are using LINQ, just add it to your Where statement like:
files = Directory.GetFiles(dir, filePattern).Where(Function(f) New FileInfo(f).Length <= sizeLimit And Not f.Contains(".pdf")).OrderBy(Function(f) New FileInfo(f).Length ).ToArray


Change filePattern to "*.*" to include all files, then refer to @bcorrea’s suggestion. However, I would make one change to his example to look at the extension (technically a file could be “filename.pdf.xlsx” and contain .pdf, so it’s better to look at extension)

files = Directory.GetFiles(dir, "*.*").Where(Function(f) New FileInfo(f).Length <= sizeLimit And Not Path.GetExtension(f).ToUpper.Equals(".PDF")).OrderBy(Function(f) New FileInfo(f).Length ).ToArray
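
To show why checking the extension is safer than a Contains check, here is a small Python sketch of the same filter. The function name `exclude_pdfs` and the sample file names are hypothetical; the comparison is done on the lower-cased extension so that “.PDF” and “.pdf” are treated alike.

```python
import os


def exclude_pdfs(paths):
    """Keep every path whose final extension is not .pdf (case-insensitive)."""
    return [p for p in paths if os.path.splitext(p)[1].lower() != ".pdf"]


# "filename.pdf.xlsx" contains ".pdf" but its real extension is ".xlsx", so it is kept.
paths = ["report.pdf", "SCAN.PDF", "filename.pdf.xlsx", "notes.txt"]
kept = exclude_pdfs(paths)
print(kept)
```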

There could also be a GetFiles pattern to say “not .pdf”, however I don’t know enough about the .NET GetFiles pattern syntax to say for sure and would need to research it, which I don’t have time for :smiley:

Regards.


Hi @ClaytonM and @bcorrea, thank you for that. However, when I run both of your queries I get this error:

Assign files: Value cannot be null.
Parameter name: path

Below is the list of variables I am using, for reference:

Right now I have added a dummy (static) directory path; ideally this would flow through a variable.

your dir must be empty therefore the error…

@bcorrea if you see my variable image I have it assigned over there

no, your image is not showing dir variable…

Got it. I had moved the folder itself, hence the issue.