How to split and extract from a huge PDF file


#1

Hi Experts,

I would like to ask for help automating the Split function of Acrobat Pro XI in UiPath. I need to split a huge PDF file (around 7000 pages) into individual PDF files based on tracking number. I am thinking of first splitting the huge PDF into ranges of maybe 100 pages, then looping through the pages of each split PDF to extract the pages for each tracking number (which may span 2 or more pages), and saving the extracted pages to a new folder.
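To make the grouping step concrete, here is a minimal Python sketch (not a UiPath workflow). It assumes the text of every page is already available as a list of strings and that the tracking number follows a label such as "Tracking No:"; both the label and the number format are hypothetical and need adjusting to the real documents. Pages without a tracking number are treated as continuations of the previous one, which is also an assumption.

```python
import re

# Hypothetical tracking-number format; adjust to the real documents.
TRACKING_RE = re.compile(r"Tracking No[:.]?\s*([A-Z0-9-]+)")


def group_pages_by_tracking(page_texts):
    """Map each tracking number to the 0-based indexes of its pages."""
    groups = {}
    current = None
    for index, text in enumerate(page_texts):
        match = TRACKING_RE.search(text)
        if match:
            current = match.group(1)
        if current is None:
            continue  # pages before the first tracking number are skipped
        groups.setdefault(current, []).append(index)
    return groups


# Made-up page texts, only to show the shape of the result.
pages = [
    "Tracking No: TN-001\nitem A",
    "Tracking No: TN-001\nitem B",
    "Tracking No: TN-002\nitem C",
]
page_groups = group_pages_by_tracking(pages)
print(page_groups)
```

Because the grouping is done in a single pass over the pages, a sketch like this would not need the intermediate split into 100-page files, although that split may still help if the tool reading the PDF struggles with 7000 pages at once.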

I have already tried sending the hotkeys ALT+VTP and the arrow-down key to access the Split function, but it does not seem to work. I am still new to UiPath and not sure how to do this. Looking forward to hearing from you. Thanks.

Cheers,
JPOkawa


#2

Hi,

This is just my quick thinking.
If the Read PDF Text activity takes too long or crashes (as I think it will), you can open the file and perform a Select All, which is Ctrl+A followed by Ctrl+Shift+End. Then, do a Ctrl+C to copy all the text to the clipboard. This can be tricky, however, because you need to verify that all the text has been placed in the clipboard or you will miss some information.

Once you have the text stored in a string variable, you can use .Split() to create an array, splitting on your tracking number or a keyword that identifies where each part starts. It might also require some additional massaging of the text, in which case I would recommend LINQ. For example, if you wanted to format it as a comma-delimited CSV file, you could do that.
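As a rough illustration of the split-then-reshape idea, here is a small Python sketch rather than the .Split()/LINQ version. The "Tracking No: " label and the sample text are hypothetical; the split uses a zero-width lookahead so the label stays at the start of each part (this needs Python 3.7 or later).

```python
import csv
import io
import re

# Hypothetical marker; the real tracking-number label may differ.
MARKER = re.compile(r"(?=Tracking No: )")


def split_on_marker(full_text):
    """Split just before each marker; empty pieces are dropped.

    Text before the first marker, if any, becomes its own chunk.
    """
    return [chunk.strip() for chunk in MARKER.split(full_text) if chunk.strip()]


def to_csv(chunks):
    """Build one CSV row per chunk: tracking number, then the remaining lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for chunk in chunks:
        lines = chunk.splitlines()
        tracking = lines[0].replace("Tracking No: ", "")
        writer.writerow([tracking] + lines[1:])
    return buffer.getvalue()


# Made-up text standing in for the contents of the string variable.
text = "Tracking No: TN-001\nitem A\nTracking No: TN-002\nitem B\nitem C\n"
chunks = split_on_marker(text)
csv_text = to_csv(chunks)
print(csv_text)
```

The csv module is used instead of joining with commas by hand so that values containing commas or quotes would still be escaped correctly.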

With your array, you should be able to run it through a For Each and write each part to a separate file.
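The loop-and-write step could look roughly like this in Python. The file naming (one .txt file per tracking number) and the sample parts are assumptions for illustration, not something from the original workflow.

```python
import tempfile
from pathlib import Path


def write_parts(parts, out_dir):
    """Write each (name, text) pair to out_dir/<name>.txt and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, text in parts:
        path = out_dir / f"{name}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written


# Made-up parts; in practice these would come from the split array.
parts = [("TN-001", "item A"), ("TN-002", "item B\nitem C")]
out_dir = tempfile.mkdtemp()  # temporary folder, only for this demo
paths = write_parts(parts, out_dir)
print([p.name for p in paths])
```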

I hope this helps spark some ideas.

Regards.


#3

Hi @ClaytonM,

Thanks for sharing your thoughts. Actually, we just want to split the huge PDF into individual PDF files based on tracking number, which is why we are thinking of automating Acrobat Pro XI. Each page has a tracking number and information in a tabular format, which we also want to keep as is.
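For reference, copying whole pages into new PDFs keeps the page layout, including tables, untouched. Below is a minimal Python sketch of that idea using the third-party pypdf library instead of Acrobat, so it is a swapped-in technique and not the Acrobat Split function. It assumes the page indexes for each tracking number are already known; the `page_groups` mapping, the file names, and the `safe_filename` helper are hypothetical.

```python
import re
from pathlib import Path


def safe_filename(tracking):
    """Replace characters that are risky in file names with underscores."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", tracking)


def write_pdf_per_tracking(source_pdf, page_groups, out_dir):
    """Copy the pages of each tracking number into its own PDF file."""
    # pypdf is a third-party library (pip install pypdf). It is imported
    # here so that safe_filename can be used without it installed.
    from pypdf import PdfReader, PdfWriter

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    reader = PdfReader(source_pdf)
    for tracking, indexes in page_groups.items():
        writer = PdfWriter()
        for index in indexes:
            writer.add_page(reader.pages[index])  # page copied as is
        with open(out_dir / f"{safe_filename(tracking)}.pdf", "wb") as handle:
            writer.write(handle)


# Hypothetical call; "huge.pdf" and the page indexes are placeholders.
# write_pdf_per_tracking("huge.pdf", {"TN-001": [0, 1], "TN-002": [2]}, "out")
print(safe_filename("TN/001:A"))
```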

Regarding LINQ, I don’t have any experience with it. Would you please provide a sample workflow making use of LINQ that I can use for this purpose?

Would you also kindly share a sample XAML using the approach you mentioned? Thanks in advance!

Regards,
JPOkawa


#4

Hi @JPOkawa,

As you mentioned, you want to split the PDF based on tracking ID, and each page is in tabular format.

  1. You can make use of data scraping if you want to extract the data; see the thread below:
    Extract table data from multiple pages of pdf

  2. If it is in a proper tabular format and has the same delimiter throughout, you can try screen scraping and generate the data table.

  3. Read PDF -> store it in an output variable -> split using a string as the delimiter (in your case, the tracking ID, as you mentioned) -> analyze the pattern to split further -> process each item page by page
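Option 3 can be sketched in Python roughly as follows. This assumes the PDF text has already been read into a string, that each item starts with a "Tracking ID:" label, and that table cells are separated by "|"; the label, the delimiter, and the sample text are all hypothetical.

```python
def parse_items(full_text, marker="Tracking ID:", delimiter="|"):
    """Split the text on the marker, then split each item's lines into cells."""
    items = {}
    # Everything before the first marker is dropped by the [1:] slice.
    for chunk in full_text.split(marker)[1:]:
        lines = [line for line in chunk.splitlines() if line.strip()]
        if not lines:
            continue
        tracking_id = lines[0].strip()
        rows = [[cell.strip() for cell in line.split(delimiter)]
                for line in lines[1:]]
        # Rows of a repeated tracking ID are appended to the same item.
        items.setdefault(tracking_id, []).extend(rows)
    return items


# Made-up text standing in for the Read PDF output variable.
text = (
    "Tracking ID: A1\nItem | Qty\nWidget | 2\n"
    "Tracking ID: B2\nItem | Qty\nGadget | 5\n"
)
items = parse_items(text)
for tracking_id, rows in items.items():
    print(tracking_id, rows)
```

Keeping the result as a mapping from tracking ID to rows makes the later "process each item" step a plain loop, and rows from a tracking ID that spans several pages end up together.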

Thanks
Girish