Extract separate chapters from PDF

pdf


Hello everyone!

I need to make tables of contents for different PDF books in webpage format.
The idea is that when I click on a specific chapter on the webpage, the corresponding chapter opens as a PDF.

I have already found a fairly simple and fast way to make this webpage, but extracting all these chapters from a PDF in Adobe Acrobat is quite time-consuming, so I think I can automate the extraction in UiPath.

This is a link to one of the books that I want to cut into pieces, so that each chapter entry on the webpage can link to the corresponding chapter.

In this example there is no word indicating a chapter, but in most other cases the table of contents says, for example, “Chapter 8 Outcomes of the Quality Pilot Study” instead of just “8. Outcomes of the Quality Pilot Study”, or it uses the Dutch word for chapter, as in “Hoofdstuk 8”. Also, the actual chapters will be put in a different directory from all parts that are not chapters, such as “Executive Summary and Recommendations” in this example.
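The split between chapters and non-chapter parts could be decided from the entry title alone. Below is a minimal Python sketch of that idea, assuming a chapter title starts with a number, “Chapter” or “Hoofdstuk”. The function name `output_path` and the directory names `chapters` and `other` are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: a chapter title starts with an optional "Chapter"/"Hoofdstuk",
# followed by a number, e.g. "Chapter 8 ...", "Hoofdstuk 8 ..." or "8. ...".
CHAPTER_RE = re.compile(r"^\s*(?:(?:Chapter|Hoofdstuk)\s+)?\d+\.?\s", re.IGNORECASE)


def output_path(title, book_dir):
    """Build the target file path for one extracted part of the book."""
    # Hypothetical directory names: "chapters" for chapters, "other" for the rest.
    subdir = "chapters" if CHAPTER_RE.match(title) else "other"
    # Keep only letters, digits, underscores, hyphens and spaces in the file name.
    safe = re.sub(r"[^\w\- ]", "", title).strip().replace(" ", "_")
    return Path(book_dir) / subdir / f"{safe}.pdf"
```

For instance, `output_path("Executive Summary and Recommendations", "book")` ends up under `book/other`, while a title starting with “Chapter 8” ends up under `book/chapters`.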

I have a rough idea of how to make such a robot. Inside a “For Each” activity, I will let UiPath look for each line on the webpage that starts with a number or “Chapter” (or “Hoofdstuk” if the book is in Dutch), read the page number at the end of that line, and compare it with the page number of the next matching line. That gives me the first and last page number of each individual chapter. I will copy these numbers to Adobe Acrobat, as you can see in the picture, so I can extract the pages.
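To make the logic of this loop concrete, here is a minimal sketch in Python rather than as a UiPath workflow. It assumes each table-of-contents line has the shape described above and ends in a page number. The names `chapter_ranges`, `last_page` and `page_offset` are hypothetical. `page_offset` is there because the printed page numbers in a table of contents often differ from the page positions inside the PDF file, and `last_page` is assumed to already be a position in the PDF.

```python
import re

# Assumed entry shape: optional "Chapter"/"Hoofdstuk", a chapter number,
# a title, optional dot leaders, and the page number at the end of the line.
CHAPTER_RE = re.compile(
    r"^\s*(?:(?:Chapter|Hoofdstuk)\s+)?(\d+)\.?\s+(.+?)[\s.]+(\d+)\s*$",
    re.IGNORECASE,
)


def chapter_ranges(toc_lines, last_page, page_offset=0):
    """Return (number, title, first_page, last_page) for each chapter line."""
    starts = []
    for line in toc_lines:
        match = CHAPTER_RE.match(line)
        if match:  # lines that do not look like a chapter are skipped
            number, title, page = match.groups()
            starts.append((int(number), title, int(page) + page_offset))

    ranges = []
    for i, (number, title, first) in enumerate(starts):
        if i + 1 < len(starts):
            last = starts[i + 1][2] - 1  # page before the next chapter starts
        else:
            last = last_page             # last chapter runs to the end of the book
        ranges.append((number, title, first, last))
    return ranges
```

Each resulting tuple holds the two numbers that would be typed into Acrobat's page extraction dialog. Numbered subsections such as “1.1 Background” do not match the pattern and are ignored.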

This method is problematic, however, when the last chapter doesn’t end on the last page of the book: if an index or bibliography follows the last chapter, the extracted part will contain more than just the chapter, and I will have to remove the index or bibliography manually. I think the robot will still save me some time, even if I have to do this by hand. It could also be a problem when a chapter doesn’t start on a new page but shares a page with the previous chapter, although I think in the majority of the books each chapter starts on a new page.
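One possible way around both problems, sketched in Python, is to treat every top-level table-of-contents entry as a boundary, not only the chapters. This only helps if the index or bibliography is itself listed in the table of contents, which is an assumption. The names `split_entries` and `shared_pages` are hypothetical. Setting `shared_pages=True` keeps the page on which the next entry starts, for books where chapters do not begin on a new page.

```python
import re

# Any line that ends in a page number is treated as an entry (assumption).
ENTRY_RE = re.compile(r"^\s*(.+?)[\s.]+(\d+)\s*$")
# Numbered subsections such as "1.1 Background" are skipped, not used as boundaries.
SUBSECTION_RE = re.compile(r"^\s*\d+\.\d+")
CHAPTER_RE = re.compile(r"^(?:(?:Chapter|Hoofdstuk)\s+)?\d+\.?\s", re.IGNORECASE)


def split_entries(toc_lines, last_page, shared_pages=False):
    """Return (kind, title, first_page, last_page) for every top-level entry."""
    entries = []
    for line in toc_lines:
        match = ENTRY_RE.match(line)
        if match and not SUBSECTION_RE.match(line):
            entries.append((match.group(1), int(match.group(2))))

    result = []
    for i, (title, first) in enumerate(entries):
        if i + 1 < len(entries):
            next_first = entries[i + 1][1]
            # With shared pages, the next entry's first page is included as well.
            last = next_first if shared_pages else next_first - 1
            last = max(last, first)  # never end before the entry starts
        else:
            last = last_page
        kind = "chapter" if CHAPTER_RE.match(title) else "other"
        result.append((kind, title, first, last))
    return result
```

With a “Bibliography” line in the input, the last chapter stops on the page before the bibliography instead of running to the end of the book.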

As a rookie, I would like to ask how I can read the webpage. In other words, I would like to know if you have ideas about how UiPath can work out which pages need to be selected, and how to repeat this process for each individual chapter.

I would be very happy if you could share ideas. It is also fine if you know another approach to extract the chapters from the PDFs.