Extract separate chapters from PDF

trabart · July 4, 2017, 12:47pm

Hello everyone!

I need to make tables of contents for different PDF books in webpage format.
The idea is that I can click on a specific chapter on the webpage, and the corresponding chapter will open as a PDF.

I have already found a fairly simple and fast way to make this webpage, but extracting all these chapters from a PDF in Adobe Acrobat is quite time-consuming, so I think I can automate the extraction in UiPath.

This is a link to one of the books that I want to cut into pieces, so that each chapter entry on the webpage can link to the corresponding chapter.
https://www.rechtspraak.nl/SiteCollectionDocuments/ENCJ-rapport-Independence-Accountability-and-Quality-of-the-Judiciary.pdf

In this example there is no word indicating a chapter, but in most other cases the table of contents says, for example, “Chapter 8 Outcomes of the Quality Pilot Study” instead of just “8. Outcomes of the Quality Pilot Study”, or uses the Dutch word for chapter, as in “Hoofdstuk 8”. Also, the actual chapters will be put in a different directory from all parts that are not chapters, such as “Executive Summary and Recommendations” in this example.

I have a rough idea of how to build such a robot. Inside a “For Each” activity, I will let UiPath look for each line in the webpage that starts with a number or “Chapter” (or “Hoofdstuk” if the book is in Dutch), read the page number at the end of that line, and compare it with the page number at the end of the next matching line. That gives me the first and last page of each individual chapter. I will copy these numbers into Adobe Acrobat, as you can see in the picture, so I can extract the pages.
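To make the logic concrete, here is a minimal sketch in Python rather than UiPath, purely to illustrate the idea. It assumes each table-of-contents line looks like “Chapter 8 Title 95”, “8. Title .... 95” or “Hoofdstuk 8 Title 95”. The function name `chapter_page_ranges` and the `last_page` argument are made up for this example, and the page numbers are the ones printed in the table of contents, which may differ from the PDF's own page count if the book has unnumbered front matter.

```python
import re

# Assumed line shape: optional "Chapter"/"Hoofdstuk", a chapter number,
# a title, optional dot leaders, and a page number at the end of the line.
TOC_LINE = re.compile(
    r"^\s*(?:(?:Chapter|Hoofdstuk)\s+)?(\d+)\.?\s+(.+?)[\s.]*(\d+)\s*$",
    re.IGNORECASE,
)


def chapter_page_ranges(toc_lines, last_page):
    """Return (number, title, first_page, last_page) for each chapter line."""
    chapters = []
    for line in toc_lines:
        match = TOC_LINE.match(line)
        if match:  # lines without a leading number or keyword are skipped
            number, title, page = match.groups()
            chapters.append((int(number), title.strip(), int(page)))

    ranges = []
    for i, (number, title, start) in enumerate(chapters):
        # A chapter ends one page before the next chapter starts; the last
        # chapter is assumed to run to the end of the book.
        if i + 1 < len(chapters):
            end = chapters[i + 1][2] - 1
        else:
            end = last_page
        ranges.append((number, title, start, end))
    return ranges
```

Each resulting tuple holds the first and last page that would be typed into the page-extraction dialog for that chapter.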

This method is problematic, however, when the last chapter doesn’t end on the last page of the book: if a registry or bibliography follows the last chapter, the extracted part will contain more than just the chapter, and I will have to remove the registry or bibliography manually. I still think the robot will save me some time, even if I have to do this by hand. It could also be a problem when a chapter doesn’t start on a new page but shares a page with the previous chapter, although I think in the majority of the books each chapter starts on a new page.
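One possible way around both problems, sketched here in Python only to show the logic: if the registry or bibliography is itself listed in the table of contents, the end of a chapter can be taken from the next entry of any kind, not just the next chapter. The function name `ranges_bounded_by_next_entry`, the `(title, start_page, is_chapter)` tuple layout and the `share_pages` flag are all assumptions made up for this example.

```python
def ranges_bounded_by_next_entry(entries, last_page, share_pages=False):
    """Compute (title, first_page, last_page) for every chapter entry.

    entries: list of (title, start_page, is_chapter) in table-of-contents
    order, including non-chapter parts such as a bibliography.
    """
    ranges = []
    for i, (title, start, is_chapter) in enumerate(entries):
        if not is_chapter:
            continue  # non-chapters only serve as boundaries here
        if i + 1 < len(entries):
            next_start = entries[i + 1][1]
            # If a chapter may end on the page where the next part begins,
            # keep that shared page as well.
            end = next_start if share_pages else next_start - 1
            end = max(end, start)  # two entries starting on the same page
        else:
            end = last_page  # nothing listed after this chapter
        ranges.append((title, start, end))
    return ranges
```

With `share_pages=True` the extracted chapter may carry the opening lines of the next part, so this trades a missing page for a little extra content. If the bibliography is not listed in the table of contents, the last chapter still needs a manual check.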

As a rookie, I would like to ask how I can read the webpage. In other words, I would like to know if you have ideas about how UiPath can determine which pages need to be selected, and how to repeat this process for each individual chapter.

I would be very happy if you could share ideas. It is also fine if you know another approach to extract the chapters from the PDFs.
