PDF automation: how to get data from PDFs that contain paragraphs

Hi everyone,

Can anyone help me extract data from PDF files that contain several paragraphs, where each paragraph has a heading?

I will pass the heading from an asset, and the bot should be able to scrape the text below that heading.

Can anyone please suggest how to do it?
file-example_PDF_1MB.pdf (1017.7 KB) This is a sample PDF I found online.

Or can we just split the PDF text into paragraphs, using the empty line after each paragraph?
When I was working with read PDF as text, it was not giving me any empty lines after each paragraph.

If your headings are predefined/known, you can do that by splitting your PDF text into an array of lines, searching the array for the heading text, and noting the index of the heading in the array. Then extract all the lines after that index until you reach the next heading.
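
To illustrate the idea outside UiPath, here is a minimal Python sketch of that array search. It assumes the PDF text is already in a string, that each heading sits alone on its own line, and that the full list of headings is known. The names `extract_section`, `pdf_text` and `known_headings` are made up for this example and are not part of any UiPath activity.

```python
def extract_section(pdf_text, heading, known_headings):
    """Return the lines that follow `heading`, up to the next known heading."""
    # Split the extracted text into an array of trimmed lines.
    lines = [line.strip() for line in pdf_text.splitlines()]

    # Find the index of the heading we were given.
    try:
        start = lines.index(heading)
    except ValueError:
        return []  # heading not present in the text

    # Any other known heading marks the end of this section.
    other_headings = {h for h in known_headings if h != heading}

    section = []
    for line in lines[start + 1:]:
        if line in other_headings:
            break
        section.append(line)
    return section
```

Calling `extract_section(text, "Intro", ["Intro", "Details"])` would return the lines between the "Intro" line and the "Details" line. If the start heading is the last one in the document, everything up to the end of the text is returned.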

Hi @avejr748,

How can I find my next heading? It is not a fixed string. Can you help here?
I only know my start heading.

If the headings are not known beforehand, that would be a big challenge: your PDF is already plain text at this point, so there are no identifiers left, unless the PDF follows a standard layout in which there are no line breaks except before the next header.

Hi @avejr748
Is there any way to find the line breaks between paragraphs?

Yes… it would be a line without any characters in it. You can use a regex to match/find it.
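
As a rough sketch of that regex approach, written in Python rather than as a UiPath workflow: the pattern below treats any run of empty or whitespace-only lines as a paragraph separator. The function name `split_paragraphs` is hypothetical, and this only works if the extracted text actually contains blank lines.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs on empty (or whitespace-only) lines."""
    # Normalize Windows line endings so the pattern only deals with "\n".
    normalized = text.replace("\r\n", "\n")

    # A newline, optional whitespace (including more newlines), then a newline.
    parts = re.split(r"\n\s*\n", normalized)

    # Trim each paragraph and drop empty pieces.
    return [part.strip() for part in parts if part.strip()]
```

If the text has no blank lines at all, the function returns the whole text as a single paragraph, so a result with one element is a quick way to confirm that the extraction step dropped the paragraph breaks.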

Hi @avejr748
When I use read PDF … it gives me the strings continuously, without any empty lines. That is the issue I'm having.

I'm using the UiPath PDF activities' Read PDF Text, and it returns the text exactly as it appears in the PDF: every line is on its own line, not one continuous string.