PDF automation: How to get data from PDFs that contain paragraphs

priya_joshi_thaneti · May 29, 2020, 6:48am

Hi everyone,

Can anyone help me get data from PDF files that contain several paragraphs, where each paragraph has a heading?

I will pass the heading from an asset, and the bot should be able to scrape the data below that heading.

Can anyone please suggest how to do this?
file-example_PDF_1MB.pdf (1017.7 KB) This is the sample PDF I found online.

Or can we just divide the PDF into paragraphs, using the empty line after each paragraph?
When I was working with Read PDF Text, it was not giving me any empty lines after each paragraph.

avejr748 · May 29, 2020, 7:44am

If your headings are predefined/known, you can do that by splitting your PDF text into an array of lines, searching the array for the heading text, and noting the index of the heading in the array. Then extract all the lines after that index until you reach the next heading.
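The steps above could be sketched roughly like this. It is written in Python only to illustrate the logic, not as UiPath code; the function name `extract_section` and the `known_headings` list are made up for the example, and it assumes each heading sits alone on its own line in the extracted text.

```python
def extract_section(pdf_text, heading, known_headings):
    """Return the text between `heading` and the next known heading.

    Hypothetical helper: assumes every heading occupies a full line.
    Returns None when the start heading is not found.
    """
    # Turn the extracted text into an array of trimmed lines.
    lines = [line.strip() for line in pdf_text.splitlines()]
    target = heading.strip()

    # Find the index of the start heading in the array.
    try:
        start = lines.index(target)
    except ValueError:
        return None

    # Any other known heading marks the end of the section.
    stop_at = {h.strip() for h in known_headings} - {target}

    body = []
    for line in lines[start + 1:]:
        if line in stop_at:
            break
        body.append(line)
    return "\n".join(body).strip()
```

Called as `extract_section(text, "Intro", ["Intro", "Details"])`, it would return the lines between "Intro" and "Details". In UiPath the same idea maps onto splitting the string on newlines and looping over the resulting array.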

priya_joshi_thaneti · May 29, 2020, 7:57am

Hi @avejr748,

How can I find out what my next heading is? It is not a fixed string. Can you help here?
I only know my start heading.

avejr748 · May 29, 2020, 8:00am

If the headings are not known beforehand, that would be a big challenge: your PDF is already in text form, so there are no identifiers left. The exception is a PDF with a standard layout, where there are no line breaks except just before the next header.

priya_joshi_thaneti · May 29, 2020, 8:18am

Hi @avejr748,
Is there any way to find the line breaks between paragraphs?

avejr748 · May 29, 2020, 9:15am

Yes… it would be a line without any characters in it. You can use a regex to match/find it.
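As a rough illustration of the regex idea, here is a minimal Python sketch (not UiPath code). The names `split_paragraphs` and `paragraph_under` are hypothetical, and it only works if the extracted text actually keeps the blank lines between paragraphs and the heading is the first line of its paragraph.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs wherever a blank line occurs."""
    # Normalize Windows line endings so one pattern covers both cases.
    normalized = text.replace("\r\n", "\n").strip()
    # A newline, optional whitespace, then another newline is a blank line.
    parts = re.split(r"\n\s*\n", normalized)
    return [p.strip() for p in parts if p.strip()]

def paragraph_under(text, heading):
    """Return the body of the paragraph whose first line equals `heading`.

    Hypothetical helper: returns None when no paragraph starts with it.
    """
    for para in split_paragraphs(text):
        first_line, _, rest = para.partition("\n")
        if first_line.strip() == heading.strip():
            return rest.strip()
    return None
```

With this approach only the start heading needs to be known, since the blank line marks where the paragraph ends.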

priya_joshi_thaneti · May 29, 2020, 10:52am

Hi @avejr748,
When I use Read PDF … it gives me the strings continuously, without any empty lines. That is the issue I'm having.

avejr748 · May 29, 2020, 11:09am

I'm using Read PDF Text from the UiPath PDF activities, and it returns the text exactly as it appears in the PDF: every line is on its own line, not one continuous run of text.
