Extract Text from PDF / specific elements from pdf / Selecting each paragraph / Accessibility Settings

Query: I need to extract information from a PDF using the Get Text / Get Full Text activity. The activity does not let me scrape individual pieces of text; it selects the whole document instead.

Solution:

The setting below overrides the reading order only for the specific PDF that is open.

Edit → Accessibility → Change Reading Options → select “Left-to-right, top-to-bottom reading order”

To apply the setting to all PDFs, follow these steps:

Edit → Preferences (Ctrl + K) → Reading → set the reading order there → OK.

Hi, I have a situation where I need to read the individual elements in a PDF document. I applied the Accessibility setting and tried all the activities (Get Text, Get Full Text, scraping, etc.), but I end up scraping the PDF in portions, and the strings cannot be split because they do not come out in order. The screenshot shows what the activities return. Attached are the PDF I am trying to get values from and a screenshot showing what I mean by portions.

Please help me get the individual elements, such as the Agent and the Agent's number.


Cotizacion.pdf (109.8 KB)

Hi Ayerrams,
I am not aware of those settings, but I have a similar use case where I need to take data from a PDF. I use the Read PDF activity to get the whole PDF as a string and then perform string operations to pull out the relevant data. Let me know if this helps.

Hi Harinath, I need each word individually in Excel, so if I scrape the whole document, how do I know where each value ends? For example, the Agente name contains spaces, so how do I know where the name ends?

As a sample, can you highlight the text you need in the PDF? I will then provide a sample workflow.

Here I need to get the Agente number, then the Agente name, then the Folio, and likewise every other piece of information in the PDF.

Main.xaml (6.2 KB)
Please find the XAML attached and provide the PDF path in the Read PDF activity. It is a sample that reads the Agente; you can do the same for the other required information.
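
The contents of the attached workflow are not shown in the thread. As a rough sketch of the string-operation idea only (not the attached XAML), one way to find where a value with spaces ends is to stop at the next known label. The sample line, the label names and the helper `field_between` are assumptions for illustration; the real text from Cotizacion.pdf may be laid out differently.

```python
import re

def field_between(text, label, next_labels):
    """Return the value after `label`, cut at the nearest following label or line break."""
    start = re.search(re.escape(label) + r"\s*:?\s*", text, flags=re.IGNORECASE)
    if start is None:
        return None
    rest = text[start.end():]
    # Candidate end positions: every following label that occurs, plus the line break.
    stops = []
    for other in next_labels:
        hit = re.search(re.escape(other), rest, flags=re.IGNORECASE)
        if hit:
            stops.append(hit.start())
    newline = rest.find("\n")
    if newline != -1:
        stops.append(newline)
    end = min(stops) if stops else len(rest)
    return rest[:end].strip()

# Hypothetical extracted line; not taken from the real PDF.
sample = "Agente: 12345 Nombre: Juan Carlos Perez Folio: A-778\n"
name = field_between(sample, "Nombre", ["Agente", "Folio"])
```

This only works when the labels appear in the extracted text next to their values; if the reading order scrambles them, the text has to be fixed first.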

Hi Harinath, first, reading the whole PDF document at once is a bad idea in my case, because the information does not come out in order. Second, you showed me the Agent number, which I am already able to get. Can you do it for the Agent name?

Attaching the text file which I got after reading the whole PDF.
PdfText.zip (2.0 KB)

Can you please share your workflow?

Hello Niranjan,

What help do you need? I can't share my workflows as they are confidential.

Thanks,
Hari

I want to locate the highlighted text in a PDF file and extract the whole sentence containing that highlighted word.

Please refer to the attachment.

Did you get the output? If yes, please share!
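
A minimal sketch of the sentence-extraction half of this request, assuming the PDF text has already been read into a string and the highlighted word is already known. It does not detect highlight annotations in the PDF itself. The function name `sentences_containing` and the sample text are made up for illustration.

```python
import re

def sentences_containing(text, word):
    """Return every sentence in `text` that contains `word` as a whole word."""
    # Collapse line breaks, since extracted PDF text often wraps mid-sentence.
    flat = re.sub(r"\s+", " ", text).strip()
    # Naive sentence split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# Hypothetical extracted text with a line break inside a sentence.
text = "The pump failed on Monday. The copper valve\nwas replaced. No further issues."
matches = sentences_containing(text, "copper")
```

The split is deliberately simple and will break on abbreviations such as "e.g."; it is meant as a starting point only.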

Hi, I have one small doubt regarding extracting the same type of element from multiple PDFs. My scenario is that I have a thousand PDFs, and I have to extract the material type from each PDF and save it to Excel. In one PDF the material type is copper, and in another it is zinc, so I have a thousand PDFs with different material types. I have used the Anchor Base activity inside the snippet. My doubt is how to edit the selector in Get OCR Text so that I can keep one common Get OCR Text action for all PDFs and get the material types for all 1000 PDFs.

Can you share your workflow?

No, I cannot, because it contains client data. I have tried Google and Microsoft OCR, as the PDFs are scanned copies. The challenge I am facing is that there are 1000 PDFs, each containing a material type, and I need to extract the material type from all of them with one common bot. However, the profile and scale change for each PDF when I try to extract the material type. How can I develop a common bot that keeps a fixed scale and profile for all the PDFs? I tried using a snippet and it did not work. Is there any way, other than Google and Microsoft OCR, to get the data from the scanned copies with one common bot for all 1000 PDFs?

Hi, I have a PDF containing n headers (personal details, office details, etc.). I want to get the personal details and the office details in a single run of the PDF, without passing an array of objects. I have a solution that works when I give it both personal details and office details, but if the office details are at the end of the PDF and the personal details at the start, I do not need the data in between the two sections. How can I solve this? Please help me.
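
One alternative to per-document selectors, sketched here under assumptions, is to OCR each whole page to text and then search that text for the label, so that scale and position no longer matter. The label wording "Material Type", the file names and the helper names below are hypothetical, and the OCR step itself is not shown.

```python
import re

# Tolerates extra spaces, upper or lower case, and ':' or '-' after the label.
MATERIAL_PATTERN = re.compile(r"Material\s*Type\s*[:\-]?\s*([A-Za-z]+)", re.IGNORECASE)

def extract_material(ocr_text):
    """Return the word following the 'Material Type' label, or None if absent."""
    match = MATERIAL_PATTERN.search(ocr_text)
    return match.group(1) if match else None

def material_rows(texts_by_file):
    """Build (file name, material) rows, ready to be written to a spreadsheet."""
    return [(name, extract_material(text)) for name, text in texts_by_file.items()]

# Hypothetical OCR output for three scanned PDFs.
ocr_results = {
    "doc1.pdf": "Order 17\nMaterial Type: Copper\nQty 4",
    "doc2.pdf": "MATERIAL  TYPE - zinc",
    "doc3.pdf": "No label on this page",
}
rows = material_rows(ocr_results)
```

Files where the label is not found come back with `None`, which makes it easy to list them for manual review.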
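
A rough sketch of one way to do this, assuming the full PDF text is already available as a string and all header names are known in advance: record where each header occurs, then cut each section at the next header so that unwanted sections in between are dropped. The header names, sample text and `split_sections` helper are illustrative assumptions.

```python
def split_sections(text, headers):
    """Map each header found in `text` to the text between it and the next header."""
    lower = text.lower()
    found = []
    for header in headers:
        pos = lower.find(header.lower())
        if pos != -1:
            found.append((pos, header))
    found.sort()  # order headers by where they appear in the document
    sections = {}
    for i, (pos, header) in enumerate(found):
        end = found[i + 1][0] if i + 1 < len(found) else len(text)
        sections[header] = text[pos + len(header):end].strip()
    return sections

# Hypothetical extracted text with an unwanted section in the middle.
text = (
    "Personal Details\nName: A\n"
    "Education Details\nDegree: B\n"
    "Office Details\nCompany: C\n"
)
all_headers = ["Personal Details", "Education Details", "Office Details"]
sections = split_sections(text, all_headers)
wanted = {h: sections[h] for h in ("Personal Details", "Office Details")}
```

Every header has to be listed, including the unwanted ones, because each header marks where the previous section stops.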

Here is my solution. You need to flatten your PDF so that the scanned images are eliminated and individual elements can be selected.
Try printing your PDF to PDF and check. I faced the same issue, and this fixed it for some of the PDFs.

Thanks

Can anyone suggest a solution for the above problem? I am facing the same hurdle: I am not able to select each word as an element; UiPath extracts the whole paragraph or line together.

Please review my post for more details.

Hi,
I have a PDF from which I want to extract some product details such as price, quantity, etc. I convert the PDF to a string, store it in a file, and do some string operations, but the problem is that the PDF varies for each number. How can I handle this?

Buddy, I have also been badly stuck on a similar problem for the last two days. Could you please tell me how you achieved or implemented this bot? I also need to extract a few specific elements from a scanned PDF, but I am having a lot of trouble with selectors. Earlier everything worked fine for me with Google OCR, but after one month the Google OCR expired, and I am not able to do this with other OCR engines. If you can share your workflow or technique with me, that would be very helpful.

Thanks in advance.