Extract Text from PDF / specific elements from pdf / Selecting each paragraph / Accessibility Settings

pdf
activities

#1

Query: I need to extract information from a PDF using the Get Text / Get Full Text activity. The activity does not let me scrape individual pieces of text; it selects the whole document instead.

Solution:

The setting below overrides the reading order only for the PDF that is currently open.

Edit -> Accessibility -> Change Reading Options -> select “Left-to-right, top-to-bottom reading order”

To apply the setting to all PDFs, follow these steps:
Edit -> Preferences (Ctrl + K) -> Reading -> select the same reading order option -> OK.


#2

Hi, I have a situation where I need to read the individual elements in a PDF document. I applied the Accessibility setting and tried all the activities (Get Text, Get Full Text, screen scraping, etc.), but I end up scraping the PDF in portions, and I cannot split the resulting strings because the values do not come out in order. The screenshot shows what the activities return. Attached are the PDF I am trying to read values from and a screenshot showing what I mean by portions.

Please help me get the individual elements, such as the Agent name and Agent number.


Cotizacion.pdf (109.8 KB)


#3

Hi Ayerrams,
I am not aware of those settings, but I have a similar use case where I need to take data from a PDF. I use the Read PDF activity to get the whole PDF as a string, and then perform string operations on it to get the relevant data. Let me know if this helps.
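The "read everything as a string, then apply string operations" idea can be sketched outside of UiPath. The Python below is only an illustration: the function name `extract_number_after_label` and the sample text are hypothetical, and the sketch assumes the number appears directly after its label in the extracted text.

```python
import re

def extract_number_after_label(text, label):
    """Return the first run of digits that follows `label`, or None if absent."""
    # Allow optional whitespace and an optional colon between label and number.
    pattern = re.escape(label) + r"\s*:?\s*(\d+)"
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Hypothetical sample standing in for the string returned by a PDF read.
sample_text = "Agente: 12345 Folio: 987"
print(extract_number_after_label(sample_text, "Agente"))
print(extract_number_after_label(sample_text, "Folio"))
```

The same regular expression could be used from a workflow, provided the extracted text keeps each label next to its value.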


#4

Hi Harinath, I need each value in Excel individually, so if I scrape the whole document, how do I know where each value ends? For example, I need the Agente name, which contains spaces, so how do I know where the name ends?
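One common way to find the end of a value that contains spaces is to treat the next label as the delimiter. The sketch below is a hypothetical illustration in Python: the helper `extract_between` and the labels "Nombre" and "Folio" in the sample are assumptions, and it only works if the extracted text keeps the labels in a predictable order.

```python
import re

def extract_between(text, start_label, end_label):
    """Return the text between two labels, stripped, or None if not found."""
    # Lazily capture everything after start_label up to the next end_label.
    pattern = (
        re.escape(start_label) + r"\s*:?\s*(.*?)\s*" + re.escape(end_label)
    )
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

# Hypothetical sample; the real layout of the PDF text may differ.
sample_text = "Agente: 12345 Nombre: Juan Carlos Perez Folio: 987"
print(extract_between(sample_text, "Nombre", "Folio"))
```

If the labels do not appear in a fixed order in the extracted text, this approach will not be reliable on its own.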


#5

Can you highlight the text you need in the PDF? I will then provide a sample workflow.


#6

Here I need to get the Agente number, then the Agente name, then the Folio, and likewise every other piece of information in the PDF.


#7

Main.xaml (6.2 KB)
Please find the XAML attached. Provide the PDF path in the Read PDF activity. It is a sample that reads the Agente; you can do the same for the other required information.


#8

Hi Harinath. First, reading the whole PDF document at once is a bad idea in my case, because the information does not come out in order. Second, you showed me how to get the Agent number, which I am already able to do. Can you do it for the Agent name?

I am attaching the text file I got after reading the whole PDF.
PdfText.zip (2.0 KB)