Online PDF data extraction

Hi All,

I am stuck on the two questions below. Could someone please help me with them?

  1. The PDF I am currently viewing is open in a browser and is non-downloadable. I want to extract all the contents of this online PDF. How can I achieve this?

  2. To extract the text from an image, we tend to use Read PDF With OCR. But if I want to get the image itself, exactly as it appears in the PDF, how can I achieve this?

Request your help here.

Thanks,
Hirunyaa

The browser should open the PDF with your default Adobe Reader, so you should be able to save the PDF by sending the Ctrl+Shift+S keystroke with either Send Hotkey or Type Into. If that does not work, you would need to use the standard element or image activities on the PDF container in the browser (you might need to activate the accessibility assistance with Ctrl+Shift+5).

For the image, you can use Find Image (or Find Element, if elements exist) to locate it and store the result in an element variable (in the Output property). This can be tricky if the image is pixelated, and it could require zooming in. With the element variable, you can then use Take Screenshot and Save Image.

Regards.

Hi Clayton,

Thanks for your response!
I require some more clarification.

The above link is my sample online PDF, which has text content, tabular data, images, etc. Although it has a download button, I am not allowed to download it. So how can I get all of this content from the online PDF into a Word document as it is? I am really confused here.
Could anyone please help me out?

Thanks in advance for the help!

What do you mean by “not allowed”? Data extraction is typically easier from the desktop Adobe Reader application, so I was just wondering. It looks like the keystroke to save it is Ctrl+S.
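If the browser tab points at a direct PDF URL, saving the file could also be sketched outside the UI. The snippet below is only an illustration in Python, not a UiPath activity. It assumes you are permitted to fetch the file and that no login is needed; the function names `looks_like_pdf` and `save_pdf` and the example URL are made up for this sketch.

```python
import urllib.request

PDF_MAGIC = b"%PDF-"  # a well-formed PDF file starts with this marker


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF header marker."""
    return data.startswith(PDF_MAGIC)


def save_pdf(url: str, dest_path: str, timeout: float = 30.0) -> int:
    """Download `url` to `dest_path` and return the number of bytes written.

    Raises ValueError if the response is not a PDF, for example when the
    server returns an HTML viewer page instead of the file itself.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = response.read()
    if not looks_like_pdf(data):
        raise ValueError("response does not look like a PDF")
    with open(dest_path, "wb") as fh:
        fh.write(data)
    return len(data)


# Hypothetical usage (the URL is a placeholder, not the PDF from this thread):
# save_pdf("https://example.com/sample.pdf", "sample.pdf")
```

The header check is there because a "PDF" link in a browser is sometimes an HTML page that embeds a viewer, in which case the saved file would not open in Adobe Reader.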

If you want to extract certain parts of the document, there are a few things you can try. One is to use element activities. This requires that you first activate a setting by pressing Ctrl+Shift+5, which will allow you to select individual elements.
[image: screenshot of the selectable elements in the PDF viewer]

However, elements are not usually that easy to identify in PDFs. So you may need to use Find Image on the graph and then use Take Screenshot with the resulting element variable. This poses its own challenges, because the image needs to be on the screen, so you would need to perform scrolling actions inside a Retry Scope to make sure the image is on the screen and found. Then, to get the text, you can use Read PDF Text and extract the text you need by keywords.
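To make the "extract text by keywords" step a bit more concrete, here is a rough sketch in Python (again, not a UiPath activity). It assumes the PDF text has already been read into a string; the helper name `text_between` and the sample text are invented for illustration.

```python
import re
from typing import List


def text_between(text: str, start_keyword: str, end_keyword: str) -> List[str]:
    """Return every snippet found between start_keyword and end_keyword.

    Matching is non-greedy and spans line breaks; whitespace inside each
    snippet is collapsed to single spaces.
    """
    pattern = re.compile(
        re.escape(start_keyword) + r"(.*?)" + re.escape(end_keyword),
        re.DOTALL,
    )
    return [" ".join(match.split()) for match in pattern.findall(text)]


# Made-up sample standing in for the text read out of a PDF.
sample = "Invoice No: 12345\nDate: 2020-01-15\nTotal:  99.50 USD\n"
print(text_between(sample, "Invoice No:", "Date:"))  # ['12345']
print(text_between(sample, "Total:", "USD"))         # ['99.50']
```

The same idea works with any pair of labels that reliably surround the value you need. If a label is missing from the text, the function returns an empty list rather than raising an error.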

But if all you want to do is convert this document to a Word document, I am not entirely sure how, so you could start from an online search:
https://www.google.com/search?q=open+pdf+in+word&rlz=1C1GCEV_enUS859US859&oq=open+pdf+in+word&aqs=chrome..69i57.2383j0j0&sourceid=chrome&ie=UTF-8

It would require you to save the document either way.

Regards.

It seems like you need a fully featured PDF editor. You could try the one from Movavi: https://pdf.movavi.com/ It has a lot of features and often appears in top-5 lists.