I am a complete newbie in RPA and am trying to build my first robot. I hope I am in the correct section of the forum.
I am trying to scrape data from PDF files and list it in Excel. I saw in a tutorial that we should normally change Acrobat’s accessibility settings with the CTRL+SHIFT+5 shortcut (Change Reading Options) and activate Tagged Reading Order.
In my version of Acrobat there is no Tagged Reading Order; instead I have the options below.
I found that “Use reading order in raw print stream” works to some extent, but now the selection is not precise: it selects too many rows, lines, and anchors together:
It should only get the value next to Total Amount (USD), but it gets 2 rows together with their headers. (I couldn’t upload the second picture because of forum rules.)
Welcome to the community!
So your PDF is an image-type (scanned) one, right? Does it span several pages? If you need a lot of information from your PDF, a better option would be to read the entire document first and pick out the values afterwards. But if you are happy with the results so far, we can work on that Get Text so it extracts only the text you need.
It’s in a searchable format, but normally UiPath’s Get Text activity doesn’t scrape one specific piece of text; it gets all the text at once. Or at least I don’t know another way.
Yes, Get Text will get everything. It is the developer’s choice whether to scrape the application window that has the PDF open (easier but slower) or to read the whole text and then extract the parts we want (faster but more work).
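To illustrate the second option (read everything, then extract), here is a minimal sketch in Python rather than in UiPath itself. It assumes the full PDF text is already available as a string, for example from a Read PDF Text step. The function name extract_amount and the sample text are made up for illustration; only the label "Total Amount (USD)" comes from the post above, and the real layout of the PDF may need a different pattern.

```python
import re
from typing import Optional


def extract_amount(full_text: str, label: str = "Total Amount (USD)") -> Optional[str]:
    """Return the number that follows `label` in the text, or None if absent.

    Hypothetical helper for illustration only; not part of UiPath.
    """
    # re.escape makes the parentheses in the label match literally.
    # After the label we allow optional whitespace (including a line break)
    # and an optional colon, then capture digits with , or . separators.
    pattern = re.escape(label) + r"\s*:?\s*([\d.,]*\d)"
    match = re.search(pattern, full_text)
    return match.group(1) if match else None


# Invented sample text standing in for the whole-document output.
sample_text = (
    "Invoice No  Date\n"
    "12345  2020-01-01\n"
    "Total Amount (USD) 1,250.00\n"
)

print(extract_amount(sample_text))
```

The same regular expression idea should carry over to a regex step inside the workflow, so that only the captured value is written to Excel instead of the surrounding rows and headers.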