I am using Screen Scraping to scrape PDF data. It works fine. But when I use the same scenario for another PDF, it shows an error.
How can I make screen scraping dynamic, so it works for all PDF files?
Please give suggestions.
Thanks in advance.
@ClaytonM @ecarles @Pablo_Sanchez @bagishojha
It depends on the PDF files. Some are easily scrapable, and some are not and require more string manipulation. Also, if you are trying to scrape using elements, make sure you have assistive technology turned on for the file (I think it is enabled with Ctrl+5 or something).
If you can provide the error you get or any differences between the PDF files, we might be able to make some suggestions.
Thanks for the reply…
In my case, reading the PDF is not the problem; I am scraping the data from one PDF. But when I try another PDF with the same workflow, it gives some kind of selector error.
How can I make it dynamic?
Can you show the selector you are using?
It could be that you have the filename in the selector, which changes from file to file.
If that is the case, though, you would need to replace the filename with either a wildcard or, preferably, a variable that represents the filename, so it will work with multiple PDFs open. You can do this by clicking inside the Selector property on the right side and editing the selector as a string (i.e. "< title='*"+invoice+"*' />").
Thanks for the reply.
I will try.
Above is my actual selector.
I am saving the file name in the fileNAME variable.
How exactly can I edit it?
I am trying, but it gives an error.
Please give suggestions.
Typically, to include a variable in the Selector, you need to edit the selector as a string. To do so, click in the Selector property as shown in the image previously, but do not click the Edit button with the 3 dots. You will notice that in the Selector property, the selector is a string surrounded by quotes. If you remove a character, click outside of the box, and then click Edit, it will bring up the Expression editor, where you can edit the string more freely (just make sure you put back whatever character you deleted). Alternatively, you can simply edit the string directly in the Selector property box, but it is very small.
All in all, you will end up having a selector string that looks like this:
"<wnd app='acrord32.exe' cls='AcrobatSDIWindow' title='"+fileNAME+"*' />"
However, it is good practice to remove the extension from the filename and surround it with a wildcard on both sides.
With that in mind, we can use Path.GetFileNameWithoutExtension(), and it will look like this:
"<wnd app='acrord32.exe' cls='AcrobatSDIWindow' title='*"+Path.GetFileNameWithoutExtension(fileNAME)+"*' />"
Additionally, if you want, you can create a string variable to store this selector before you use it, so you can use it on multiple activities with better maintainability.
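As a rough sketch of that idea, the selector can be built once and kept in a string variable. The function name BuildPdfSelector, the variable pdfSelector, and the sample file name are made up for illustration. The app and cls values are simply copied from the selector above and may differ for your PDF viewer.

```vb
Imports System
Imports System.IO

Module SelectorSketch
    ' Builds the window selector for a PDF opened in the viewer.
    ' BuildPdfSelector is a hypothetical helper, not a UiPath API.
    Function BuildPdfSelector(filePath As String) As String
        ' Drop the extension so the title matches whether or not the viewer shows ".pdf"
        Dim title As String = Path.GetFileNameWithoutExtension(filePath)
        ' Wildcards on both sides tolerate extra text in the window title
        Return "<wnd app='acrord32.exe' cls='AcrobatSDIWindow' title='*" & title & "*' />"
    End Function

    Sub Main()
        Dim fileNAME As String = "invoice_001.pdf" ' hypothetical example file name
        Dim pdfSelector As String = BuildPdfSelector(fileNAME)
        Console.WriteLine(pdfSelector)
        ' Prints: <wnd app='acrord32.exe' cls='AcrobatSDIWindow' title='*invoice_001*' />
    End Sub
End Module
```

Inside a workflow you would not need the module. Assigning the same expression to a String variable in an Assign activity, and then using that variable in the Selector property of each activity, should have the same effect, assuming System.IO is imported in the project.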
If you receive any errors, feel free to post them and I’ll see if I can help identify the issue.
Thanks for the suggestion.
It is working fine.
Now I am using screen scraping on data in two different formats. Can I scrape both effectively? Screen scraping does not give me perfect data.
For the 1st image it gives me the exact output, but for the second image the output is some unreadable text.
I’m assuming it’s an image and you must use OCR?
You might need to play with the zoom on the document to make the text bigger or smaller, and try out various scales in your OCR engine. You may also try different OCR engines; Abbyy is probably the best if you have access to it.
Thanks for the suggestion, @ClaytonM.
Hey, I am scraping elements from PDFs, but the elements in the two PDFs have different clipping regions.
Is there any solution for this?
Can we dynamically scrape the elements if the clipping region is different?
I am attempting to do the same. Which screen scraping activity are you using to dynamically scrape the PDFs? Are you using relative scraping?
You can use Scrape Relative. It works fine.
I am able to get the relative scrape to work, but now I am running into the issue of having to scroll through the PDF to find the items. Is there a way to perform this without having to go through all the pages of the PDF?