I am trying to extract Issue Data from the Emma website, and the CUSIP numbers in the first column are not recognized by the Data Scraping tool. All other columns are extracted perfectly.
To get to the table:
First, thank you for posting the steps to get to the table. This helps us a lot in troubleshooting.
I’ve looked into the UI Tree in UIExplorer, and the CUSIP field is actually an image, rather than text. This is why it is returning as blank. Scraping the text from the image will be complicated, unfortunately.
To get the elements, you will need to iterate over the table rows (starting with row 3) using this selector (modify for use with browsers other than Chrome)
You’ll need a variable in place of the 3 so that a While loop can iterate over the rows, and an OCR activity on that selector to read each image (Tesseract will be sufficient for images like this, but use whichever engine you like). Use the OCR output to get the text data and store it in the empty fields of your table.
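As a rough illustration of that loop logic only (this is Python, not a UiPath workflow), the sketch below assumes a hypothetical selector template with a row placeholder and a hypothetical ocr callable that returns the text for a selector. The actual selector and OCR activity from this thread are not reproduced here.

```python
from typing import Callable, Dict, List

# Hypothetical placeholder template: NOT the selector from this thread.
# The "{row}" slot stands in for the hard-coded 3.
SELECTOR_TEMPLATE = "<webctrl tag='IMG' tableRow='{row}' tableCol='1' />"


def fill_cusips(
    rows: List[Dict[str, str]],
    ocr: Callable[[str], str],
    first_row: int = 3,
) -> List[Dict[str, str]]:
    """Fill the empty 'CUSIP' field of each scraped row using OCR.

    rows      -- records already extracted by data scraping, CUSIP left blank
    ocr       -- hypothetical function: selector string -> recognized text
    first_row -- table row index of the first data row (3 in this thread)
    """
    i = 0
    row_index = first_row
    while i < len(rows):
        selector = SELECTOR_TEMPLATE.format(row=row_index)
        text = ocr(selector)
        # Normalize the OCR output: drop whitespace, use upper case.
        rows[i]["CUSIP"] = "".join(text.split()).upper()
        i += 1
        row_index += 1
    return rows
```

In a real workflow the ocr step would be the OCR activity pointed at the row's image element, and the loop would also need to handle paging.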
Not sure why they’re storing this field as an image, but it makes automation take additional steps here.
Just to make sure you are on the right track: you won’t be able to use Data Scraping if that first column is needed, because it is an image. You will need to automate the scraping manually, including the paging. Let us know if you need any further assistance!
Thank you @bcorrea and @Anthony_Humphries!
I was able to apply the Tesseract OCR screen scraping tool. It worked better when I followed the CUSIP image hyperlink and scraped from the new page (vs. trying to recognize the image in the table).
The quality of recognition is not ideal (especially for CUSIP 14800PEW3 values), so I experimented with different scale values; 35 works best for now. Are there any other settings I can adjust to improve recognition accuracy?
There are other OCR tools, but I’ve found best results tweaking the scale and trying different screen reading methods. Tesseract supports None, Scan, Screen, and Legacy. I’ve had the best luck with Scan and Screen.
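One extra safeguard, not mentioned earlier in this thread: a CUSIP's ninth character is a check digit computed from the first eight, so OCR output can be validated and misreads flagged for a retry at a different scale or profile. Below is a minimal Python sketch of the standard CUSIP check-digit rule. The function names are illustrative, and in a UiPath workflow the same logic would have to be ported to an expression or Invoke Code activity.

```python
def cusip_check_digit(first8: str) -> int:
    """Compute the CUSIP check digit from the first eight characters."""
    special = {"*": 36, "@": 37, "#": 38}
    total = 0
    for i, ch in enumerate(first8.upper()):
        if ch in "0123456789":
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10  # A=10 ... Z=35
        elif ch in special:
            value = special[ch]
        else:
            raise ValueError(f"invalid CUSIP character: {ch!r}")
        if i % 2 == 1:  # every second character (2nd, 4th, ...) is doubled
            value *= 2
        total += value // 10 + value % 10  # add the digits of the value
    return (10 - total % 10) % 10


def is_valid_cusip(text: str) -> bool:
    """Return True if text is nine characters with a matching check digit."""
    s = "".join(text.split()).upper()
    if len(s) != 9 or s[8] not in "0123456789":
        return False
    try:
        return cusip_check_digit(s[:8]) == int(s[8])
    except ValueError:
        return False


print(is_valid_cusip("14800PEW3"))  # True: the CUSIP quoted above
print(is_valid_cusip("148OOPEW3"))  # False: zeros misread as the letter O
```

A failed check only shows that something was misread, not which character, so it works best as a trigger to re-run OCR with other settings or to flag the row for review.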