Scrape data from a table in a given PDF

Can anybody tell me how to extract a specific table from a given PDF or a scanned PDF without actually opening the document? With the help of the Read PDF with OCR activity, I can scrape all the data from the PDF. But I need to scrape data from a dynamic table and some other data present in specific regions. I tried “Read text from specific region”, but I don’t know what values to give for Height, Width, X and Y in the input properties panel. Since the document is dynamic and should not be opened in a window, the OCR and screen scraping activities cannot be used. I’ve been trying to complete this project for more than a week. I would be grateful if someone could help me with it.

Hi @SANJAI_M,

Have you tried using the Read PDF Text activity? Sometimes the table comes out as a String with a consistent, logical layout, and you can then use string manipulation to extract the values you need.
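As a rough sketch of that kind of string manipulation: assuming the extracted text keeps each table row on its own line, with columns separated by runs of two or more spaces, a regex split can recover the cells. The example is in Python for illustration only; the function name `split_table_rows`, the `min_gap` parameter, and the sample text are all hypothetical, and a real PDF may not preserve spacing this cleanly. Inside a workflow, the same idea could be applied with .NET's `Regex.Split` on the activity's output string.

```python
import re


def split_table_rows(text, min_gap=2):
    """Split each non-empty line into cells at runs of min_gap or more whitespace.

    Assumption: single spaces occur only inside a cell (e.g. "Blue widget"),
    while columns are separated by at least min_gap spaces.
    """
    gap = re.compile(r"\s{%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        rows.append(gap.split(line))
    return rows


# Hypothetical extracted text; real output depends on the PDF.
sample = (
    "Item          Qty    Price\n"
    "Blue widget   2      10.50\n"
    "\n"
    "Gasket        10     3.25\n"
)

rows = split_table_rows(sample)
for row in rows:
    print(row)
```

If a column can be empty in some rows, splitting on gaps will shift the remaining cells left; in that case, slicing each line at fixed character positions taken from the header row may be more reliable.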

-Joseph

Yes @joseph.yoon, I tried both the Read PDF Text and Read PDF with OCR activities. The Read PDF Text activity returns the whole text of the PDF as its output. The Read PDF with OCR activity works the same way; only page numbers can be given as input, in the Range property. But I want to get text from a specified region. Also, if the table had lines separating the values into rows and columns, I would have used Data Scraping. But the table doesn’t have any lines; there are simply some spaces between the values, which makes it difficult for me to complete the project.

Do you have a sample PDF file you could share?

Sorry @joseph.yoon. It’s an official file that I can’t share.