I am a bit unclear about how to efficiently extract data from scanned PDF files. I have multipage brokerage account statements that are scanned PDFs. Each page is numbered "Page X of Y" (e.g., Page 1 of 4, Page 2 of 4, etc.) at the bottom right. At the top left of each page there is an account number, e.g., “Account Number 1111-22222”. I want to extract the account number and the “Page X of Y” text from each page.
So, for this 4-page document example, I expect to get extracted data as follows:
“Account Number 1111-22222” “Page 1 of 4”
“Account Number 1111-22222” “Page 2 of 4”
“Account Number 1111-22222” “Page 3 of 4”
“Account Number 1111-22222” “Page 4 of 4”
How can I extract these fields efficiently? Should I OCR the entire document using ReadPdfWithOcr then parse the result to extract my data? How can I best extract only the fields I care about? Is screen scraping an appropriate choice? Thanks in advance.
You can use the ReadPdfWithOCR activity and store its output in a String variable, e.g. strOut. Then, to extract the account number and "Page X of Y", you can use the Matches activity.
Use strOut as the input to the Matches activity and pass the patterns below:
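The patterns from this reply are not shown in the thread. As a rough illustration of the approach only, here is a minimal Python sketch of regex extraction over OCR output. The two patterns, the function name extract_fields, and the sample text are all assumptions based on the example in the question ("Account Number 1111-22222", "Page 1 of 4"); they are not the patterns the reply intended, and real OCR output may need looser matching.

```python
import re

# Hypothetical patterns inferred from the example in the question;
# the original reply's patterns are not available.
ACCOUNT_RE = re.compile(r"Account\s+Number\s+(\d+-\d+)", re.IGNORECASE)
PAGE_RE = re.compile(r"Page\s+(\d+)\s+of\s+(\d+)", re.IGNORECASE)


def extract_fields(ocr_text):
    """Return (account_numbers, page_markers) found in the OCR text.

    page_markers is a list of (page, total) integer pairs.
    """
    accounts = ACCOUNT_RE.findall(ocr_text)
    pages = [(int(x), int(y)) for x, y in PAGE_RE.findall(ocr_text)]
    return accounts, pages


# Made-up sample standing in for the OCR output (strOut in the reply).
sample = (
    "Account Number 1111-22222 some statement text Page 1 of 4\n"
    "Account Number 1111-22222 more statement text Page 2 of 4\n"
)
accounts, pages = extract_fields(sample)
```

In UiPath the same kind of patterns would go into the Matches activity; the `\s+` between words is there because OCR often produces irregular spacing.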
That’s what I am doing currently. The problem is that the OCR process takes too long. I suspect this is because the OCR engine first has to read everything on each page. If it instead read only the data fields I need, I think it might run faster. Any thoughts?
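The idea of reading only the fields of interest can be sketched as cropping each page image to two small rectangles (top left for the account number, bottom right for the page marker) before handing them to an OCR engine. The Python sketch below only computes those rectangles; the function name field_regions, the 40% by 10% region size, and the 2550 x 3300 pixel page (a letter-size page at 300 dpi) are assumptions for illustration, and the actual cropping and OCR step is not shown.

```python
def field_regions(page_width, page_height, width_pct=40, height_pct=10):
    """Return crop boxes (left, top, right, bottom) in pixels.

    The region size is an assumed fraction of the page; adjust it to
    the real layout of the statements.
    """
    w = page_width * width_pct // 100
    h = page_height * height_pct // 100
    return {
        # Top-left corner, where the account number is printed.
        "account_number": (0, 0, w, h),
        # Bottom-right corner, where "Page X of Y" is printed.
        "page_number": (page_width - w, page_height - h,
                        page_width, page_height),
    }


# Assumed page size: 8.5 x 11 inches scanned at 300 dpi.
regions = field_regions(2550, 3300)

# Fraction of the page area covered by the two boxes together.
covered = sum((r - l) * (b - t) for l, t, r, b in regions.values())
area_fraction = covered / (2550 * 3300)
```

The boxes use the (left, upper, right, lower) order that Pillow's Image.crop expects. With these assumed sizes the two regions cover 8% of the page, but whether OCR time drops in proportion depends on the engine and on the cost of rendering each page to an image, so it would need to be measured.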
Hi @amodsinghal, try the Document Understanding feature in UiPath.
I am looking into Document Understanding. However, it appears that even here it first OCRs the entire document, after which I can apply different methods to extract the data of interest. Since the OCR step is what takes the time (for example, OCR’ing a 10-page document takes about 10 times longer than a 1-page document), I don't understand how this would be any faster. What am I missing? Thanks.