How to use regular expressions on PDF documents


I have three PDF files from which I need to extract the data below the field called Description, PRODUCT NAME, or Description & Specification of Goods, using regular expressions.

One of the three PDF files is a scanned image. I used ABBYY OCR to read it, but the output is not accurate because some words are misspelled…

Does anyone know how to solve both problems?

These are all the PDF files:
Invoice 5.pdf (51.2 KB)
Invoice 8.pdf (159.6 KB)
Invoice 7.pdf (47.4 KB)


I believe the replies on those posts could be relevant to you too.


You need to identify where the tabular data starts based on some keywords. Then, for each product, split the line into columns on a delimiter and place them in a DataTable. You can then extract the necessary columns easily.
As far as I know, ABBYY OCR extracts text with more accuracy. You may have faced issues while processing the Invoice 7 document because of its legibility.
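
To illustrate the keyword-and-delimiter idea outside the workflow designer, here is a minimal Python sketch. It is not the UiPath implementation. The sample text, the "two or more spaces" delimiter, the "total" stop word, and the function name extract_products are all assumptions made for illustration; the real invoices may be laid out differently.

```python
import re

# Header names taken from the question; compared case-insensitively.
HEADER_KEYWORDS = (
    "description",
    "product name",
    "description & specification of goods",
)

# Assumed delimiter: a tab or a run of two or more spaces between columns.
SPLIT = re.compile(r"\t| {2,}")


def extract_products(text, stop_words=("total",)):
    """Return the values found below the description header (hypothetical helper)."""
    col = None  # index of the description column, unknown until the header is seen
    products = []
    for line in text.splitlines():
        cells = SPLIT.split(line.strip())
        if col is None:
            # Still looking for the header row that marks the start of the table.
            for i, cell in enumerate(cells):
                if cell.lower() in HEADER_KEYWORDS:
                    col = i
                    break
            continue
        first = cells[0].lower()
        # Assumed end of table: a blank line or a line starting with a stop word.
        if first == "" or first.startswith(stop_words):
            break
        if col < len(cells):
            products.append(cells[col])
    return products


# Hypothetical OCR output, not taken from the attached invoices.
sample = """Invoice No: 001
Sl No   Description        Qty   Amount
1       Steel bolt M8      10    25.00
2       Copper wire 2mm    5     40.00
Total                            65.00
"""

print(extract_products(sample))
```

Locating the column from the header row, rather than hard-coding its position, lets the same logic cope with invoices that put the description in a different column.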

@lissynikkytha I tried using ABBYY, but it is not extracting the words properly…


Did you try modifying the scale? Are you facing issues with Invoice 5 and Invoice 8 samples as well?

@lissynikkytha thank you… it is working, but how can I apply a regular expression to extract only the product name from the output text file?

To use regex, use the “Matches” activity, which you can find in the Activities panel by searching for “matches”.
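
As a rough idea of what such a pattern could look like, here is a small Python sketch that collects every match, much as a matches collection would. It is only an illustration: the line layout (serial number, product name, quantity, amount) and the sample text are assumptions, and the pattern should be tested against the real OCR output in the activity's own regex editor before relying on it.

```python
import re

# Assumed line layout: <serial no> <product name> <quantity> <amount with 2 decimals>
# [ \t]+ is used instead of \s+ so that a match never runs across a line break.
LINE_PATTERN = re.compile(
    r"^\d+[ \t]+(.+?)[ \t]+\d+[ \t]+\d+\.\d{2}$",
    re.MULTILINE,
)

# Hypothetical OCR output, not taken from the attached invoices.
ocr_text = """Sl No Description Qty Amount
1 Steel bolt M8 10 25.00
2 Copper wire 2mm 5 40.00
Total 65.00
"""

# Group 1 of each match holds the product name.
names = [m.group(1) for m in LINE_PATTERN.finditer(ocr_text)]
print(names)
```

The header and total lines are skipped because they do not start with a serial number, so only the product rows produce a match.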