Scanned PDF data extraction


I have a scanned document that was converted to PDF, and I want to extract the person's name, course, and CGPA from it. The position and name of the anchor, and of the target fields, change from document to document. I have attached a sample document.

I tried the Get Text activity, but it fails when the anchor's position or name changes.

How can we achieve this?


Hi @raju_alakuntla

Read the PDF using the Read PDF With OCR activity and store the output in a String variable.
Then use the Find Matching Patterns activity, or an Assign activity with a regex expression, to get the required values.
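
One way to cope with changing anchor names is to list the label variants you expect and let the regex accept any of them. Below is a minimal Python sketch for prototyping such patterns outside Studio. The label variants, the `extract_fields` helper, and the sample text are all assumptions for illustration and are not taken from the attached document.

```python
import re

# Hypothetical label variants; adjust to what the scanned documents actually contain.
ANCHORS = {
    "name": r"(?:student\s+name|name)",
    "course": r"(?:course|programme|program)",
    "cgpa": r"(?:cgpa|gpa)",
}

def extract_fields(text):
    """Return the first value found after each anchor, or None if missing."""
    result = {}
    for field, label in ANCHORS.items():
        # Label, optional ':' or '-', then the rest of the line as the value.
        pattern = rf"\b{label}\b[ \t]*[:\-]?[ \t]*(.+)"
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[field] = match.group(1).strip() if match else None
    return result

# Invented sample text standing in for the OCR output.
sample = "Student Name : Ravi Kumar\nProgramme - B.Tech\nGPA 8.7\n"
fields = extract_fields(sample)
print(fields)
```

Because each field is searched independently, the order of the fields on the page does not matter. The alternation syntax `(?:a|b)` is also valid in .NET regex, so a similar pattern should carry over to an Assign activity.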

Hope it helps!!


You could use OCR to extract the text from the document.
Studio - OCR Activities

And then try to structure the text in some way.
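
As a rough illustration of structuring the OCR text, the lines can be split into label/value pairs so that later steps look fields up by label rather than by position. This Python sketch assumes each field sits on its own line in the form `label: value`; the `to_pairs` helper and the sample text are hypothetical.

```python
def to_pairs(ocr_text):
    """Map lower-cased labels to values for lines shaped like 'label: value'."""
    pairs = {}
    for line in ocr_text.splitlines():
        label, sep, value = line.partition(":")
        # Keep only lines that have a colon with text on both sides.
        if sep and label.strip() and value.strip():
            pairs[label.strip().lower()] = value.strip()
    return pairs

# Invented sample text standing in for the OCR output.
sample = "Course: B.Tech\n  NAME :  Ravi Kumar \n\nCGPA: 8.7\nPage 1\n"
pairs = to_pairs(sample)
print(pairs)
```

Lines without a colon, such as the page footer in the sample, are skipped. Real OCR output is often noisier than this, so treat it as a starting point only.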

You could also use Document Understanding, but there’s a cost associated with that.
Activities - About the PDF Activities Package
Document Understanding - Introduction

Hello @raju_alakuntla ,
You can use the Read PDF Text activity together with regex expressions to extract the required values.

Hi @raju_alakuntla - Is it just that one document, or are there many? Do they all follow the same structure? I think the solutions already mentioned should work. However, if the structure changes, it might be worth investigating document understanding capabilities.

Hi @raju_alakuntla

=> Use the Read PDF With OCR activity and set its properties as shown in the image below.

Use the Tesseract OCR engine and keep its properties as shown below:

Save the output of Read PDF With OCR in a String variable named Text.

=> Use the expressions below in Assign activities:

Assign activity -> Name = System.Text.RegularExpressions.Regex.Match(Text,"(?<=NAME\:\s+)(.*)").Value.Trim()

Assign activity -> Technology = System.Text.RegularExpressions.Regex.Match(Text,"(.*?)(?=\sNAME)").Value.Trim()

Assign activity -> CGPA = System.Text.RegularExpressions.Regex.Match(Text,"(?<=CGPA\s+)(.*)").Value.Trim()
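
To sanity-check these patterns before running the workflow, they can be tried on sample text. The Python sketch below is an approximation, not the UiPath code itself: Python's `re` module rejects the variable-width lookbehinds used above, so capture groups stand in for them. The sample text and the `first_group` helper are invented, and the real OCR output may look different.

```python
import re

# Invented OCR output; the real document layout is an assumption here.
text = "B.Tech Computer Science\nNAME: Ravi Kumar\nCGPA 8.7\n"

def first_group(pattern, source):
    """Return the trimmed first capture group, or '' when nothing matches."""
    match = re.search(pattern, source)
    return match.group(1).strip() if match else ""

# Capture groups replace the .NET lookbehinds (?<=NAME\:\s+) and (?<=CGPA\s+).
name = first_group(r"NAME:\s+(.*)", text)
# The lookahead is kept as written: text on the line just before NAME.
technology = first_group(r"(.*?)(?=\sNAME)", text)
cgpa = first_group(r"CGPA\s+(.*)", text)
print(name, technology, cgpa, sep=" | ")
```

Note that these patterns are case-sensitive and depend on the exact labels NAME and CGPA. If the labels vary in case, passing RegexOptions.IgnoreCase to Regex.Match in the Assign expressions is one option to consider.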


I have uploaded the XAML file for your reference:
Sequence10.xaml (10.3 KB)

Hope it helps!!