Scanned PDF data extraction

raju_alakuntla · April 19, 2024, 6:26am

Hi,

We have a scanned document that was converted to PDF, and we want to extract the person's name, course, and CGPA from it. The position and the name of the anchor and of the targets change from document to document. I have attached a sample document.

I tried the Get Text activity, but it fails when the anchor position or name changes.

How can we achieve this?

Regards,

pravallikapaluri · April 19, 2024, 6:34am

Hi @raju_alakuntla

Read the PDF using the Read PDF With OCR activity and store the output in a String variable.
Then use the Find Matching Patterns activity, or an Assign activity with a regex expression, to get the required values.

Hope it helps!!

sven.wullum1 · April 19, 2024, 6:36am

Hi!
You could use OCR to extract the text from the document.
Studio - OCR Activities (uipath.com)

And then try to structure the text in some way.
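
One way to "structure the text" is to split the OCR output into label/value pairs. The snippet below is only a minimal sketch in Python for illustration (it is not a UiPath activity), and it assumes the OCR text has one `Label: value` pair per line. The sample text and the function name `ocr_text_to_fields` are made up for this example; the real document may be laid out differently.

```python
def ocr_text_to_fields(text):
    """Turn 'Label: value' lines into a dict keyed by the upper-cased label."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and lines that carry no label separator
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons
        label, value = line.split(":", 1)
        fields[label.strip().upper()] = value.strip()
    return fields


# Hypothetical OCR output, not taken from the attached document
sample = """
Name: Jane Doe
Course : Computer Science
CGPA: 8.7
"""

fields = ocr_text_to_fields(sample)
print(fields)
```

Once the text is in a dictionary like this, a field can be looked up by label regardless of where it appears on the page.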

You could also use document understanding, but there’s a cost associated with that.
Activities - About the PDF Activities Package (uipath.com)
Document Understanding - Introduction (uipath.com)

Abhijna_Naik98 · April 19, 2024, 6:38am

Hello @raju_alakuntla ,
You can use Read PDF Text Activity and use Regex Expressions to extract the required values.

Julian_Muhlbauer · April 19, 2024, 6:40am

Hi @raju_alakuntla - Is it just that one document, or are there many? Do they all follow the same structure? I think the solutions already mentioned should work. However, if the structure changes, it might be worth investigating document understanding capabilities.

Parvathy · April 19, 2024, 6:50am

Hi @raju_alakuntla

=> Use the Read PDF With OCR activity with the Tesseract OCR engine. [Screenshots: Read PDF With OCR properties and Tesseract OCR properties]

Save the output of Read PDF With OCR in a String variable named Text.

=> Use the following expressions in Assign activities:

Assign activity -> Name = System.Text.RegularExpressions.Regex.Match(Text,"(?<=NAME\:\s+)(.*)").Value.Trim()

Assign activity -> Technology = System.Text.RegularExpressions.Regex.Match(Text,"(.*?)(?=\sNAME)").Value.Trim()

Assign activity -> CGPA = System.Text.RegularExpressions.Regex.Match(Text,"(?<=CGPA\s+)(.*)").Value.Trim()

I have uploaded the XAML file for your reference:
Sequence10.xaml (10.3 KB)

Hope it helps!!
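
The expressions above assume the labels are exactly `NAME:` and `CGPA`. Since the original question says the anchor name can change, one possible extension is to accept several label variants per field. The sketch below uses Python purely for illustration; the label lists in `ANCHORS`, the helper names, and the sample text are all hypothetical and would need to be replaced with the labels that actually occur in the documents. The pattern uses only basic constructs (alternation, `\b`, a capture group), so it should carry over to `System.Text.RegularExpressions.Regex.Match` with `RegexOptions.IgnoreCase`.

```python
import re

# Hypothetical label variants per field; adjust to the real documents
ANCHORS = {
    "name": ["NAME OF THE STUDENT", "STUDENT NAME", "NAME"],
    "course": ["COURSE", "PROGRAMME", "BRANCH"],
    "cgpa": ["CGPA", "C.G.P.A"],
}


def build_pattern(labels):
    # Try longer labels first so "STUDENT NAME" wins over plain "NAME"
    ordered = sorted(labels, key=len, reverse=True)
    alternatives = "|".join(re.escape(label) for label in ordered)
    # Label, optional colon or hyphen, then the rest of the line as the value
    return re.compile(
        rf"\b(?:{alternatives})[ \t]*[:\-]?[ \t]*(.+)", re.IGNORECASE
    )


def extract_fields(text):
    result = {}
    for field, labels in ANCHORS.items():
        match = build_pattern(labels).search(text)
        # None marks a field whose anchor was not found
        result[field] = match.group(1).strip() if match else None
    return result


# Hypothetical OCR output with differently worded anchors
sample = "Student Name - Jane Doe\nCourse: B.Tech Computer Science\nC.G.P.A : 8.7"
print(extract_fields(sample))
```

A field that comes back as `None` signals that none of the listed anchors matched, which is a reasonable trigger for manual review or for trying Document Understanding as suggested earlier in the thread.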
