How to extract specific data from a screenshot-based PDF

Smitesh_Aher2 · February 5, 2024, 4:41pm

Hi Team,

I have a screenshot-based PDF (two screenshots in one PDF) and want to extract specific data from it. Could you please help me with how to extract the data?

mkankatala · February 5, 2024, 4:47pm

Hi @Smitesh_Aher2

Try reading the PDF with the Read PDF with OCR activity.
Drag and drop the Tesseract OCR engine into it, and set the OCR properties as follows,
→ Select the Scan option in Profile.
→ Give the 2 value in Scale option.

The output of Read PDF with OCR is a String variable. You can use regular expressions to extract the values you need from that output.

Check the below image for better understanding,

Hope it helps!!

Smitesh_Aher2 · February 5, 2024, 5:02pm

Thanks @mkankatala

All the data is extracted, but I want to extract a specific value.
For example, below is the paragraph from the screenshot:
Try to read the pdf by using Read PDF with OCR.
Drag and drop the Tesseract OCR in it, follow the below options for properties of OCR as below,
→ Select the Scan option in Profile.
→ Give the 2 value in Scale option.

"Give" is the key and the remaining highlighted part is the value. How do I extract the value from the paragraph?

mkankatala · February 5, 2024, 5:09pm

Okay @Smitesh_Aher2

Here is the regular expression.
Give is a keyword that is unique in the data, so the regex takes the digits, spaces, and word characters that follow Give.

Check the regular expression below,

Smitesh_Aher2 · February 5, 2024, 5:18pm

Did you put the expression you sent in an Assign activity, after the Read PDF with OCR activity?

Could you please send a screenshot of how to use the expression after the Read PDF with OCR activity?

mkankatala · February 5, 2024, 5:20pm

Assume the Input variable holds the output of the Read PDF with OCR activity.

Use the expression below in an Assign activity,

- Assign -> Output = System.Text.RegularExpressions.Regex.Match(Input,"(?<=Give\s+)[\d\s\w]+").Value
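To see what the pattern returns before wiring it into the workflow, here is a small Python sketch of the same idea, run against the sample paragraph from this thread. It is only an illustration, not UiPath code: Python's re module does not accept the variable-width lookbehind `(?<=Give\s+)` that .NET allows, so the sketch uses a capture group instead. The names `sample_text` and `extract_after_give` are made up for the example.

```python
import re

# Sample paragraph quoted earlier in this thread (stand-in for the OCR output).
sample_text = (
    "Try to read the pdf by using Read PDF with OCR.\n"
    "Drag and drop the Tesseract OCR in it, follow the below options for properties of OCR as below,\n"
    "→ Select the Scan option in Profile.\n"
    "→ Give the 2 value in Scale option.\n"
)


def extract_after_give(text: str) -> str:
    """Return the digits, spaces, and word characters that follow the first 'Give'.

    Python's re module rejects the variable-width lookbehind used in the
    .NET pattern, so a capture group plays the same role here.
    """
    match = re.search(r"Give\s+([\d\s\w]+)", text)
    # Mirror .NET's Match.Value, which is an empty string when nothing matches.
    return match.group(1) if match else ""


print(extract_after_give(sample_text))
```

The match stops at the full stop because `.` is not in the character class. Since `\s` also matches line breaks, the value can run onto the next line when no punctuation ends it first.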

Hope you understand!!

Smitesh_Aher2 · February 5, 2024, 5:41pm

Yes @mkankatala, now I understand.

Can we use the same expression for the paragraph below?

Try to read the pdf by using Read PDF with OCR.
Drag and drop the Tesseract OCR in it, follow the below options for properties of OCR as below,
→ Select the Scan option in Profile.
→ Give the 2 value in Scale option .

I want to extract the part that comes after the highlighted word (Tesseract).

mkankatala · February 5, 2024, 5:45pm

The Assign expression stays the same; only the regular expression inside it changes.

This is the expression,
System.Text.RegularExpressions.Regex.Match(Input,"(?<=Give\s+)[\d\s\w]+").Value

The part inside the double quotes is the regular expression. It changes depending on which data you want to extract.

Which data do you want to extract from the PDF?

Hope you understand!!

Smitesh_Aher2 · February 5, 2024, 5:55pm

The paragraph looks like this,

Try to read the pdf by using Read PDF with OCR.
Drag and drop the Tesseract OCR in it, follow the below options for properties of OCR as below,
Select the Tesseract Scan option in Profile.
Give the Tesseract 2
value in Scale option.

Tesseract appears n times in the paragraph, so treat Tesseract as the key and the text that follows it as the value.
I want to print all the values one by one in a Write Line activity.

mkankatala · February 5, 2024, 6:08pm

As you describe, there are multiple matches in Input and we want to get each one… @Smitesh_Aher2

Try the one below,

- Assign -> Matchedvalues = System.Text.RegularExpressions.Regex.Matches(Input,"(?<=Tesseract\s+).*")

If we want a single match, use the one below, which returns the first match,
System.Text.RegularExpressions.Regex.Match(Input,"(?<=Tesseract\s+).*").Value

If we want all matches, use the expression below,
System.Text.RegularExpressions.Regex.Matches(Input,"(?<=Tesseract\s+).*")

Note: Matchedvalues is a variable of type System.Text.RegularExpressions.MatchCollection, the collection type that Regex.Matches returns to hold every regex match found in the input. To get each item, use a For Each activity to iterate over Matchedvalues, and pass CurrentItem.ToString to a Write Line activity to print the values one by one.
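The collect-then-loop idea can be tried outside Studio with a short Python sketch. This is an illustration under assumptions, not the UiPath workflow itself: Python's re module cannot express the variable-width lookbehind, so a capture group is used, and the helper name `values_after_key` is hypothetical. The sample text is the paragraph posted above.

```python
import re
from typing import List

# Paragraph posted earlier in this thread (stand-in for the OCR output).
sample_text = (
    "Try to read the pdf by using Read PDF with OCR.\n"
    "Drag and drop the Tesseract OCR in it, follow the below options for properties of OCR as below,\n"
    "Select the Tesseract Scan option in Profile.\n"
    "Give the Tesseract 2\n"
    "value in Scale option.\n"
)


def values_after_key(text: str, key: str) -> List[str]:
    """Return the rest of the line after every occurrence of `key`."""
    # `.` does not match a newline by default, so each value ends at the line break.
    pattern = re.escape(key) + r"\s+(.*)"
    return [m.group(1) for m in re.finditer(pattern, text)]


# Equivalent of For Each + Write Line: print the values one by one.
for value in values_after_key(sample_text, "Tesseract"):
    print(value)
```

On this paragraph the loop prints three values, one for each line that contains Tesseract. If the real OCR text contains the keyword in more places than the pasted sample, more values will come back.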

Check the image below, which shows 3 matches,

Hope you understand!!

Smitesh_Aher2 · February 6, 2024, 5:13am

Hi @mkankatala

System.Text.RegularExpressions.Regex.Matches(Input,"(?<=Tesseract\s+).*")

Using the above expression, I am getting 4 values in the Output panel, like below,

Try to read the pdf by using Read PDF with OCR.
OCR in it, follow the below options for properties of OCR as below,
Scan option in Profile.
2

I don’t want the highlighted part in the output.

srinivasmarneni · February 6, 2024, 5:21am

Detailed flow to extract data from a screenshot-based PDF using UiPath:

Open UiPath Studio: Start by creating a new process or opening an existing one where you want to extract data from the PDF.
Install Required Packages:

Go to “Manage Packages” in UiPath Studio.
Install the UiPath.PDF.Activities package.
Install the OCR engine of your choice (like UiPath.TesseractOCR.Activities for Tesseract).

Read PDF with OCR Activity:

Drag and drop the “Read PDF With OCR” activity into your sequence.
In the properties panel, set the “FileName” property to the path of your PDF file.
Select an OCR engine from the “OCR Engine” dropdown (e.g., Tesseract OCR).
Configure the properties of the OCR engine (like “Scale” or “Languages”).

Extract Data:

The “Read PDF With OCR” activity will output a string variable, which you should create and name (e.g., extractedText).
If you need to extract specific information (like dates, names, numbers), you may use an “Assign” activity with a Regex pattern to get the specific data.
- Create a new string variable for the specific data (e.g., specificData).
- Use System.Text.RegularExpressions.Regex.Match(extractedText, yourRegexPattern).Value to assign the value to specificData.
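As a rough illustration of the "specific information" step, the Python sketch below pulls a number and a date out of some text. Everything in it is an assumption made for the example: the OCR text, the field labels, and the helper name `first_group` are invented, and the real patterns depend on what your screenshots contain. The same patterns can be passed to Regex.Match in an Assign activity.

```python
import re

# Hypothetical OCR output; the field names and values are invented for illustration.
extracted_text = "Invoice No: 48213\nDate: 05/02/2024\nTotal: 1,250.00\n"


def first_group(pattern: str, text: str) -> str:
    """Return capture group 1 of the first match, or "" when there is none."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


# A number that follows a fixed label.
invoice_number = first_group(r"Invoice No:\s*(\d+)", extracted_text)
# A date in dd/mm/yyyy or mm/dd/yyyy layout, wherever it appears.
invoice_date = first_group(r"\b(\d{2}/\d{2}/\d{4})\b", extracted_text)

print(invoice_number)
print(invoice_date)
```

Anchoring on a label (as with the invoice number) tends to be safer on OCR text than matching a bare shape (as with the date), because a label narrows down which occurrence is picked.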

Conditional Logic for Each Screenshot:

If there are two screenshots in the PDF, and you only need data from one, use an “If” activity to determine which part of the extractedText to process.
Use the “Substring” method or additional “Regex.Match” calls to isolate the part of the text you need from each screenshot.

Output the Data:

Use “Write Line” activities to output the extracted data to the UiPath output panel for verification.
Alternatively, use “Write Range” or “Write Cell” activities to write the data to an Excel file or other formats as needed.

mkankatala · February 6, 2024, 5:23am

See the output below. I am getting only three matches, and I don’t know why you are getting four… @Smitesh_Aher2

workflow -

Output -

system · February 9, 2024, 5:23am

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
