How to extract specific data from a screenshot-based PDF

Hi Team,

I have a PDF made of screenshots (two screenshots in one PDF) and want to extract specific data from it. Could you please help me with how to extract that data?

Hi @Smitesh_Aher2

Try reading the PDF with the Read PDF with OCR activity.
Drag and drop the Tesseract OCR engine into it and set the OCR properties as below,
→ Select the Scan option in Profile.
→ Give the 2 value in Scale option.

The output of Read PDF with OCR is a String variable. You can use regular expressions to extract the values you need from that output.

Check the image below for a better understanding:
[image]

Hope it helps!!

Thanks @mkankatala

All the data is extracted, but I want to extract one specific value.
Below is the paragraph from the screenshot.

For example:
Try to read the pdf by using Read PDF with OCR.
Drag and drop the Tesseract OCR in it, follow the below options for properties of OCR as below,
→ Select the Scan option in Profile.
→ Give the 2 value in Scale option.

"Give" is the key and the remaining highlighted part after it is the value, so how do I extract that value from the paragraph?

Okay @Smitesh_Aher2

Here is the regular expression.
"Give" is a keyword that is unique in the data, so the regex simply takes the digits, spaces, and word characters that come after "Give".

Check the regular expression below:
(?<=Give\s+)[\d\s\w]+

Did you put this expression in an Assign activity, after the Read PDF with OCR activity?

Could you please send a screenshot showing how to use the expression after the Read PDF with OCR activity?

Assume the Input variable holds the output of the Read PDF with OCR activity.

Use the expression below in an Assign activity:

- Assign -> Output = System.Text.RegularExpressions.Regex.Match(Input,"(?<=Give\s+)[\d\s\w]+").Value
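
If you want to try the idea outside UiPath first, here is a minimal Python sketch. It is not the Assign expression itself: Python's built-in re module does not accept variable-width look-behinds such as (?<=Give\s+), so the sketch uses a capture group after Give\s+ instead. The sample text is the paragraph from this thread, and the function name extract_after_key is made up for illustration.

```python
import re

# Sample OCR text from this thread; in the workflow it would come from Read PDF with OCR.
ocr_text = (
    "Try to read the pdf by using Read PDF with OCR.\n"
    "Drag and drop the Tesseract OCR in it, follow the below options for properties of OCR as below,\n"
    "→ Select the Scan option in Profile.\n"
    "→ Give the 2 value in Scale option.\n"
)


def extract_after_key(text, key):
    """Return the run of word characters and whitespace after `key`, or '' if absent."""
    # The capture group stands in for the .NET look-behind (?<=key\s+).
    match = re.search(re.escape(key) + r"\s+([\w\s]+)", text)
    return match.group(1) if match else ""


value = extract_after_key(ocr_text, "Give")
print(value)  # the 2 value in Scale option
```

The match stops at the full stop because "." is neither a word character nor whitespace.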

Hope you understand!!

Yes @mkankatala, now I understand.

Can we use the same expression for the paragraph below?

Try to read the pdf by using Read PDF with OCR.
Drag and drop the Tesseract OCR in it, follow the below options for properties of OCR as below,
→ Select the Scan option in Profile.
→ Give the 2 value in Scale option .

I want to extract the part that comes after the highlighted word (Tesseract).

The expression stays the same; only the regular expression inside it changes.

This is the expression:
System.Text.RegularExpressions.Regex.Match(Input,"(?<=Give\s+)[\d\s\w]+").Value

The part inside the double quotes is the regular expression; it changes depending on which data you want to extract.

Which data do you want to extract from the PDF?

Hope you understand!!

The paragraph looks like this:

Try to read the pdf by using Read PDF with OCR.
Drag and drop the Tesseract OCR in it, follow the below options for properties of OCR as below,
Select the Tesseract Scan option in Profile.
Give the Tesseract 2
value in Scale option.

"Tesseract" occurs n times in the paragraph, so treat "Tesseract" as the key and the text that follows it as the value.
I want to print all the values one by one in a Write Line activity.

So there are multiple matches in Input and you want to get each one, @Smitesh_Aher2.

Try the one below:

- Assign -> Matchedvalues = System.Text.RegularExpressions.Regex.Matches(Input,"(?<=Tesseract\s+).*")

If you want a single match, use the expression below, which returns the first match:
System.Text.RegularExpressions.Regex.Match(Input,"(?<=Tesseract\s+).*").Value

If you want all matches, use the expression below:
System.Text.RegularExpressions.Regex.Matches(Input,"(?<=Tesseract\s+).*")

Note: Matchedvalues is a variable of type System.Text.RegularExpressions.MatchCollection, the collection type that Regex.Matches returns and that holds every regex match found in the input. To get each item, use a For Each activity to iterate over Matchedvalues, and pass CurrentItem.ToString to a Write Line activity to print the values one by one.
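
As a rough illustration of the "all matches, then loop" idea, here is a small Python sketch run on the paragraph from this thread. It is only an analogue of the UiPath steps: re.findall with a capture group replaces Regex.Matches with the look-behind (Python's re module does not accept a variable-width look-behind), and the for loop plays the role of For Each plus Write Line. The variable names are made up.

```python
import re

# Paragraph from this thread, with "Tesseract" as the key.
ocr_text = (
    "Try to read the pdf by using Read PDF with OCR.\n"
    "Drag and drop the Tesseract OCR in it, follow the below options for properties of OCR as below,\n"
    "Select the Tesseract Scan option in Profile.\n"
    "Give the Tesseract 2\n"
    "value in Scale option."
)

# With one capture group, findall returns only the text after "Tesseract" on each line.
# "." does not match a newline, so every value ends at the end of its line.
matched_values = re.findall(r"Tesseract\s+(.*)", ocr_text)

# Equivalent of For Each + Write Line: print the values one by one.
for current_item in matched_values:
    print(current_item)
```

On this sample the loop prints three values: the rest of the second line, "Scan option in Profile." and "2".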

Check the image below; it shows 3 matches.

Hope you understand!!

Hi @mkankatala

System.Text.RegularExpressions.Regex.Matches(Input,"(?<=Tesseract\s+).*")

With the above expression I am getting 4 values in the Output panel, like below:

Try to read the pdf by using Read PDF with OCR.
OCR in it, follow the below options for properties of OCR as below,
Scan option in Profile.
2

I don’t want the highlighted part in the output.

Detailed flow to extract data from a screenshot-based PDF using UiPath:

  1. Open UiPath Studio: Start by creating a new process or opening an existing one where you want to extract data from the PDF.
  2. Install Required Packages:
  • Go to “Manage Packages” in UiPath Studio.
  • Install the UiPath.PDF.Activities package.
  • Install the OCR engine of your choice (like UiPath.TesseractOCR.Activities for Tesseract).
  3. Read PDF with OCR Activity:
  • Drag and drop the “Read PDF With OCR” activity into your sequence.
  • In the properties panel, set the “FileName” property to the path of your PDF file.
  • Place an OCR engine activity (e.g., Tesseract OCR) inside the “Read PDF With OCR” activity.
  • Configure the properties of the OCR engine (like “Scale” or “Languages”).
  4. Extract Data:
  • The “Read PDF With OCR” activity outputs its text to a string variable, which you should create and name (e.g., extractedText).
  • If you need to extract specific information (like dates, names, numbers), you can use an “Assign” activity with a Regex pattern to get the specific data.
    • Create a new string variable for the specific data (e.g., specificData).
    • Use System.Text.RegularExpressions.Regex.Match(extractedText, yourRegexPattern).Value to assign the value to specificData.
  5. Conditional Logic for Each Screenshot:
  • If there are two screenshots in the PDF and you only need data from one, use an “If” activity to determine which part of extractedText to process.
  • Use the “Substring” method or additional “Regex.Match” calls to isolate the part of the text you need from each screenshot.
  6. Output the Data:
  • Use “Write Line” activities to output the extracted data to the UiPath Output panel for verification.
  • Alternatively, use “Write Range” or “Write Cell” activities to write the data to an Excel file or other formats as needed.
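
Steps 4 and 5 can be sketched roughly as below. This is a Python illustration of the logic only, not UiPath code, and everything in the sample is an assumption: the "Invoice" and "Receipt" headings, the "Total:" field, and the helper names section_after and first_total are all invented. Your real OCR text will need its own marker and pattern.

```python
import re

# Hypothetical OCR output: the text of two screenshots in one string.
# The headings and the "Total:" field are invented for this sketch.
extracted_text = (
    "Invoice\n"
    "Total: 120\n"
    "Receipt\n"
    "Total: 95\n"
)


def section_after(text, marker):
    """Return the text from `marker` to the end, or '' if the marker is missing."""
    start = text.find(marker)
    return text[start:] if start != -1 else ""


def first_total(text):
    """Return the digits after 'Total:' in `text`, or '' if there is no match."""
    match = re.search(r"Total:\s*(\d+)", text)
    return match.group(1) if match else ""


# Only the second screenshot is needed, so cut the text at its heading first,
# then run the regex on that part only.
receipt_text = section_after(extracted_text, "Receipt")
specific_data = first_total(receipt_text)
print(specific_data)  # 95
```

Cutting the text first matters here because the same key appears in both screenshots; running the regex on the whole string would return the first screenshot's value instead.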

See the output below; I am getting only three matches. I don’t know why you are getting four, @Smitesh_Aher2.

Workflow:
[image]

Output:
[image]
