Hi Team,
I have a PDF made of screenshots (two screenshots in one PDF) and want to extract specific data from it. Please help: how can I extract the data?
Try reading the PDF with the Read PDF with OCR activity.
Drag and drop the Tesseract OCR engine into it and set the OCR properties as follows:
→ Select the Scan option in Profile.
→ Set the Scale option to 2.
The output of Read PDF with OCR is stored in a String variable. You can use regular expressions to extract the values you need from that output.
Check the below image for better understanding,
Hope it helps!!
Thanks @mkankatala
All the data is extracted, but I want to extract a specific value.
Below is the paragraph from the screenshot, for example:
Try to read the pdf by using Read PDF with OCR.
Drag and drop the Tesseract OCR in it, follow the below options for properties of OCR as below,
→ Select the Scan option in Profile.
→ Give the 2 value in Scale option.
Here "Give" is the key and the remaining highlighted part is the value. How can I extract that value from the paragraph?
Okay @Smitesh_Aher2
Here is the regular expression.
"Give" is a keyword that is unique in the data, so the regex takes the digits, spaces and word characters that follow "Give".
Check the regular expression below:
Did you put the expression you sent in an Assign activity, after the Read PDF with OCR activity?
Can you please send a screenshot showing how to use the expression after the Read PDF with OCR activity?
Assume the Input variable holds the output of the Read PDF with OCR activity.
Use the expression below in an Assign activity:
- Assign -> Output = System.Text.RegularExpressions.Regex.Match(Input,"(?<=Give\s+)[\d\s\w]+").Value
Hope you understand!!
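For anyone who wants to try the pattern outside UiPath first, here is a minimal Python sketch of the same idea. It is only an illustration under a few assumptions: Python's `re` module does not accept a variable-width lookbehind such as `(?<=Give\s+)`, so the sketch uses a capture group instead; the sample text is the paragraph quoted in this thread; and the function name `extract_after_give` is made up for the example.

```python
import re

# Sample OCR text, copied from the paragraph quoted in this thread.
sample_text = (
    "Try to read the pdf by using Read PDF with OCR.\n"
    "Drag and drop the Tesseract OCR in it, follow the below options "
    "for properties of OCR as below,\n"
    "→ Select the Scan option in Profile.\n"
    "→ Give the 2 value in Scale option."
)


def extract_after_give(text: str) -> str:
    """Return the digits, spaces and word characters that follow 'Give'.

    A capture group stands in for the .NET lookbehind (?<=Give\\s+),
    because Python's re only allows fixed-width lookbehinds.
    """
    match = re.search(r"Give\s+([\d\s\w]+)", text)
    # Like .Value on a failed .NET match, return an empty string on no match.
    return match.group(1) if match else ""


print(extract_after_give(sample_text))  # the 2 value in Scale option
```

The match stops at the final full stop because `.` is not part of the `[\d\s\w]` character class.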
Yes @mkankatala, now I understand.
Can we use the same expression for the paragraph below?
Try to read the pdf by using Read PDF with OCR.
Drag and drop the Tesseract OCR in it, follow the below options for properties of OCR as below,
→ Select the Scan option in Profile.
→ Give the 2 value in Scale option .
I want to extract the remaining part after the highlighted word (Tesseract).
The expression stays the same; only the regular expression pattern changes.
This is the expression:
System.Text.RegularExpressions.Regex.Match(Input,"(?<=Give\s+)[\d\s\w]+").Value
The highlighted part within the double quotes is the regular expression. It changes depending on which data you want to extract.
Which data do you want to extract from the PDF?
Hope you understand!!
The paragraph looks like this:
Try to read the pdf by using Read PDF with OCR.
Drag and drop the Tesseract OCR in it, follow the below options for properties of OCR as below,
Select the Tesseract Scan option in Profile.
Give the Tesseract 2
value in Scale option.
The word Tesseract appears n times in the paragraph, so assume Tesseract is the key and the text that follows it is the value.
I want to print all the values one by one in a Write Line activity.
Let's say, as in your case, there are multiple matches in the Input and we want to get each one… @Smitesh_Aher2
Try the one below:
- Assign -> Matchedvalues = System.Text.RegularExpressions.Regex.Matches(Input,"(?<=Tesseract\s+).*")
If we want a single match, use the expression below, which returns the first match:
System.Text.RegularExpressions.Regex.Match(Input,"(?<=Tesseract\s+).*").Value
If we want all matches, use the expression below:
System.Text.RegularExpressions.Regex.Matches(Input,"(?<=Tesseract\s+).*")
Note: Matchedvalues should be a variable of type System.Text.RegularExpressions.MatchCollection, which is what Regex.Matches returns. It is a collection that holds every match of the regex in the input data. To get each item, use a For Each activity to iterate over Matchedvalues, and pass CurrentItem.ToString to a Write Line activity to print the values one by one.
Check the image below; it shows 3 matches.
Hope you understand!!
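The "all matches, printed one by one" flow above can also be sketched in Python for quick experimentation. Same caveats as before: this is an illustration rather than the UiPath workflow itself, a capture group replaces the variable-width lookbehind that Python's `re` does not support, the sample text is the paragraph quoted in this thread, and the helper name `values_after_keyword` is hypothetical.

```python
import re

# Sample OCR text, copied from the paragraph quoted in this thread.
sample_text = (
    "Try to read the pdf by using Read PDF with OCR.\n"
    "Drag and drop the Tesseract OCR in it, follow the below options "
    "for properties of OCR as below,\n"
    "Select the Tesseract Scan option in Profile.\n"
    "Give the Tesseract 2\n"
    "value in Scale option."
)


def values_after_keyword(text: str, keyword: str) -> list[str]:
    """Return the rest of the line after every occurrence of keyword."""
    # re.escape keeps the keyword literal; (.*) stops at the end of the
    # line because '.' does not match a newline by default.
    pattern = re.escape(keyword) + r"\s+(.*)"
    return [m.group(1) for m in re.finditer(pattern, text)]


# Equivalent of For Each + Write Line: print each value on its own line.
for value in values_after_keyword(sample_text, "Tesseract"):
    print(value)
# OCR in it, follow the below options for properties of OCR as below,
# Scan option in Profile.
# 2
```

If the OCR text uses Windows line endings, each value may end with a carriage return, so trimming the values before printing can help.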
Hi @mkankatala
System.Text.RegularExpressions.Regex.Matches(Input,"(?<=Tesseract\s+).*")
Using the above expression, I am getting 4 values in the Output panel, like below:
Try to read the pdf by using Read PDF with OCR.
OCR in it, follow the below options for properties of OCR as below,
Scan option in Profile.
2
I don't want the highlighted part in the output.
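Detailed flow to extract data from a screenshot-based PDF using UiPath:
- Install the UiPath.PDF.Activities package.
- Install an OCR engine package (for example, UiPath.TesseractOCR.Activities for Tesseract).
- Read the PDF with the Read PDF with OCR activity and store the output in a String variable (for example, extractedText).
- Create a String variable for the result (for example, specificData).
- Use an Assign activity with System.Text.RegularExpressions.Regex.Match(extractedText, yourRegexPattern).Value to assign the value to specificData.
- Use extractedText as the input text to process.

See the output below; I am getting only three matches. I don't know why you are getting four… @Smitesh_Aher2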
workflow -
Output -