Regex not working in READ PDF TEXT activity and Tesseract OCR reading numbers incorrectly in READ PDF WITH OCR in UiPath Studio

Issue 1:

I used READ PDF TEXT to read my PDF file. In the next step, I try to capture data using a regex. The pattern works fine in the regex builder, the Regex Storm .NET tester, and regex101.com, but it does not extract the required data from the text in the workflow.

Example:
For repair service: 123456789

New Charges
Usage Charges: $ 7.24

Regex Pattern used: (?<=Usage Charges:\s*)$\s*\d+.\d+

The above pattern should capture $ 7.24

Issue 2:

Tesseract OCR in the READ PDF WITH OCR activity reads 23.99 as 23.00. How do I fix this?

@Samaleti_Harichandana Welcome to our community family.

Here are some potential reasons why your regex might not be working as expected in UiPath Studio’s “READ PDF Text” activity, even though it works in other regex testers:

  • Whitespace Encoding: The whitespace characters in the PDF might be encoded differently from what your regex assumes. Try \s+ instead of \s* if at least one whitespace character always separates the label from the amount; both match the same kinds of whitespace and differ only in whether zero occurrences are allowed.
  • Hidden Characters: The PDF might contain hidden characters that are not visible in the text preview but still affect matching. For example, a zero-width character between the label and the amount is not matched by \s.
  • Text Extraction Issues: “READ PDF Text” may extract text slightly differently from other tools. Double-check that the text extracted in UiPath Studio is exactly the text you are testing in the other tools.
  • Newline Characters: Be mindful of newline characters (\n or \r\n) before or after the target data. You might need to adjust your regex accordingly.
You can also try this pattern, which escapes $ and . so they are matched literally:

(?<=Usage Charges:\s+)\$\s*(\d+\.\d+)
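
To check for the hidden or unusual whitespace characters mentioned above, one option is to save the extracted text to a file and inspect it outside UiPath. The following is only a minimal Python sketch, not a UiPath feature: the function name describe_unusual_chars and the sample string are hypothetical, and the sample deliberately contains a non-breaking space and a zero-width space for illustration.

```python
import unicodedata

def describe_unusual_chars(text):
    """List (index, code point, Unicode name) for every character that is
    not printable ASCII, a plain space, or a line feed."""
    findings = []
    for index, char in enumerate(text):
        if char in (" ", "\n") or 33 <= ord(char) <= 126:
            continue
        # Control characters such as \r or \t have no Unicode name.
        name = unicodedata.name(char, "UNNAMED")
        findings.append((index, f"U+{ord(char):04X}", name))
    return findings

# Hypothetical extracted text: a non-breaking space (U+00A0) after the colon
# and a zero-width space (U+200B) after the dollar sign.
sample = "Usage Charges:\u00a0$\u200b7.24"
for finding in describe_unusual_chars(sample):
    print(finding)
```

If the list is empty for your real extracted text, hidden characters are probably not the cause and the pattern itself is the more likely culprit.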

Please share a sample file and your code so that the UiPath Community can help you.

Hi,

I don’t think the pattern works in regex101 as posted.

It should be written as follows, because $ and . are special characters in regex and must be escaped to match literally.

(?<=Usage Charges:\s*)\$\s*\d+\.\d+

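
To illustrate why the escaping matters, here is a minimal Python sketch using the sample text from the question. It is only an illustration: UiPath uses the .NET regex engine, and Python's re module does not support a variable-length lookbehind such as (?<=Usage Charges:\s*), so the sketch assumes a capture group instead. The helper name extract_charge is hypothetical.

```python
import re

# Sample text from the question.
text = "For repair service: 123456789\n\nNew Charges\nUsage Charges: $ 7.24\n"

# Unescaped: `$` is an end-of-string anchor and `.` matches any character.
unescaped = re.compile(r"Usage Charges:\s*($\s*\d+.\d+)")

# Escaped: `\$` and `\.` match the literal dollar sign and decimal point.
escaped = re.compile(r"Usage Charges:\s*(\$\s*\d+\.\d+)")

def extract_charge(pattern, source):
    """Return the captured amount, or None when the pattern does not match."""
    match = pattern.search(source)
    return match.group(1) if match else None

print(extract_charge(unescaped, text))  # None: `$` cannot match mid-text
print(extract_charge(escaped, text))    # $ 7.24
```

The unescaped version fails because the anchor `$` can never be followed by digits, which is the same reason the original pattern finds nothing.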

If it still doesn’t work, can you share the extracted text as a file, for example by using the Write Text File activity?

Tesseract OCR in the READ PDF WITH OCR activity reads 23.99 as 23.00. How do I fix this?

For now, can you try OmniPage OCR? It runs in the local environment.

Regards,

@Samaleti_Harichandana is your issue solved?

Hi Mukesh,

Thanks for your response. I haven’t had a chance to try it yet. I will try again with all your points in mind and let you know whether it works.

Hi Yoichi,

I will try out your suggestions.

I tried OmniPage OCR, but it took a very long time just to read the PDF and never moved forward. I then tested Tesseract OCR with the Profile changed from Legacy (the default) to a different one, and after that change it extracted the data correctly.