Regex not working in READ PDF TEXT activity and Tesseract OCR reading numbers incorrectly in READ PDF WITH OCR in UiPath Studio

Samaleti_Harichandana · May 8, 2024, 6:23pm

Issue 1:

I used the READ PDF TEXT activity to read my PDF file. In the next step, I try to capture data using a regex. The pattern works fine in the Regex Builder, the Regex Storm .NET tester, and regex101.com, but it is not extracting the required data from the text.

Example:
For repair service: 123456789

New Charges
Usage Charges: $ 7.24

Regex Pattern used: (?<=Usage Charges:\s*)$\s*\d+.\d+

The above pattern should capture $ 7.24

Issue 2:

Tesseract OCR in the READ PDF WITH OCR activity reads 23.99 as 23.00. How do I fix this?

mukesh.singh · May 8, 2024, 7:54pm

@Samaleti_Harichandana welcome to our community family.

Here are some potential reasons why your regex might not be working as expected in UiPath Studio’s “READ PDF Text” activity, even though it works in other regex testers:

Whitespace Encoding: The whitespace characters in the PDF might be encoded differently than what you’re assuming in your regex. Try using \s+ instead of \s* to require at least one whitespace character between the label and the value; both match the same set of whitespace characters.
Hidden Characters: The PDF might contain hidden characters that are not visible in the text preview but still affect the matching process. Experiment with a more permissive pattern between the label and the value (e.g., \s versus [\w\W], which matches any character).
Text Extraction Issues: There could be slight variations in how “READ PDF Text” extracts text compared to other tools. Double-check that the extracted text in UiPath Studio matches the text you’re testing in other tools.
Newline Characters: Be mindful of newline characters (\n) before or after the target data. You might need to adjust your regex accordingly.

(?<=Usage Charges:\s+)\$\s*(\d+\.\d+)

Please share a sample file and your code so the UiPath Community can help you.
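
One way to check for the hidden or unusual whitespace characters mentioned above is to list every character in the extracted text that is not plain printable ASCII. The following is only a minimal sketch in Python, assuming the output of Read PDF Text has been saved to a string. The sample string (with a non-breaking space inside the label) and the helper name `find_unusual` are made up for illustration and do not come from the original PDF.

```python
import unicodedata

def find_unusual(text):
    """Return (index, code point, Unicode name) for every character that is
    neither printable ASCII nor a CR/LF line break."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<control>"))
        for i, ch in enumerate(text)
        if not (" " <= ch <= "~" or ch in "\r\n")
    ]

# Hypothetical extracted text: the space in "Usage Charges" is a
# non-breaking space (U+00A0), which looks identical on screen.
sample = "Usage\u00a0Charges: $ 7.24"

for hit in find_unusual(sample):
    print(hit)

# A literal space in a pattern or search string does not match U+00A0.
print("Usage Charges" in sample)                          # False
print("Usage Charges" in sample.replace("\u00a0", " "))   # True
```

If a check like this reports anything between the label and the value, the pattern can either match that character explicitly or the text can be normalized before matching.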

Yoichi · May 8, 2024, 11:32pm

Hi,

I don’t think the pattern as posted works in regex101.

It should be as follows, because $ and . are special characters in regex.

(?<=Usage Charges:\s*)\$\s*\d+\.\d+

If it doesn’t work, can you share the extracted text as a file, for example using the WriteTextFile activity?
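
The effect of the missing escapes can be reproduced outside UiPath. The following is a minimal sketch in Python, not UiPath code. Python's `re` module rejects the variable-width lookbehind `(?<=Usage Charges:\s*)` that the .NET engine accepts, so the sketch swaps in a capture group for the value instead. The helper name `extract` and the sample string are assumptions for illustration.

```python
import re

# Sample text modelled on the lines quoted in the question.
text = "New Charges\nUsage Charges: $ 7.24\n"

# As posted: the unescaped $ is an end-of-input anchor, and the
# unescaped . matches any character rather than a literal dot.
unescaped = re.compile(r"Usage Charges:\s*($\s*\d+.\d+)")

# Escaped: \$ matches a literal dollar sign, \. a literal dot.
escaped = re.compile(r"Usage Charges:\s*(\$\s*\d+\.\d+)")

def extract(pattern, s):
    """Return the captured amount, or None when the pattern does not match."""
    m = pattern.search(s)
    return m.group(1) if m else None

print(extract(unescaped, text))  # None
print(extract(escaped, text))    # $ 7.24
```

The unescaped pattern finds nothing because `$` can only match at the end of the input, where no digits follow it.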

Regarding "Tesseract OCR in the READ PDF WITH OCR activity reads 23.99 as 23.00. How do I fix this?":

For now, can you try OmniPage OCR? It works in a local environment.

Regards,

mukesh.singh · May 9, 2024, 8:17pm

@Samaleti_Harichandana is your issue solved?

Samaleti_Harichandana · May 29, 2024, 1:23pm

Hi Mukesh,

Thanks for your response. I haven’t had the chance to try it yet. I will try again, keeping all your points in mind, and let you know if it works.

Samaleti_Harichandana · May 29, 2024, 1:26pm

Hi Yoichi,

I will try out your suggestions.

I tried OmniPage OCR, but it took a long time just to read the PDF and would not move forward. However, I tested Tesseract OCR again after changing the Profile from Legacy (the default) to a different one, and it extracted the data correctly after that change.
