How to extract data using 'Read PDF Text' and RegEx?
In some scenarios, OCR, ML, or other PDF activities do not extract data from a PDF as expected. If a specific value needs to be extracted from a PDF, an alternative is to combine the 'Read PDF Text' activity from the UiPath.PDF.Activities package with a regular expression (RegEx).
Perform the steps below:
- First, use the 'Read PDF Text' activity. It extracts all the text from the specified PDF file, which can then be saved in an output variable.
- After storing all the text in the output variable, extract the required value by matching it against a pattern written as a regular expression.
Consider the following example: extract the value that appears after the word "Total".
- Search for the 'Find Matching Patterns' activity and add it to the workflow, using the output variable from 'Read PDF Text' as its input.
- In the Properties panel, set the Pattern property to the following string value:
"Total\s*([\d.,]+)"
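This pattern matches the text 'Total' followed by zero or more whitespace characters (\s*), and then captures a sequence of one or more digits, periods, or commas ([\d.,]+). The parentheses create a capture group, so the value alone (without the word 'Total') is available from each match returned by the 'Find Matching Patterns' activity (named 'Matches' in some versions of the package).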
- For the Result property, create a new variable to hold the matches, for example matchResult.
- Add a 'Message Box' activity to verify that matchResult contains the correct value.
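
To check the pattern before wiring it into a workflow, it can help to try it on a sample string. The following is a minimal sketch in Python, used here only to illustrate how the pattern behaves. UiPath itself evaluates the pattern with the .NET regex engine, and the function name extract_total and the sample text are hypothetical, not part of the original steps.

```python
import re

# Same pattern as in the Pattern property above.
PATTERN = r"Total\s*([\d.,]+)"


def extract_total(pdf_text):
    """Return the value captured after 'Total', or None if nothing matches.

    Hypothetical helper for illustration; not a UiPath API.
    """
    match = re.search(PATTERN, pdf_text)
    # group(1) is the first capture group: the digits, periods, and commas.
    return match.group(1) if match else None


# Hypothetical sample standing in for the output of 'Read PDF Text'.
sample_text = "Invoice 001\nItem A 10.00\nTotal   1,234.56\nThank you"

print(extract_total(sample_text))   # prints 1,234.56
print(extract_total("Total: 100"))  # prints None (the colon is not allowed by the pattern)
```

The second call shows a limitation: the pattern only allows whitespace between 'Total' and the number, so a colon or currency symbol in between prevents a match, and the pattern would need to be adjusted for such layouts. In UiPath, assuming matchResult holds the output of the activity, the captured value would typically be read with an expression such as matchResult(0).Groups(1).Value.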