Extract from multiple PDFs

I have multiple PDF files with the same layout, and I have to extract 3-4 fields from each. The extraction has to be dynamic across all files. For example, in every file I have to extract a reference number, which could be any number. Suppose one file says “reference number for file one is 22. Check its validation” and another says “reference number given in file two is 32. Check its validation”. I have to extract the numbers 22 and 32. How can I do this for all files dynamically, with good exception handling? Should I use string manipulation, regex, or some other method? The PDF structure is the same, but the words between “reference number” and the number (22 or 32) can vary. How do I extract it?

To read the PDF, you can use the Read PDF With OCR activity.

Hello @ydash999. If it is a normal (text-based) PDF, you can read the data as text using the Read PDF Text activity and then apply regular expressions. If it is a scanned one, you can use Document Understanding to retrieve the data. If possible, can you share a sample PDF?

Hi @ydash999

Since the PDF structure is the same, we can use regex to extract the required field.

Try to build the regex so that it includes “reference number” and extracts only the number that follows it.

One such example was attached as a screenshot.
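The screenshot is not available here, so the following is only a rough Python sketch of the same idea, not the pattern the poster used. The pattern and the function name `extract_reference_number` are assumptions for illustration: it expects the phrase “reference number”, then any text that contains no digits, then the number itself, which is returned through a capture group.

```python
import re

# Assumed pattern: "reference number", then any run of non-digit characters
# (the words that vary between files), then the digits we want to capture.
REFERENCE_PATTERN = re.compile(r"reference\s+number\D*?(\d+)", re.IGNORECASE)


def extract_reference_number(text):
    """Return the first reference number in text as a string, or None if absent."""
    match = REFERENCE_PATTERN.search(text)
    return match.group(1) if match else None


samples = [
    "reference number for file one is 22. Check its validation",
    "reference number given in file two is 32. Check its validation",
]

for sample in samples:
    print(extract_reference_number(sample))
```

Because the in-between text is matched with `\D`, this sketch would pick up the wrong number if the varying words themselves contained digits (for example “file 2”); in that case the pattern would need a more specific anchor just before the number.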

Also, reduce your input string: try not to run the regex on the whole PDF text, but only on the part where the reference number is likely to occur.
You can split the whole PDF text into parts and use only the part where the reference number generally occurs.
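One possible way to narrow the input, sketched in Python under the assumption that the whole sentence containing the reference number sits on a single line of the extracted text. The helper names `candidate_lines` and `extract_from_candidates`, and the sample page text, are made up for this illustration.

```python
import re


def candidate_lines(full_text, keyword="reference number"):
    """Return only the lines that mention the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [line for line in full_text.splitlines() if keyword in line.lower()]


def extract_from_candidates(full_text):
    """Search only the candidate lines; return the first number found, or None."""
    for line in candidate_lines(full_text):
        match = re.search(r"reference\s+number\D*?(\d+)", line, re.IGNORECASE)
        if match:
            return match.group(1)
    return None


# Hypothetical extracted text: other numbers on the page are never searched.
page_text = """Invoice date 12/03/2021
Total amount 4500
reference number for file one is 22. Check its validation
Page 1 of 3"""

print(extract_from_candidates(page_text))
```

Filtering first keeps unrelated numbers (dates, totals, page numbers) out of the search. With OCR output the line breaks may fall in unexpected places, so the splitting rule would have to be checked against real files.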

It is a scanned PDF… I have to do this using string manipulation.

Thanks, bro. Between “reference” and the number, some other words might be added too… I have to extract only the number.
Your approach looks great. How do I do exception handling for this? What errors could come up if I deploy this to production?

Hi @ydash999

With regex, you may get a null reference exception if no value is extracted and you then try to assign the result to a variable or use the value.

Build it, test it, and you’ll get to know the exceptions that you have to deal with.
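The “no value extracted” case can be handled explicitly instead of letting it surface later as a null reference. Below is a minimal Python sketch of that idea; the exception class `ReferenceNotFound`, the function names, and the sample file names are all hypothetical. A failed match raises a dedicated error, and the batch loop records the failure per file and carries on with the rest.

```python
import re


class ReferenceNotFound(Exception):
    """Raised when no reference number can be located in a document's text."""


def extract_reference(text):
    """Return the reference number as an int, or raise ReferenceNotFound."""
    if not text or not text.strip():
        raise ReferenceNotFound("document text is empty")
    match = re.search(r"reference\s+number\D*?(\d+)", text, re.IGNORECASE)
    if match is None:
        # This is the situation that would otherwise show up as a null reference.
        raise ReferenceNotFound("no 'reference number' phrase followed by digits")
    return int(match.group(1))


def process_documents(texts_by_file):
    """Extract from every document; one bad file does not stop the others."""
    results, failures = {}, {}
    for name, text in texts_by_file.items():
        try:
            results[name] = extract_reference(text)
        except ReferenceNotFound as exc:
            failures[name] = str(exc)
    return results, failures


# Hypothetical inputs: file name mapped to already-extracted text.
documents = {
    "file_one.pdf": "reference number for file one is 22. Check its validation",
    "file_two.pdf": "reference number given in file two is 32. Check its validation",
    "blank_scan.pdf": "   ",
    "other.pdf": "This page has no matching phrase.",
}

results, failures = process_documents(documents)
print(results)
print(failures)
```

Keeping the failures in their own dictionary makes it easy to log them or route those files for manual review once the process runs in production.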


“.*” is not accepted inside the brackets. It shows this error: “a quantifier inside a lookbehind makes it non-fixed width”.
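Regex engines such as PCRE and Python's `re` require a lookbehind to have a fixed width, so a quantifier like `.*` inside `(?<=...)` is rejected. A common workaround is to drop the lookbehind and read the number from a capture group instead. The Python sketch below illustrates this; the two patterns are assumed examples, not the ones from the thread, and `compiles` is a helper written for this illustration.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern, else False."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Variable-width lookbehind: rejected by Python's re (fixed width is required).
lookbehind_pattern = r"(?<=reference number.*)\d+"

# Same intent without a lookbehind: the number is taken from capture group 1.
capture_pattern = r"reference number.*?(\d+)"

print(compiles(lookbehind_pattern))
print(compiles(capture_pattern))

text = "reference number given in file two is 32. Check its validation"
match = re.search(capture_pattern, text)
if match:
    print(match.group(1))
```

With the capture group, the whole match still includes the phrase “reference number”, but only group 1 (the digits) is used as the extracted value.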



In regex101 the error does not show up.
