PDF extraction from multiple files

Hi,

I need to extract data from PDF files depending on the engine model name. I am able to do it for a single file but not for multiple files. Please help.
Sequence pdf3-.zip (7.5 KB)

Thanks & Regards,
Lakshmi

Hello @lakshmi.mp

Can you share screenshots of the two PDFs here?

Also, is the format static? If yes, you can use the Read PDF Text activity and a regex to fetch the required data. Otherwise, you can open each PDF with Use Application/Browser and then use Get Text to extract the required data.

Hi @Rahul_Unnikrishnan ,

The PDF files are not static. Each file contains some model numbers, and the extraction depends on that number. The model numbers and extraction details are stored in a Switch case.
I have attached my workflow above, please take a look.
Depending on a condition, the files are then moved into a folder.

Thanks,
Lakshmi

Hi @Rahul_Unnikrishnan

test.txt (399 Bytes)

Here I have attached the text file of PDF page 1, from which I need to extract the engine model, TTSN, TFSN, and TSSN.

Thanks ,
Lakshmi

Please try arrayVar = Directory.GetFiles("filepath")

Then use a For Each to iterate over the array. Each item will be a file path; use Read PDF with OCR and pass the item as the input.
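
As a rough sketch of that loop in plain VB.NET: the folder path and the names `PdfFileLoop` and `ListPdfFiles` are hypothetical, and the actual PDF reading is left to the Read PDF activity inside the workflow's For Each.

```vbnet
Imports System
Imports System.IO

Module PdfFileLoop
    ' Returns the paths of the PDF files directly inside folderPath, sorted by name.
    Function ListPdfFiles(folderPath As String) As String()
        Dim files As String() = Directory.GetFiles(folderPath, "*.pdf")
        Array.Sort(files, StringComparer.OrdinalIgnoreCase)
        Return files
    End Function

    Sub Main()
        ' Hypothetical input folder; replace with the real one.
        Dim folderPath As String = "C:\Input\Pdfs"
        If Not Directory.Exists(folderPath) Then
            Console.WriteLine("Folder not found: " & folderPath)
            Return
        End If

        For Each filePath As String In ListPdfFiles(folderPath)
            ' In the workflow, this is where Read PDF Text / Read PDF with OCR
            ' would receive filePath as its file name input.
            Console.WriteLine(filePath)
        Next
    End Sub
End Module
```

Filtering on "*.pdf" keeps other files in the same folder out of the loop, so the extraction logic only ever sees PDF paths.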

Hi,

Can anyone help me extract the values from the text file above? I tried using a lookahead, but only the engine model and TTSN values are extracted; I am not able to extract the TFSN and TSSN values.

Thanks,
Lakshmi

Hi @lakshmi.mp ,

As you mentioned, there are different files. Does that mean the format is different in each of them? If so, is the set of expected formats known, or is it unknown or open-ended?

Once we have these details, we can work out the pattern required for the extraction. Is there anything in the text file that is constant, or that sits at a fixed position relative to the engine model values?

It would also help if you could highlight the values to be extracted.

Hi @supermanPunch ,

The values highlighted below need to be extracted:
test.txt (399 Bytes)


There are many files in the same pattern. When I try to extract, the workflow returns blank data when executed.

Thanks,
Lakshmi

Hi @supermanPunch
Seq pdf3.zip (5.9 KB)
This is my workflow. Please take a look.

@lakshmi.mp ,

If the format is going to be the same throughout all PDF files, then we have the below expression for the Engine Model :

(?<=Engine Model\n).*

System.Text.RegularExpressions.Regex.Match(pdfText,"(?<=Engine Model\n).*",System.Text.RegularExpressions.RegexOptions.IgnoreCase).Value.ToString

TTSN, TFSN, TSSN :

(?<=TTSN\s)(.*)TFSN(.*)TSSN(.*)

We could use one expression to extract the TTSN, TFSN and TSSN values as groups.
However, there appear to be multiple values for these fields. If all of them are required, we would need to use Matches instead of Match, then iterate and fetch the values.
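
A minimal sketch of the Matches approach, using the same pattern. The sample lines are hypothetical (the attached text file is not reproduced here), and `TimeFieldRows` / `ExtractRows` are illustrative names. Each group is trimmed because `.*` also captures the surrounding spaces and, with Windows line endings, a trailing carriage return.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module TimeFieldRows
    ' Returns one {TTSN, TFSN, TSSN} array for every line the pattern matches.
    Function ExtractRows(pdfText As String) As List(Of String())
        Dim pattern As String = "(?<=TTSN\s)(.*)TFSN(.*)TSSN(.*)"
        Dim rows As New List(Of String())
        For Each m As Match In Regex.Matches(pdfText, pattern, RegexOptions.IgnoreCase)
            rows.Add(New String() {
                m.Groups(1).Value.Trim(),
                m.Groups(2).Value.Trim(),
                m.Groups(3).Value.Trim()})
        Next
        Return rows
    End Function

    Sub Main()
        ' Hypothetical sample with two rows of values.
        Dim sample As String = "TTSN 100 TFSN 200 TSSN 300" & vbCrLf &
                               "TTSN 110 TFSN 210 TSSN 310"
        For Each row As String() In ExtractRows(sample)
            Console.WriteLine(String.Join(" | ", row))
        Next
    End Sub
End Module
```

Because `.` does not match a line feed, each match stays on one line, so every row of values comes back as its own Match.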

To extract the first match, we could use Regex.Match as below:
TTSN :

System.Text.RegularExpressions.Regex.Match(pdfText,"(?<=TTSN\s)(.*)TFSN(.*)TSSN(.*)",System.Text.RegularExpressions.RegexOptions.IgnoreCase).Groups(1).Value.ToString

TFSN :

System.Text.RegularExpressions.Regex.Match(pdfText,"(?<=TTSN\s)(.*)TFSN(.*)TSSN(.*)",System.Text.RegularExpressions.RegexOptions.IgnoreCase).Groups(2).Value.ToString

TSSN :

System.Text.RegularExpressions.Regex.Match(pdfText,"(?<=TTSN\s)(.*)TFSN(.*)TSSN(.*)",System.Text.RegularExpressions.RegexOptions.IgnoreCase).Groups(3).Value.ToString

Check the above expressions and let us know if they don't work.

Hi @supermanPunch ,

I am not able to extract the engine model. The regex builder highlights it, but in the workflow the result is blank.
Sequence.zip (1.8 KB)

The workflow is attached, please take a look.

Thanks,
Lakshmi

@lakshmi.mp ,

A small modification to the Expression :

(?<=Engine Model\r?\n).*

Could you check with the above expression and let me know if it works?
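
For reference, a small sketch of why the `\r?` may matter, assuming the text read from the PDF uses Windows line endings (carriage return plus line feed). The sample text and the names `EngineModelLineEndings` / `ExtractAfterLabel` are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module EngineModelLineEndings
    ' Returns the trimmed first match of pattern, or an empty string if nothing matches.
    Function ExtractAfterLabel(pdfText As String, pattern As String) As String
        Return Regex.Match(pdfText, pattern, RegexOptions.IgnoreCase).Value.Trim()
    End Function

    Sub Main()
        ' Hypothetical sample with CR+LF line endings.
        Dim sample As String = "Engine Model" & vbCrLf & "AB6L-3AZ" & vbCrLf & "Serial No"

        ' With \n only, the carriage return sits between "Model" and the line feed,
        ' so the lookbehind never matches and the result is empty.
        Console.WriteLine("[" & ExtractAfterLabel(sample, "(?<=Engine Model\n).*") & "]")

        ' With \r?\n the lookbehind accepts both line-ending styles.
        Console.WriteLine("[" & ExtractAfterLabel(sample, "(?<=Engine Model\r?\n).*") & "]")
    End Sub
End Module
```

The Trim call is there because `.*` stops at the line feed but still captures the carriage return at the end of the line.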

Hi @supermanPunch ,

The above expression works for 2 files but not for this file. I need to extract only AB6L-3AZ, but it returns AB6L-3AZ Build Spec.
test.txt (286 Bytes)
Can we pass 2 regular expressions to extract a single word? Please help.

thanks,
Lakshmi

Hi,

Can we pass 2 regular expressions to extract a single word? I am facing difficulty extracting the engine model. Please help.

thanks,
Lakshmi

@lakshmi.mp , If only a single word after Engine Model needs to be extracted, could you split the extracted value on spaces and take only the first element?

We can do it like below :

Split(extractedValue)(0).ToString.Trim

Could you check it in this way ?
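
A slightly more defensive sketch of the same idea, assuming the extracted value may start with a space or end with a carriage return (in which case the first element of a plain Split can come back empty). `FirstWordHelper` and `FirstWord` are hypothetical names.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module FirstWordHelper
    ' Returns the first whitespace-separated word of value, or an empty string.
    Function FirstWord(value As String) As String
        Dim separators As Char() = {" "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}
        Dim parts As String() = value.Split(separators, StringSplitOptions.RemoveEmptyEntries)
        Return If(parts.Length > 0, parts(0), String.Empty)
    End Function

    Sub Main()
        Console.WriteLine(FirstWord("AB6L-3AZ Build Spec"))
        Console.WriteLine(FirstWord(" AB6L-3AZ Build Spec" & vbCr))
    End Sub
End Module
```

RemoveEmptyEntries drops the empty strings produced by leading or repeated separators, so the first element is always an actual word when one exists.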

Hi @supermanPunch ,


I need to extract the engine model (the highlighted part).
I tried the expressions (?<=Engine Model\r?\n).* and Split(extractedVar)(0).ToString.Trim but was not able to extract it.
It could be extracted with a positive lookbehind, but the files vary, so I am not able to use a single positive lookbehind.
In the previous files the engine model number was on the next line, but in this file it is on the same line as the label.
Please help.

Thanks,
Lakshmi

Hi,

(?<=Engine Model\r?\n).* [This expression works for all files except 2 files]
(?<=Engine Model ).* => This expression works for those 2 files. How can I combine the 2 expressions to extract a single word?
Please help.

Regards,
Lakshmi
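
One possible way to combine the two expressions, offered as a sketch rather than a confirmed fix: .NET regex allows variable-length lookbehinds, so `(?<=Engine Model\s+)\S+` accepts either a space or a line break after the label and then takes only the first word. The assumptions are that the label is always followed by whitespace and that the model number contains no spaces; `EngineModelCombined` and `ExtractEngineModel` are hypothetical names, and the sample texts are modelled on the values quoted in this thread.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module EngineModelCombined
    ' Returns the first word after "Engine Model", whether it follows on the
    ' same line or on the next line; returns an empty string if nothing matches.
    Function ExtractEngineModel(pdfText As String) As String
        Dim pattern As String = "(?<=Engine Model\s+)\S+"
        Return Regex.Match(pdfText, pattern, RegexOptions.IgnoreCase).Value
    End Function

    Sub Main()
        ' Model number on the next line.
        Console.WriteLine(ExtractEngineModel("Engine Model" & vbCrLf & "AB6L-3AZ" & vbCrLf & "Serial No"))
        ' Model number on the same line, followed by other words.
        Console.WriteLine(ExtractEngineModel("Engine Model AB6L-3AZ Build Spec"))
    End Sub
End Module
```

A more literal combination of the two patterns is `(?<=Engine Model(?:\r?\n| )).*`, which still returns the rest of the line, so the Split step would be needed afterwards to keep only the first word.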