Extract PDF pages based on keywords

Hi All,

  • I have a PDF of about 160 pages (merged invoices; the page count may vary). I want to split it into multiple PDFs, one per invoice, based on keywords present in the PDF.

  • I'm currently using a For Each loop that goes through every PDF page and searches for the keyword. This takes around 25 minutes.

Is it possible to reduce the time by using a regex instead of loops?

Please help me.

Thanks & Regards,
Navya.

Hi @Navya_Nadakuduti

One thing you can try: read all the PDF pages at once. If there is a reliable value such as an invoice header, footer, or page-number field that appears on every page, split the string on that value. Then use System.Text.RegularExpressions.Regex.Match(EachpageString, "regex for the string you are searching for").ToString on each part and discard the parts that are not needed. This way you interact with the PDF only once, to read it, and everything else is done on the string you have already read.
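To illustrate the idea outside UiPath, here is a minimal Python sketch of "read once, split on a marker, then regex each part". It assumes the whole document has already been read into one string; the marker "Tax Invoice", the order ID pattern, the sample text, and the helper names split_on_marker and find_keyword are all made up for the example and are not taken from the attached files.

```python
import re

def split_on_marker(full_text, marker_pattern):
    """Split one big string into chunks, each starting at a marker match.

    Any text before the first marker is dropped. If the marker never
    occurs, the whole text is returned as a single chunk.
    """
    starts = [m.start() for m in re.finditer(marker_pattern, full_text)]
    if not starts:
        return [full_text]
    bounds = starts + [len(full_text)]
    return [full_text[a:b] for a, b in zip(bounds, bounds[1:])]

def find_keyword(chunk, keyword_pattern):
    """Return the first match of keyword_pattern in chunk, or None."""
    m = re.search(keyword_pattern, chunk)
    return m.group(0) if m else None

# Hypothetical text, standing in for the string read from the PDF once.
sample_text = (
    "Tax Invoice\nOrder ID: OD111\nTotal: 250\n"
    "Tax Invoice\nOrder ID: OD222\nTotal: 90\n"
)

chunks = split_on_marker(sample_text, r"Tax Invoice")
order_ids = [find_keyword(c, r"OD\d+") for c in chunks]
print(order_ids)
```

The same shape should carry over to the workflow: one read of the PDF, one split, and one Regex.Match per part, instead of opening the PDF once per page.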

cheers

Hi Anil,

Below is the workflow I have created, which takes a very long time. In some cases an invoice may span 2 pages.

Main.xaml (63.4 KB)
flipkart_invoices.pdf (465.9 KB)
project.json (1.5 KB)

Please help me.

Thanks & Regards,
Navya

Step 1: Import the required libraries. Step 2: Convert the PDF file to text and read the data. Step 3: Use the findall() function of regular expressions to extract the keywords.
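A rough Python sketch of steps 2 and 3 follows. It assumes the PDF has already been converted to a list of per-page strings (the conversion itself is left to whichever PDF-to-text tool you use). The "Invoice No:" pattern, the sample pages, and the helper name group_pages are hypothetical. A page that contains the pattern is treated as the start of a new invoice, and a page without it is attached to the previous invoice, which covers the 2-page case mentioned above.

```python
import re

def group_pages(page_texts, start_pattern):
    """Return one (first_page, last_page) tuple per invoice, 1-based.

    A page matching start_pattern opens a new invoice; any other page
    extends the invoice that is currently open.
    """
    groups = []
    for page_no, text in enumerate(page_texts, start=1):
        if re.findall(start_pattern, text) or not groups:
            groups.append([page_no, page_no])
        else:
            groups[-1][1] = page_no
    return [tuple(g) for g in groups]

# Hypothetical per-page text, standing in for the converted PDF.
pages = [
    "Tax Invoice\nInvoice No: FA001",
    "Tax Invoice\nInvoice No: FA002",
    "continued items\nGrand Total 500",
    "Tax Invoice\nInvoice No: FA003",
]

# All invoice numbers in one pass over the full text.
invoice_numbers = re.findall(r"Invoice No:\s*(\w+)", "\n".join(pages))

# Page ranges to hand to whatever tool writes the split PDFs.
ranges = group_pages(pages, r"Invoice No:\s*\w+")
print(invoice_numbers, ranges)
```

The resulting page ranges can then be passed to a page-range extraction step, so the PDF is only touched once for reading and once per invoice for writing.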

Regards,
Will

Hi William,

Thanks for the solution.

Can you please elaborate on the above solution?

Best Regards,
Navya.

Hi @Navya_Nadakuduti

Can you try this?

trypdf.xaml (10.1 KB)

cheers