Need to find a word in a PDF file and, if that word exists, remove that page and save the other pages

Hi,

I am opening a PDF using a link in the portal, and it opens directly in Chrome without being downloaded to the local machine. My question is this:

I want to check for the word “Invoice” in the PDF pages. If it exists on any page, that page should be removed and the PDF downloaded to the local machine.

If that is not possible in the browser, please tell me how to do it once the PDF is on the local machine.

Hi @maddy99

Please check the link above; it might help you!

@maddy99 - Do you want to do this in StudioX or Studio?

It is in Studio

Sure, thanks! I will check and update you.

Hi @maddy99… If you are allowed to use BalaReva activities, I can suggest a solution… Please let me know…

@maddy99 - Here is a sample workflow…

  1. Use “Get PDF Page Count” to get the total number of pages, say IntPageCount.
  2. In a For Each loop, iterate over Enumerable.Range(1, IntPageCount). This loops through all the pages in the PDF…
  3. Inside the For Each, first use “Read PDF Text” and output the text to StrPDFText.
  4. Next, use an Assign: Match = System.Text.RegularExpressions.Regex.IsMatch(StrPDFText, "Invoice", System.Text.RegularExpressions.RegexOptions.IgnoreCase)

Match is a Boolean variable. Here I am looking for the word “Invoice”.

  5. If Match is true, the search word was found on that page, so do nothing or print anything you want. In the Else branch, use “PDF Splitter” from BalaReva activities to split out that particular page, where the match was not found.

21 pages were split; page 22 has the word Invoice…

  6. Combine all the split pages using the “Join PDF Files” activity.

That’s it…Done…
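For anyone who wants to see the decision logic outside of Studio, here is a minimal Python sketch of the same idea. It is only an illustration, not the UiPath workflow itself: `page_texts` is a hypothetical list holding the text of each page (what Read PDF Text would return page by page), and `pages_to_keep` is a made-up helper name.

```python
import re


def pages_to_keep(page_texts, word="Invoice"):
    """Return the 1-based page numbers whose text does NOT contain `word`."""
    # Case-insensitive search, similar in spirit to RegexOptions.IgnoreCase.
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    keep = []
    for page_number, text in enumerate(page_texts, start=1):
        if pattern.search(text):
            # Match found: skip this page, it will not be in the output.
            continue
        # No match: this is a page we would split out and later join.
        keep.append(page_number)
    return keep


# Hypothetical page texts, only for demonstration.
sample_pages = ["Summary of charges", "INVOICE #123", "Terms and conditions"]
print(pages_to_keep(sample_pages))
```

The pages returned here correspond to the ones that PDF Splitter would extract and Join PDF Files would combine at the end.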

Hi @prasath17…I have installed Balareva pdf activities…

Hi @prasath17,

Sorry for the late reply…

Thank you for the flow…

I will check this and update you.

Hi @prasath17,

I have checked the workflow, but it reports that every page of the PDF has the word Invoice in it…

It gives the result True for all 10 pages…

But the word Invoice appears only on the 1st page…

The word Invoice is always in the top right corner.

@maddy99… What is the Range of the Read PDF Text activity? You should read page by page… For that, the For Each has an Index property… declare a variable, say IntIdx, and set it there… Note: the index always starts from 0, so in the Read PDF Text properties you have to set the Range to (IntIdx+1).ToString… Do the same for PDF Splitter.
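To illustrate why every page came back as True, here is a small Python sketch. It assumes that, without a Range, the read step returns the text of the whole document on each iteration (which is what the symptoms in this thread suggest). The names `page_texts`, `whole_document` and `has_word` are hypothetical and only used for this example.

```python
import re

# Hypothetical page texts: the word appears only on the first page.
page_texts = ["Invoice 001", "Line items", "Totals"]
whole_document = "\n".join(page_texts)


def has_word(text, word="Invoice"):
    """Case-insensitive check for `word` in `text`."""
    return re.search(re.escape(word), text, re.IGNORECASE) is not None


# No range set: every iteration checks the full document, so all are True.
without_range = [has_word(whole_document) for _ in page_texts]

# Range set per iteration: each check sees only the current page.
with_range = [has_word(page_texts[index]) for index in range(len(page_texts))]

# The loop index starts at 0 but pages start at 1, hence index + 1 as a string.
ranges = [str(index + 1) for index in range(len(page_texts))]

print(without_range)
print(with_range)
print(ranges)
```

The last list mirrors the (IntIdx+1).ToString expression: the zero-based index is shifted by one and converted to text before being used as the page range.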

I will share the workflow.

Hi @prasath17

It’s Working…Thanks for the help… Will mark it as Solution…

Hi @prasath17,

I am getting all the files in the final folder, without the invoice page being removed…

Should I give a range in the PDF Splitter activity?

@maddy99 - Please check this… Delete_PDFPage.zip (755.6 KB)

I have cleaned up the output for now… so if you run it, you will see 21 files get created in the splitted folder, and Final_output.pdf gets created outside it, in the project folder.

Hi @prasath17,

I think that when it goes to the Else branch, we are reading the whole PDF file and splitting it… Maybe that is the issue.

It is adding all the pages without removing the invoice page…

I have checked your code and the activities are missing…

Could you please help me…

Hi @maddy99… Did you get a chance to check my xaml? I guess you didn’t set up the range correctly; that’s the issue…

Your setup, Each Page, will split all the pages. You have to give a page range and use (index+1).ToString… which will split out only the page that does not contain the match…

If you are still unable to resolve, then can you share your xaml?

Hi @prasath17,

Completed now… I just hadn’t given the range… I’m sorry, I was a bit confused.

Thank you for the help!

hi @maddy99 … no problem…Glad it worked…

I purposely did not give the xaml initially, so that you could set it up yourself by looking at the screenshot. That way you understand what’s going on.

Now you should have the idea of how it works. It’s simple:

Read the PDF page by page → convert it to text → do a regex match → if it matches, ignore that page → else (no match), split that page → finally, combine all the split pages…

Instead of creating an additional counter variable, I used the index that comes with For Each, so that I don’t have to increment it. The index increments automatically on every iteration.

Yes!! @prasath17… I learned how to read, remove and split PDF pages… and got total clarity after checking the shared flow… Thank you so much for your valuable time…
