How to find a keyword in large PDF files

vigneshnkv · August 4, 2021, 9:20am

Hi,

My process is:
Search for an employee number in the PDF.
If found, extract that page as a separate PDF.

The PDFs are large, both in file size (200-500 MB) and in page count (30,000 to 60,000 pages).

Read PDF Text is taking too long to read the file (more than an hour or so).
I also tried opening the PDF and searching for the employee number with Ctrl+F. That also takes a long time (more than 1 hour for 60,000 pages).

Is there an easy way to extract a particular page from a large PDF within a few minutes when the keyword is found?

sonaliaggarwal47 · August 4, 2021, 4:47pm

Hi @vigneshnkv,

Is the employee number always going to be on the same page number?
If not, within what range of pages is it found? Limiting the read to that range of pages should help here.

Also, is this a native PDF, or does it also contain scanned images?

Regards
Sonali

vigneshnkv · August 4, 2021, 4:51pm

No. Each page has a different employee number, so I may need to find any employee number, on any page.

There is no range limit; it can be on any page.

The PDF is digital/native (readable), not scanned.

NIVED_NAMBIAR · August 4, 2021, 5:18pm

Hi @vigneshnkv
I would suggest reading the PDF page by page inside a loop.
In each iteration, check whether the current page contains that string.
If it does, save that page number and break out of the loop.
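
The loop described above can be sketched in Python roughly as follows. This is only an illustration, not the UiPath workflow itself: `first_page_with` and `sample_pages` are hypothetical names, and the page texts are assumed to come from whatever PDF reader is in use.

```python
from typing import Iterable, Optional


def first_page_with(page_texts: Iterable[str], keyword: str) -> Optional[int]:
    """Return the 1-based number of the first page containing keyword, else None."""
    for page_number, text in enumerate(page_texts, start=1):
        if keyword in text:
            # Stop at the first hit so the remaining pages are never read.
            return page_number
    return None


# Hypothetical page texts standing in for the output of a PDF reader.
sample_pages = ["Employee 1001 ...", "Employee 1002 ...", "Employee 1003 ..."]
print(first_page_with(sample_pages, "1002"))
```

Because the function accepts any iterable, the pages can be produced lazily, so pages after the match do not have to be extracted at all.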

I think this will save time compared with reading all pages of the PDF at once.

Give it a try.

Regards,
Nived N

vigneshnkv · August 6, 2021, 8:56am

I have tried this method as well.

Each page takes 8 seconds to check.
At that rate, it will take approximately 80-90 hours to check one PDF with 40,000 pages.

sonaliaggarwal47 · August 6, 2021, 1:31pm

Hi @vigneshnkv,

Considering the file size and your requirement, to be honest, I don't think there is a quicker solution to this.

I think it's fine if it takes an hour or so.

Some processes are bound to run longer by their nature.

Regards
Sonali

vigneshnkv · August 9, 2021, 12:34pm

Thanks.
As of now, I have the bot open the PDF and search for the data (as we would do manually):
Start Process - open the target file
Advanced Search - search for the keyword
Wait until the results come back
Once a match is found, get the page number and extract it

This is reasonably good in terms of time. If you have any ideas or thoughts on this method, kindly add them.

sonaliaggarwal47 · August 9, 2021, 2:46pm

Hi @vigneshnkv,

Thank you for sharing details.

Could you try the steps below:

Loop through each page separately.
Read PDF Text for that page only (under Range, specify the variable/counter that you are using to loop through the pages). If the PDF also contains images, use Read PDF With OCR instead.
Save its output.
Add logic to search the resulting text for the employee numbers you want.
If found, extract the page you are currently on; otherwise move to the next page.
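
For anyone who wants to try the same steps outside Studio, here is a minimal sketch using the pypdf Python library instead of the Read PDF Text activity. The function names, paths and the digit-boundary matching rule are assumptions made for illustration, and how fast it runs on a 40,000-page file has not been measured here.

```python
import re


def page_has_employee(text: str, employee_number: str) -> bool:
    """True if employee_number appears on its own, not inside a longer number."""
    pattern = r"(?<!\d)" + re.escape(employee_number) + r"(?!\d)"
    return re.search(pattern, text) is not None


def extract_matching_page(pdf_path: str, employee_number: str, out_path: str):
    """Write the first page mentioning employee_number to out_path.

    Returns the 1-based page number, or None if no page matched.
    """
    # Imported here so the matching helper above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""  # read this page only
        if page_has_employee(text, employee_number):
            writer = PdfWriter()
            writer.add_page(page)  # keep just the page we are currently on
            with open(out_path, "wb") as handle:
                writer.write(handle)
            return page_number
    return None  # no match: every page was checked
```

A call would look like `extract_matching_page("payslips.pdf", "12345", "employee_12345.pdf")`, where both file names are placeholders.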

Regards
Sonali

vigneshnkv · August 9, 2021, 2:49pm

Read PDF Text in a page-by-page loop:
Each page takes 8 seconds to check.
At that rate, it will take approximately 80-90 hours to check one PDF with 40,000 pages.

sonaliaggarwal47 · August 9, 2021, 2:52pm

@vigneshnkv,

Each page takes 8 seconds to check manually (by opening the PDF and searching), right?

Regards
Sonali

vigneshnkv · August 10, 2021, 8:39am

No.

The Read PDF Text activity takes 8 seconds per page when the file is large.

Opening the PDF and searching takes only 0.07 seconds (70 milliseconds). → I am going with this method now.
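
To put the two timings side by side, here is a small back-of-the-envelope calculation. It assumes the 0.07 seconds is also a per-page figure, which the post does not state explicitly; the function and constant names are made up for the example.

```python
def total_hours(pages: int, seconds_per_page: float) -> float:
    """Total run time in hours for a given per-page cost."""
    return pages * seconds_per_page / 3600


READ_PDF_TEXT_SECONDS = 8.0   # per page, as reported in this thread
VIEWER_SEARCH_SECONDS = 0.07  # assumed to be per page as well

for pages in (40_000, 60_000):
    slow = total_hours(pages, READ_PDF_TEXT_SECONDS)
    fast = total_hours(pages, VIEWER_SEARCH_SECONDS)
    print(f"{pages} pages: {slow:.1f} h with Read PDF Text, {fast:.2f} h with viewer search")
```

For 40,000 pages this gives roughly 88.9 hours at 8 seconds per page, in line with the 80-90 hours estimated earlier, and under an hour at 0.07 seconds per page.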

sonaliaggarwal47 · August 10, 2021, 12:42pm

Hi @vigneshnkv,

Good to hear that. I just wanted you to try it and check how long that route takes.

So I think we are back to our earlier point: the approach currently being used is the right one, and it is bound to take a little longer because of the huge file size.

Now that your doubts are cleared, I would suggest marking a solution so this topic can be closed.

Regards
Sonali
