How to get policy numbers from a single PDF containing 2000 pages

Hi,

I need to extract policy numbers from a PDF file.

I used Read PDF With OCR (Tesseract OCR), reading only 15 of the 2000 pages.

But I am not getting the policy number.

[screenshot: GET OCR PROPERTIES]

@allurai_india

Could you please give more details about the issue?

Is the policy number available on all pages, or only on one page?

The policy number is available on all the pages.

Hi @allurai_india

Once you have the text, you can extract the policy number with a regular expression.

Thanks
Ashwin.S

Okay.

1) I ran “Read PDF With OCR”.
2) I stored all 15 pages in the variable “PLAN” (String).
3) This is my regex:
Input: PLAN
Pattern:
“((?<=PLAN NAME::).*(?=POLICY #::))”

But I am getting the following error:

[screenshot of the error message]

@allurai_india Can you share an example string and the expected output for that string?


The string above has been saved in “PLAN”.

I also tried the pattern below for POLICY #:

But no luck.

“((?<=POLICY #:).*(?=SERVICING AGENT:))”

@allurai_india Can you share the format with dummy data? It’s difficult to write a regex from a screenshot.

Please find the dummy data below.

I tried to find the policy number with the pattern “POLICY #:\D(.*)”

But no luck.

*******Life Company
PO Box ****************************
Phone ******* Fax *******
ANNUAL REPORT
From NOVEMBER 15, 2017 to NOVEMBER 14, 2018
********** .
******** PLAN NAME: *
************* POLICY #: *
SERVICING AGENT: *
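
One possible reason the lookaround patterns above find nothing, assuming the OCR text keeps each label on its own line as in this dummy data: by default `.` does not match a line break, so `.*` can never reach from `POLICY #:` to `SERVICING AGENT:` on the next line. Here is a minimal Python sketch of that behaviour (the sample values are made up; in .NET the equivalent switch would be `RegexOptions.Singleline`):

```python
import re

# Hypothetical OCR text in the same layout as the dummy data; values are invented.
text = (
    "ANNUAL REPORT\n"
    "PLAN NAME: SAMPLE PLAN\n"
    "POLICY #: 123456789\n"
    "SERVICING AGENT: J DOE\n"
)

# "." stops at the end of the line, so the lookahead on the next line is never reached.
strict = re.search(r"(?<=POLICY #:).*(?=SERVICING AGENT:)", text)
print(strict)

# With re.DOTALL "." also matches newlines; the lazy ".*?" stops at the first label.
relaxed = re.search(r"(?<=POLICY #:).*?(?=SERVICING AGENT:)", text, re.DOTALL)
print(relaxed.group(0).strip())
```

The first search returns no match, while the second one captures the text between the two labels, which still needs trimming.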

Hi @allurai_india,

Try splitting the string on the label, like PLAN.Split(New String() {"POLICY #:"}, StringSplitOptions.None)(1), and assign the result to a string variable. Index 1 is the text that follows the first “POLICY #:”; index 0 would be the text before it.

I assume policy numbers have a fixed number of characters. If so, you can trim that variable and use Substring to read its first X characters.

This is just one way to get the desired text from the PDF content. There may be other ways too 🙂
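
The split-then-substring idea can be sketched in Python as below. This is only an illustration: the function name, the sample text, and the fixed length of 9 characters are assumptions, not something taken from the real PDF.

```python
def policy_after_label(text: str, label: str = "POLICY #:", length: int = 9) -> str:
    """Return the first `length` characters that follow the first `label`."""
    parts = text.split(label)
    if len(parts) < 2:
        # The label does not occur in the text at all.
        return ""
    # parts[1] is everything after the first label; drop surrounding whitespace.
    return parts[1].strip()[:length]


# Hypothetical sample text; the values are invented.
sample = "PLAN NAME: SAMPLE PLAN\nPOLICY #: 123456789\nSERVICING AGENT: J DOE\n"
print(policy_after_label(sample))
```

This only returns the first policy number in the text, so it suits a page-by-page approach better than one large string.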

Thank you
VJ


Thanks.

I am able to get only one policy number, with the following pattern:

POLICY #:\D(.*)

Since the policy number is repeated more than 1000 times, I would like to get all the matches as a group:

(?(POLICY #:\D(.*)))/g

Hi @allurai_india

If you are still looking for a solution, you can try the following regex pattern: POLICY\s*#:\s*\b\d+\b

Definition:
POLICY  # match the literal word POLICY
\s*     # match whitespace, zero or more characters
#:      # match the literal characters "#:"
\s*     # match whitespace after "#:", zero or more characters
\b      # word boundary: start of the number
\d+     # match one or more digits
\b      # word boundary: end of the number

After that, the Result variable of the Matches activity in UiPath should hold all the matches.
Use a For Each loop to iterate through each match and use it further in your workflow.
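
To show what "all the matches" looks like outside UiPath, here is a small Python sketch of the same pattern. The capture group around `\d+`, the function name, and the sample text are additions for illustration, and it assumes policy numbers consist of digits only.

```python
import re

# Same pattern as above, with a capture group added around the digits.
POLICY_PATTERN = re.compile(r"POLICY\s*#:\s*\b(\d+)\b")


def find_policy_numbers(text: str) -> list[str]:
    """Return every policy number found in the text, in order of appearance."""
    return POLICY_PATTERN.findall(text)


# Hypothetical text for two pages; the values are invented.
pages = (
    "PLAN NAME: PLAN A\nPOLICY #: 1001\nSERVICING AGENT: J DOE\n"
    "PLAN NAME: PLAN B\nPOLICY #:1002\nSERVICING AGENT: A ROE\n"
)

for number in find_policy_numbers(pages):
    print(number)
```

Each item of the returned list corresponds to one match in the Result collection.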

Thank you
VJ


Thanks.

I am able to get only one policy number, with the following pattern:

POLICY #:\D(.*)

Since I have policy numbers on more than 2000 pages, I would like to group them.

Right now this expression gives only the first policy number.

BTW, the “Result” variable is iEnumResult, and I am assigning iEnumResult(0).ToString() to strMetric.


Hi,

Is there any way to increment the index of the result like this? I tried to use an integer variable in a loop, iEnumResult(i).ToString(), but it failed.

iEnumResult(0).ToString()
iEnumResult(1).ToString()
iEnumResult(2).ToString()
iEnumResult(3).ToString()
iEnumResult(4).ToString()

Thanks for the idea about Matches.

I have finally finished my task after 3 days of struggle.

I put iEnumResult(i).ToString() in a for loop and added each value to a data table.

Then I exported the table to a CSV file. Now I can pull all the policies from the 2000 pages in one go.
Thanks @vijayakumarkj, @indra
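
For anyone following the same route outside UiPath, the loop-and-export step might look roughly like this in Python. It is a sketch under assumptions: the function name, the column header `PolicyNumber`, and the sample text are invented, and an in-memory buffer stands in for the real CSV file.

```python
import csv
import io
import re

POLICY_PATTERN = re.compile(r"POLICY\s*#:\s*\b(\d+)\b")


def export_policies(text: str, out) -> int:
    """Write one CSV row per policy number to `out`; return the number of rows."""
    matches = list(POLICY_PATTERN.finditer(text))
    writer = csv.writer(out)
    writer.writerow(["PolicyNumber"])  # hypothetical column name
    # Index-based loop, mirroring iEnumResult(i) in the workflow.
    for i in range(len(matches)):
        writer.writerow([matches[i].group(1)])
    return len(matches)


# Hypothetical OCR text; the values are invented.
ocr_text = "POLICY #: 1001\nSERVICING AGENT: J DOE\nPOLICY #: 1002\n"
buffer = io.StringIO()
count = export_policies(ocr_text, buffer)
print(count)
print(buffer.getvalue())
```

To write a real file, the buffer can be swapped for `open("policies.csv", "w", newline="")`.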
