How to get policy numbers from a PDF that contains 2000 pages

allurai_india · September 25, 2019, 5:30pm

Hi,

I need to extract policy numbers from a PDF file.

I used the Read PDF With OCR activity (Tesseract OCR), reading only 15 of the 2000 pages for now.

But I am not getting the policy number.

GET OCR PROPERTIES

lakshman · September 25, 2019, 5:34pm

@allurai_india

Could you please share more details about the issue?

Is the policy number available on every page, or only on one page?

allurai_india · September 25, 2019, 5:34pm

The policy number is available on every page.

AshwinS2 · September 25, 2019, 6:05pm

Hi @allurai_india

Based on the text, you can extract it with a regular expression.

Thanks
Ashwin.S

allurai_india · September 26, 2019, 10:24am

Okay.

1) I ran “Read PDF With OCR”.
2) I stored the text of all 15 pages in the variable “PLAN” (String).
3) I used the following regex:
INPUT: PLAN
Pattern:
“((?<=PLAN NAME::).*(?=POLICY #::))”

But I am getting the following error:

indra · September 26, 2019, 10:27am

@allurai_india Can you share an example string and the output you expect from it?

allurai_india · September 26, 2019, 10:42am

The string above has been saved in “PLAN”.

allurai_india · September 26, 2019, 11:02am

I tried the pattern below as well for POLICY #:

“((?<=POLICY #:).*(?=SERVICING AGENT:))”

But no luck.

indra · September 26, 2019, 11:22am

@allurai_india Can you share the format with dummy data? It’s difficult to write a regex from a screenshot.

allurai_india · September 26, 2019, 2:41pm

Please find the dummy data below.

I tried to find the policy number with the pattern “POLICY #:\D(.*)”, but no luck.

*******Life Company
PO Box ****************************
Phone ******* Fax *******
ANNUAL REPORT
From NOVEMBER 15, 2017 to NOVEMBER 14, 2018
********** .
******** PLAN NAME: *
************* POLICY #: *
SERVICING AGENT: *
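
One possible reason the lookaround attempt “(?<=POLICY #:).*(?=SERVICING AGENT:)” finds nothing: in the dummy data, SERVICING AGENT: sits on the line after POLICY #:, and "." does not match a line break by default. The sketch below shows this in Python rather than UiPath; the sample text and the policy number 12345678 are made up for illustration.

```python
import re

# Made-up sample modelled on the dummy data; "12345678" is a placeholder.
sample = (
    "******** PLAN NAME: SAMPLE PLAN\n"
    "************* POLICY #: 12345678\n"
    "SERVICING AGENT: J DOE\n"
)

pattern = r"(?<=POLICY #:).*(?=SERVICING AGENT:)"

# Default mode: "." stops at the line break, so the lookahead never succeeds.
no_match = re.search(pattern, sample)

# DOTALL lets "." cross line breaks (.NET calls this RegexOptions.Singleline).
match = re.search(pattern, sample, re.DOTALL)
policy = match.group(0).strip() if match else None

print(no_match)  # None
print(policy)    # 12345678
```

In .NET-based tools the equivalent switch is the Singleline regex option, or the inline flag (?s) at the start of the pattern.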

vijayakumarkj · September 26, 2019, 3:19pm

Hi @allurai_india,

Try splitting the string on the label, for example PLAN.Split(New String() {"POLICY #:"}, StringSplitOptions.None)(1), and assign the result to a string variable. Index 1 is the text that follows the first “POLICY #:”.

I believe policy numbers have a fixed number of characters, so you can then use Substring to read the first X characters of that variable.

This is just one way to get the desired text from the PDF content. There may be other ways too.
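
As a rough illustration of the split-then-substring idea, here is a minimal Python sketch (not UiPath). The sample text, the variable names, and the assumed length of 8 characters are all hypothetical; the real length depends on the actual policy format.

```python
# Hypothetical page text; the 8-character policy number is made up.
plan = "PLAN NAME: SAMPLE PLAN\nPOLICY #: 12345678\nSERVICING AGENT: J DOE\n"

POLICY_LENGTH = 8  # assumption: every policy number has this many characters

parts = plan.split("POLICY #:")
# parts[0] is the text before the label; parts[1] is the text after it.
after_label = parts[1].lstrip() if len(parts) > 1 else ""

# Take the first POLICY_LENGTH characters that follow the label.
policy_number = after_label[:POLICY_LENGTH]

print(policy_number)  # 12345678
```

This only picks up the first policy number. With the label repeated on every page, each element of the split result after the first would start with one policy number.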

Thank you
VJ

allurai_india · September 26, 2019, 3:45pm

Thanks.

I am able to get only one policy number with the following pattern:

POLICY #:\D(.*)

Since the policy number is repeated (more than 1000 times), I would like to group them:

(?(POLICY #:\D(.*)))/g

vijayakumarkj · October 3, 2019, 11:09am

Hi @allurai_india

If you are still looking for a solution, you can try the following regex pattern: POLICY\s*#:\s*\b\d+\b

Definition:
POLICY   # match the literal word POLICY
\s*      # match whitespace, zero or more
#:       # match the literal characters #:
\s*      # match whitespace after the colon, zero or more
\b       # word boundary, start of the number
\d+      # match one or more digits
\b       # word boundary, end of the number

After that, the Result variable of the Matches activity in UiPath should hold all the matches.
Use a For Each loop to iterate through the matches and use each one further in your workflow.
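
To see what this pattern returns when the label repeats, here is a small Python sketch (not the UiPath Matches activity itself). The sample text and policy numbers are invented. The second pattern, with a capture group around \d+, is an added variation that returns only the digits rather than the whole "POLICY #: ..." match.

```python
import re

# Invented text standing in for three pages of OCR output.
text = (
    "PLAN NAME: PLAN A\nPOLICY #: 10001\nSERVICING AGENT: A\n"
    "PLAN NAME: PLAN B\nPOLICY #:20002\nSERVICING AGENT: B\n"
    "PLAN NAME: PLAN C\nPOLICY  #: 30003\nSERVICING AGENT: C\n"
)

# The pattern from the post: each match includes the label and the digits.
full_matches = re.findall(r"POLICY\s*#:\s*\b\d+\b", text)

# Variation (an assumption, not from the post): capture only the digits.
numbers = re.findall(r"POLICY\s*#:\s*(\d+)", text)

print(full_matches)  # ['POLICY #: 10001', 'POLICY #:20002', 'POLICY  #: 30003']
print(numbers)       # ['10001', '20002', '30003']
```

Note that the pattern assumes policy numbers are digits only. If they can contain letters or dashes, \d+ would need to be widened accordingly.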

Thank you
VJ

allurai_india · October 3, 2019, 1:22pm

Thanks.

I am able to get only one policy number with the following pattern:

POLICY #:\D(.*)

Since I have policy numbers on more than 2000 pages, I would like to group them.

Right now this expression gives only the first policy number.

By the way, the “Result” variable is:
iEnumResult

and I am assigning “iEnumResult(0).ToString()” to “strMetric”.

allurai_india · October 3, 2019, 1:55pm

Hi,

Is there any way I can increment the index of the result like this? I tried putting an integer variable in a loop, but it failed: iEnumResult(i).ToString()

iEnumResult(0).ToString()
iEnumResult(1).ToString()
iEnumResult(2).ToString()
iEnumResult(3).ToString()
iEnumResult(4).ToString()

allurai_india · October 3, 2019, 3:44pm

Thanks for the idea about Matches.

I finally finished my task after 3 days of struggle.

I put “iEnumResult(i).ToString()” in a For loop and added each value to a data table.

Then I exported the table to a CSV file. Now I can pull all the policies from the 2000 pages in one go.
Thanks @vijayakumarkj, @indra
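
For readers outside UiPath, the loop-and-export approach described above can be sketched in Python roughly as follows. The sample text, the column name PolicyNumber, and the in-memory buffer are assumptions for illustration; a real workflow would read the OCR output and write to a file on disk.

```python
import csv
import io
import re

# Invented OCR text; in practice this would be the text of all pages.
text = "POLICY #: 10001\nPOLICY #: 20002\nPOLICY #: 10001\n"

# Collect every match, like the Result collection of a Matches step.
matches = list(re.finditer(r"POLICY\s*#:\s*(\d+)", text))

# Index into the matches with a counter and build one row per policy number.
rows = []
for i in range(len(matches)):
    rows.append([matches[i].group(1)])

# Write a header plus the rows as CSV (to memory here, to keep it self-contained).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["PolicyNumber"])  # hypothetical column name
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

Repeated policy numbers are kept as separate rows here; removing duplicates would be an extra step if only distinct policies are wanted.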

system · October 6, 2019, 3:49pm

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
