How to extract highlighted or ticked data from a PDF

Hello all

I need to extract the highlighted or ticked data from a specified PDF.


Can anybody please help me do it?

Use the Read PDF Text activity and write the output to a text file; then we can extract the data using regex.
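As a rough illustration of the regex step, here is a minimal Python sketch. It assumes the PDF text has already been saved as plain text, and the sample lines, the pattern, and the function name `extract_icd10_codes` are all made up for this example. Note that plain text carries no highlight or tick information, so this finds every code-like token on the page, not only the highlighted ones.

```python
import re

# Assumed shape of an ICD-10 style code: a letter, a digit, one more
# letter or digit, then an optional dot followed by 1-4 letters or digits.
ICD10_PATTERN = re.compile(r"\b[A-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def extract_icd10_codes(text):
    """Return every ICD-10-like token found in the text, in reading order."""
    return ICD10_PATTERN.findall(text)


# Hypothetical text, standing in for the contents of the saved text file.
sample_text = """Description one L85.1
Description two E10.42
Description three M20.22"""

codes = extract_icd10_codes(sample_text)
print(codes)
```

Five-digit CPT codes such as 99213 are not matched by this pattern because they do not start with a letter; they would need a separate pattern.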

I need to take the highlighted cells of Description and ICD10 from the second image, for example:
Acq Keratosis palmaris et dystrophy L85.1
Mycotic nails, multiple B3.5 and so on…

It may vary in other documents.

Can you provide a sample PDF file?

I have pasted the sample file in the question.

Can anybody please show me how to do it?

If possible, can you post a sample PDF file? The file posted above is an image.

Hi @divya.17290 ,

When extracting data from documents, we first need to understand whether the document is digital or scanned. In your case, we need to know whether it is a digital PDF or a scanned PDF. Based on this, we can move forward with the appropriate suggestions.
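One rough way to tell the two apart: a digital PDF has a text layer, while a scanned PDF returns little or no text from a plain text read. The Python sketch below only illustrates that heuristic. It assumes you already have the text of each page as a string (for example, from a plain text read of the PDF); the function name `looks_scanned` and the `min_chars` threshold are hypothetical and would need tuning on real documents.

```python
def looks_scanned(page_texts, min_chars=20):
    """Guess whether a PDF is scanned from the text extracted per page.

    page_texts: list of strings, one per page, from a plain (non-OCR) read.
    min_chars:  assumed threshold below which a page counts as "no text".
    """
    if not page_texts:
        # No text at all was extracted, so treat the file as scanned.
        return True
    sparse_pages = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    # Call it scanned when more than half of the pages have almost no text.
    return sparse_pages > len(page_texts) / 2


# Hypothetical inputs: one file with a text layer, one without.
digital_pages = ["Account 1001 Description ICD10 L85.1 E10.42", "Account 1002 CPT 99213 M2 25"]
scanned_pages = ["", "   "]

print(looks_scanned(digital_pages))
print(looks_scanned(scanned_pages))
```

This is only a first check; a scanned file that has already been through OCR can also carry a text layer, so opening the PDF and trying to select text is still worth doing.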

Also, we need to understand what the second image you provided is. Is it from the application?

If your document is scanned and you need to extract only the ticked data, then you would need to use Document Understanding, where you would also need to train the DU model by labelling these datasets.

Alternatively, you could try to get the data as a digital document, perhaps by understanding how the document is generated. Then we should be able to extract the needed data from the digital document.

Let us know your thoughts on the suggestions provided.

The second one will be a PDF file.

A document contains multiple pages, and each page belongs to a different account number. The data is then entered per account number in one application.

From each page, I should extract the highlighted cells (right to left).

@divya.17290 ,

It does seem to be a digital PDF, but we cannot confirm that from the image. Could you confirm this and also let us know whether there are checkboxes in the PDF?

Secondly, what is the purpose of the first image? Are we supposed to check the first image and extract the ticked data from the second PDF? Is that the normal process, or is there a condition being applied to extract data from the PDF in the second image?

Please provide more details on the normal process steps so that we can understand it better.

The first image is what is currently used in the manual process, and the second image is the new one that I have been given for development work.

I have attached a sample PDF file. There are 3 tables on one page.
First table: I need to extract, right to left, the data beside each highlighted cell (such as a circle).
Ex: Array={L85.1 , L6B35.1M20.22,E10.42,L85.3}

Second table: I need to extract the highlighted CPT cell and the highlighted Modifier cell.
Ex: CPT = 99213, M2 = 25

Third table: I need to extract the highlighted cell data left to right.
Ex: CPT = 10060, M1 = LT
CPT = 11056, M3 = 50…

TEMP.pdf (115.9 KB)

Can anybody suggest how to do it?

I have attached the PDF. Can you suggest a solution?

You have to use Document Understanding. Let me check this; if I find a solution, I will post it here.

Yes, thank you.

I have tried Document Understanding, but I got an Object reference error while using the Digitize Document activity. I have raised this in another post: How to fix Object reference error in Digitize document

Hi all

I need to extract the shaded circle data from a digital PDF. Can anybody suggest how to do it?

Hi @divya.17290

The Read PDF With OCR activity can be used.

Hello @divya.17290

For a digital PDF, you have to use the Read PDF With OCR activity. You can try changing the OCR engine and compare the accuracy.

If it is still not working, share a sample file here.

Thanks