Can we extract unstructured PDF data without using OCR?

I have some PDF files. Can anyone suggest a way to extract unstructured data from PDFs without using any OCR? Is it possible, and if so, how?

@naveen5,

OCR is used to get text from scanned PDF files.
If you are trying to extract data from a normal PDF file, one from which you can copy text directly, you can use the Read PDF Text activity.

To get exact data fields, you can use RegEx.
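As a rough illustration of the RegEx idea, here is a minimal Python sketch. It is not a UiPath activity; the sample text, the field labels, and the `extract_fields` helper are all hypothetical, and it assumes the text has already been read from the PDF's text layer.

```python
import re

# Hypothetical text, standing in for what a digital PDF's text layer might contain.
SAMPLE_TEXT = """Invoice Number: INV-1042
Invoice Date: 2023-05-17
Total Amount: 1,250.00
"""

# Hypothetical field patterns; each one captures the value that follows a fixed label.
FIELD_PATTERNS = {
    "invoice_number": r"Invoice Number:\s*(\S+)",
    "invoice_date": r"Invoice Date:\s*(\d{4}-\d{2}-\d{2})",
    "total_amount": r"Total Amount:\s*([\d,]+\.\d{2})",
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return a dict of field name -> captured value, or None when the label is not found."""
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(SAMPLE_TEXT)
```

Returning `None` for a missing label, instead of raising, makes it easy to check which fields were not found before using the result.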

Thanks,
Ashok :slight_smile:

I know I can use Regex for basic things, but what if I want to extract the line items?

@naveen5,

You can leverage Document Understanding for more reliable output than Regex.

Thanks,
Ashok :slight_smile:

Hi @naveen5

If you have only digital PDFs (not images), then yes, you can extract the text using the PDF activities, and you can use Regex to get the specific data. But as you mentioned, it's unstructured data, so there is a high chance the extraction will break even if you use Regex. A Regex pattern relies on fixed pointers (anchors) in the text; if those pointers change, the data won't be extracted.
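To make that fragility concrete, here is a small Python sketch for line items. The two layouts, the `LINE_ITEM` pattern, and the `extract_line_items` helper are hypothetical examples, not taken from any real document: the pattern works on the layout it was written for and finds nothing once the layout changes.

```python
import re

# Hypothetical layout the pattern was written for: description, quantity, unit price, amount.
LAYOUT_A = """Description Qty Unit Price Amount
Widget A 2 10.00 20.00
Gadget B 1 15.50 15.50
Total 35.50
"""

# Hypothetical second layout carrying the same data in a different arrangement.
LAYOUT_B = """2 x Widget A @ 10.00 = 20.00
1 x Gadget B @ 15.50 = 15.50
"""

# One line item per line: free text, then an integer quantity and two decimal amounts.
LINE_ITEM = re.compile(
    r"^(?P<desc>.+?)\s+(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})\s+(?P<total>\d+\.\d{2})$",
    re.MULTILINE,
)


def extract_line_items(text):
    """Return one dict per line that matches the expected column layout."""
    return [m.groupdict() for m in LINE_ITEM.finditer(text)]


items_a = extract_line_items(LAYOUT_A)  # two line items; header and total rows are skipped
items_b = extract_line_items(LAYOUT_B)  # empty list: the layout no longer fits the pattern
```

Each new layout would need its own pattern, which is the maintenance cost that a template-free or trained extraction tool is meant to avoid.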

Either Document Understanding or ABBYY FlexiCapture is recommended in this scenario. Neither is free, though; some cost is involved.

Hope this helps.

Thanks,
Srini