I have some PDF files. Can anyone suggest how to extract unstructured data from PDFs without using OCR? Is it possible, and if so, how?
OCR is used to get text from scanned PDF files.
If you are trying to extract data from a normal (digital) PDF file, one from which you can copy text directly, you can use the Read PDF Text activity.
To get exact data fields, you can then apply RegEx to the extracted text.
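As a rough illustration of the RegEx step, here is a minimal Python sketch. It assumes the text has already been read from the PDF; the sample text, the field labels, and the `extract_fields` helper are all made up for the example and would need to be adapted to your own documents.

```python
import re

# Hypothetical text, standing in for the output of a PDF text-extraction step.
text = """Invoice Number: INV-1042
Invoice Date: 2023-05-14
Total Amount: 1,250.00
"""

# One pattern per field. The labels are assumptions about the document layout.
patterns = {
    "invoice_number": r"Invoice Number:\s*(\S+)",
    "invoice_date": r"Invoice Date:\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total Amount:\s*([\d,]+\.\d{2})",
}

def extract_fields(text, patterns):
    """Return a dict of field name -> captured value, or None when no match."""
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result

fields = extract_fields(text, patterns)
print(fields)
```

A field whose label is not found comes back as None, which makes it easy to spot documents that do not follow the expected layout.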
Thanks,
Ashok
I know I can use regex for basic things, but what if I want to extract the line items?
You can leverage Document Understanding for more reliable output than Regex.
Thanks,
Ashok
Hi @naveen5
If you have only digital PDFs (not images), then yes, you can extract the text using the PDF activities, and you can use Regex to get the specific data. But as you mentioned, it's unstructured data, so there is a high chance the extraction will break even if you use Regex. In Regex you define pointers (anchors) for what to extract; if those pointers change, the data won't be extracted.
Either Document Understanding or ABBYY FlexiCapture is recommended in this scenario. Note that these are not free; some cost is involved.
Hope this helps.
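To make the point about changing pointers concrete, here is a small Python sketch for line items. Everything in it is hypothetical: the `LINE_ITEM` pattern, the `parse_line_items` helper, and both sample layouts are assumptions for illustration, not something taken from your documents. The pattern expects rows of the form description, quantity, unit price, amount.

```python
import re

# Hypothetical row pattern: description, quantity, unit price, amount.
LINE_ITEM = re.compile(
    r"^(?P<desc>.+?)\s+(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})\s+(?P<amount>\d+\.\d{2})$"
)

def parse_line_items(text):
    """Collect every line that matches the expected row layout."""
    items = []
    for line in text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match:
            items.append(match.groupdict())
    return items

# Layout the pattern was written for.
layout_a = """Description Qty Price Amount
Widget A 2 10.00 20.00
Cable 3m 1 5.50 5.50
"""

# Same data from another vendor: columns reordered, unit price dropped.
layout_b = """Qty Description Amount
2 Widget A 20.00
1 Cable 3m 5.50
"""

items_a = parse_line_items(layout_a)
items_b = parse_line_items(layout_b)
print(len(items_a), len(items_b))
```

The first layout yields both rows, while the second yields none, even though a person reading it sees the same line items. That silent failure is the reason a template-free tool tends to be more reliable for unstructured documents.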
Thanks,
Srini