Can we extract unstructured PDF data without using OCR?

naveen5 · May 21, 2024, 6:36am

I have some PDF files. Can anyone give me an idea of how to extract unstructured data from PDFs without using any OCR? Is it possible, and if so, how?

ashokkarale · May 21, 2024, 6:48am

@naveen5,

OCR is used to get text from a scanned PDF file.
If you are trying to extract data from a normal (digital) PDF file, one from which you can copy text directly, you can use the Read PDF Text activity.

To get exact data fields, you can use RegEx.
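
As a rough illustration of the RegEx approach, here is a minimal Python sketch. It assumes the PDF text has already been read into a string; the sample text, the field labels, and the `extract_fields` helper are all hypothetical and would need adjusting to your own documents.

```python
import re

# Hypothetical sample, standing in for the text read from a digital PDF.
text = """Invoice No: INV-1042
Invoice Date: 21/05/2024
Total Amount: 1,250.00
"""

# One pattern per field; each captures the value that follows a fixed label.
# The labels are assumptions about the document, not a general rule.
patterns = {
    "invoice_no": r"Invoice No:\s*(\S+)",
    "invoice_date": r"Invoice Date:\s*(\d{2}/\d{2}/\d{4})",
    "total": r"Total Amount:\s*([\d,]+\.\d{2})",
}

def extract_fields(text, patterns):
    """Return a dict of field name -> captured value (None if not found)."""
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else None
    return fields

fields = extract_fields(text, patterns)
print(fields)
```

A field whose label is missing from the text comes back as `None`, so the caller can tell which patterns failed to match.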

Thanks,
Ashok

naveen5 · May 21, 2024, 7:02am

I know this already. For basic things I can use regex, but what if I want to extract the line items?

ashokkarale · May 21, 2024, 7:10am

@naveen5,

You can leverage Document Understanding for more reliable output than Regex.

Thanks,
Ashok

Srini84 · May 21, 2024, 7:20am

Hi @naveen5

If you have only digital PDFs (not images), then yes, you can extract the text using the PDF activities, and you can use Regex to get the specific data. But since you mentioned it is unstructured data, there is a high chance the extraction will break even if you use Regex. In Regex you define pointers (anchors) to extract against, and if those pointers change, the data won't be extracted.
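
To make that concrete for line items, here is a small Python sketch under stated assumptions: both sample layouts and the `parse_line_items` helper are hypothetical, and the pattern assumes each row reads "description, quantity, unit price, amount" separated by spaces.

```python
import re

# Assumes each line item is: description  qty  unit_price  amount
# [ \t]+ is used instead of \s+ so a match never spans two lines.
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)[ \t]+(?P<qty>\d+)[ \t]+"
    r"(?P<unit_price>[\d,]+\.\d{2})[ \t]+(?P<amount>[\d,]+\.\d{2})$",
    re.MULTILINE,
)

def parse_line_items(text):
    """Return one dict per line that matches the assumed row layout."""
    return [m.groupdict() for m in LINE_ITEM.finditer(text)]

# Hypothetical layout that the pattern was written for.
layout_a = """Description Qty Unit Price Amount
Widget A 2 10.00 20.00
Widget B 1 5.50 5.50
Total 25.50
"""

# Hypothetical second layout: same data, different column order and separator.
layout_b = """Item | Price | Qty
Widget A | 10.00 | 2
Widget B | 5.50 | 1
"""

items_a = parse_line_items(layout_a)  # both widget rows are captured
items_b = parse_line_items(layout_b)  # nothing matches: the layout changed
print(len(items_a), len(items_b))
```

The same pattern that captures both rows in the first layout finds nothing in the second, which is the kind of breakage to expect when documents do not share a fixed format.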

Going with either Document Understanding or ABBYY FlexiCapture is recommended in this scenario. But these are not free; some cost is involved.

Hope this helps you

Thanks,
Srini
