How do I extract data from a form in PDF format only when the form is filled in, and stop the extraction when it is empty? For example, I want to extract the person's details from the form, and stop if the form is empty.
When handling data from documents, there are various ways to get it:
- Open the PDF and scrape the data from it - based on what you get, you continue scraping the rest
- Use Read PDF, then use regex to match the needed field - based on the match, you continue extracting the rest of the fields
- Use Document Understanding and work with the data in its results - the document seems to be a fixed form, so FormExtractor should be very easy to configure
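To illustrate the second option outside of any particular tool, here is a minimal Python sketch. It assumes the PDF text has already been read into a string (for example by a Read PDF activity) and that each field appears on its own line as `Label: value`. The labels, the function name `extract_person` and the choice of `Name` as the field that decides whether the form is empty are all assumptions for illustration, not details taken from the actual form.

```python
import re

# Hypothetical labels; replace them with the labels printed on the real form.
FIELDS = ["Name", "IC", "Nationality", "DOB", "Address"]


def find_value(text, label):
    """Return the text after 'Label:' on the same line, or '' if blank or missing."""
    pattern = rf"^{re.escape(label)}[ \t]*:[ \t]*(.*)$"
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else ""


def extract_person(text, key_field="Name"):
    """Extract all fields, but stop early (return None) if the key field is empty."""
    if not find_value(text, key_field):
        return None  # form is treated as empty, so nothing else is extracted
    return {label: find_value(text, label) for label in FIELDS}
```

Checking one key field first keeps the "stop if empty" decision in a single place. If the real form should count as empty only when every field is blank, the check would need to look at all fields instead.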
For method 1, I only need to scrape specific fields such as Name, IC, Nationality, DOB and Address.
I am currently using method 2, but without regex. I use specific array indices to pick out the needed data.
For method 3, Document Understanding has a limit on the total file size of the PDFs (my PDFs have 4 pages each), and I cannot extract empty fields.
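For reference, the index-based variant of method 2 can be sketched in Python roughly as follows. This is only an illustration under assumptions: the extracted text is split into lines, each value sits at a fixed line position, and the positions in `FIELD_INDEX` are made up for the example rather than taken from the real PDF.

```python
# Hypothetical line positions of each value in the extracted text.
FIELD_INDEX = {"Name": 0, "IC": 1, "Nationality": 2, "DOB": 3, "Address": 4}


def extract_by_index(text, key_field="Name"):
    """Pick values by fixed line index; return None if the key field is empty."""
    lines = [line.strip() for line in text.splitlines()]

    def value_at(index):
        # Guard against texts that are shorter than expected.
        return lines[index] if index < len(lines) else ""

    if not value_at(FIELD_INDEX[key_field]):
        return None  # treat the form as empty and stop here
    return {field: value_at(index) for field, index in FIELD_INDEX.items()}
```

Fixed indices only work while every file produces text with the same layout. If an empty field makes the following lines shift, matching on the field labels is likely to be more reliable.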