Which PDF program is best for scraping data?

When I try to scrape data from a PDF I cannot select individual elements. The screen just treats the whole page like one element. I am running the latest beta of Studio and Adobe Acrobat XI v 11.0.20. Let me know what other information would be helpful.

Hi @KCO_KJackson, check the link below.


I forgot to mention that I am scraping a form, specifically SF 1449 contract forms. I need multiple data fields, such as the solicitation number, addresses, etc. Regex also won’t work, because the order of the extracted text stream doesn’t match the position of the fields on the form. I’ll put together a sample in a bit and upload it to provide clarity.
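To illustrate the point about position versus text order, here is a minimal Python sketch of zone-based extraction. It assumes some PDF tool has already produced words with page coordinates; the `Word` type, the zone names, the coordinates, and the sample values are all hypothetical and not taken from an actual SF 1449.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge in points (assumed top-left origin)
    y: float  # top edge in points


# Hypothetical zones: field name -> (x_min, y_min, x_max, y_max)
ZONES = {
    "solicitation_number": (300, 90, 450, 110),
    "issued_by": (40, 150, 280, 200),
}


def extract_zones(words, zones):
    """Collect the words that fall inside each zone, in reading order."""
    result = {}
    for name, (x0, y0, x1, y1) in zones.items():
        inside = [w for w in words if x0 <= w.x <= x1 and y0 <= w.y <= y1]
        # Sort top-to-bottom, then left-to-right, regardless of stream order
        inside.sort(key=lambda w: (w.y, w.x))
        result[name] = " ".join(w.text for w in inside)
    return result


# Words listed in a scrambled order, as a raw text stream might emit them
words = [
    Word("Office", 90, 160),
    Word("ABC-123", 310, 95),
    Word("Contracting", 45, 160),
]

fields = extract_zones(words, ZONES)
print(fields)
```

Because each field is picked by where it sits on the page, the result does not depend on the order in which the text was emitted.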

Adobe Acrobat Reader DC is the best program: all elements in the PDF are identified, and it’s free to download.
The download link is below. :)


Please also note that you need to enable user elements in the PDF’s properties.

Not every PDF form can be used to extract data. Open the file in Acrobat Reader DC, go to Edit -> Preferences, and click “OK”. You don’t need to change any setting in Acrobat Reader; just confirming the dialog is enough. Then try selecting elements with UiExplorer. You can test this with the sample file I attached.

Invoice_No_20180718001.pdf (34.9 KB)

Both smallpdf and ilovepdf are worth a try.

When I was looking for such a program, many people on different forums recommended smallpdf, but I didn’t like it at all. Besides the lack of features, I found the interface not user-friendly at all. I found https://pdfliner.com/form_8962 by chance, and ever since, it’s the only program I use for scraping data. You can easily select individual elements, edit, add details, and much more. It works best for me, and I really like it because it helps me stay organized and keep my documentation systematized.

How about trying Document Understanding? “Scraping” PDFs amounts to identifying specific pieces of information in documents, and especially if this is a non-varying form, you might want to look into the Form Extractor or the Regex Based Extractor.
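As a rough idea of what regex-based extraction means in general, here is a minimal Python sketch. It is not how the Regex Based Extractor is implemented; the label, the sample text, and the `find_solicitation_number` helper are hypothetical. It only works when the label and its value sit next to each other in the extracted text, which may not hold for every form.

```python
import re

# Hypothetical label format; a real form may word or lay this out differently
PATTERN = re.compile(r"Solicitation\s+Number\s*:\s*(\S+)")


def find_solicitation_number(text):
    """Return the value following the label, or None if the label is absent."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


sample = "Issued By: Contracting Office\nSolicitation Number: ABC-123\nPage 1"
print(find_solicitation_number(sample))
```

If the text stream separates the label from its value, a pattern like this returns nothing, which is where a position-based approach becomes the better fit.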

There’s an Academy course on Document Understanding that is pretty comprehensive; it might be useful for your use case.