Extracting information from a large PDF file with Document Understanding

We are trying to extract data from a large PDF of around 50 pages. Data needs to be extracted from almost all pages. The PDF template/layout is fixed, with no major handwritten data. Each page has some tables and some text/combo box fields (around 50-100 fields). We need to implement this without Action Center.

I have used the Keyword Based Classifier, UiPath Document OCR, and the Form Extractor for the extraction. Extraction seems to be fine.

  1. Because of the size of the PDF (50 pages and around 50 fields on each page), managing the fields in Taxonomy Manager is becoming difficult.

  2. Is there a better way of doing this? Do you see any challenges with this approach? Are there best practices to keep in mind while designing DU in such cases?

Hi @harsha_prakash ,

Could you tell us whether the PDF is a digital document, a scanned document, or a mixture of both? I ask because the PDF template is described as fixed. Please also mention what type of document it is and what data is being extracted, or which taxonomy fields are defined.

Thanks @supermanPunch for your response. It is not a digital document; it is a scanned document. It is a US tax-related document. We are extracting every field the user might have filled in while filing taxes, including tables. There can also be many fields without any data.
Let me know if you need more info. Thanks again.

Then use Read PDF Text and then use regex to get the values.
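As a rough illustration of the regex idea, here is a minimal Python sketch. It assumes the text has already been extracted (from a digital PDF or from OCR output) and that each field sits on its own line as "Label: value". The sample text, the pattern, and the `extract_fields` helper are all hypothetical, not anything taken from the actual tax document.

```python
import re
from typing import Dict

# Hypothetical sample of text as it might come back from a text-extraction step.
SAMPLE_TEXT = """\
Filing status: Single
First name: Jane
Last name:
Total income: 54,300.00
"""

# One "Label: value" pair per line. [ \t]* (not \s*) keeps the match on one line,
# so a field with no value yields an empty string instead of swallowing the next line.
FIELD_PATTERN = re.compile(
    r"^(?P<label>[^:\n]+):[ \t]*(?P<value>.*)$", re.MULTILINE
)


def extract_fields(text: str) -> Dict[str, str]:
    """Return a label -> value mapping; empty fields are kept as empty strings."""
    return {
        m.group("label").strip(): m.group("value").strip()
        for m in FIELD_PATTERN.finditer(text)
    }


fields = extract_fields(SAMPLE_TEXT)
for label, value in fields.items():
    print(f"{label!r}: {value!r}")
```

A pattern like this only covers labelled, single-line fields; tables without headers and values without labels would need a different approach.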

Thanks @Usha_Jyothi for your response. This is not a digital document but a scanned document, which also contains a lot of table data, and many of the tables don't have any headers. Along with the tables, it has many label-and-value fields, and a few values don't have any labels. The document is around 50 pages.
So I am not sure we can do such a complex extraction using Read PDF Text and regex.


Welcome to the community

If Form Extractor is working, then the structure of each document is most likely the same, and the tables will have a similar amount of data.

If that is the case, then without AI Center, Form Extractor would be the best option.
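On the original concern about keeping so many fields manageable, one possible aid is to maintain the field list outside the tool with a consistent page-based naming scheme and check it for duplicates before entering it in the taxonomy. The sketch below is plain Python and not a UiPath API; the `P<page>_<section>_<label>` convention and the sample entries are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def field_name(page: int, section: str, label: str) -> str:
    """Build a name such as 'P03_Income_TotalWages' (hypothetical convention)."""
    return f"P{page:02d}_{section}_{label}"


# Hypothetical inventory: (page, section, label) for each field to extract.
INVENTORY: List[Tuple[int, str, str]] = [
    (1, "Header", "FilingStatus"),
    (1, "Header", "FirstName"),
    (3, "Income", "TotalWages"),
    (3, "Income", "TotalWages"),  # deliberate duplicate to show the check
]

names = [field_name(*row) for row in INVENTORY]

# Names that occur more than once would collide in a flat field list.
duplicates = sorted(n for n, count in Counter(names).items() if count > 1)

# Group names by page so each page's fields can be reviewed together.
by_page: Dict[int, List[str]] = {}
for (page, _section, _label), name in zip(INVENTORY, names):
    by_page.setdefault(page, []).append(name)

print("Duplicates:", duplicates)
for page, page_fields in sorted(by_page.items()):
    print(f"Page {page}: {len(page_fields)} field(s)")
```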


Thanks @Anil_G for the welcome and for the response. I will go ahead with Form Extractor for now. Thanks again!
