Extracting information from large PDF file with Document Understanding

harsha_prakash · August 13, 2023, 8:12am

We are trying to extract data from a large PDF of around 50 pages. Data needs to be extracted from almost all pages. The PDF template/layout is fixed, with no significant handwritten data. Each page has some tables and some text/combo box fields (around 50-100 fields). We need to implement this without Action Center.

I have used the Keyword Based Classifier, UiPath Document OCR, and the Form Extractor. Extraction seems to be fine.

Because of the size of the PDF (50 pages and around 50 fields on each page), managing fields in the Taxonomy Manager is becoming difficult.
Is there a better way of doing this? Do you see any challenges with this approach? Are there best practices to keep in mind while designing DU in such cases?

supermanPunch · August 13, 2023, 8:31am

Hi @harsha_prakash ,

Could you tell us whether the PDF is a digital document, a scanned document, or a mixture of both? We ask because the PDF template is described as fixed. Could you also mention what type of document it is and what data is being extracted, or the taxonomy fields defined?

harsha_prakash · August 13, 2023, 9:56am

Thanks @supermanPunch for your response. It is not a digital document; it is a scanned document. It is a US tax-related document. We are extracting all the fields that the user might have filled in while filing taxes, including tables. There can also be many fields without any data.
Let me know if you need more info. Thanks again.

Usha_Jyothi · August 13, 2023, 9:57am

Then use Read PDF Text and then use regex to get the values.

harsha_prakash · August 13, 2023, 10:20am

Thanks @Usha_Jyothi for your response. This is not a digital document but a scanned document, which also has a lot of table data. Many of the tables don't have any headers. Along with tables, it also has many label-and-value fields, and a few values don't have any labels. The document is around 50 pages.
So I was not sure whether we can do such complex extraction using Read PDF Text and regex.
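
To illustrate the concern, here is a minimal Python sketch (not UiPath code) of what a regex pass over OCR text could look like. The sample text, the field names, and the `extract_labelled_fields` helper are all hypothetical; it only assumes the OCR step returns plain text with "Label: value" lines.

```python
import re

# Hypothetical OCR output for one page; real text would come from an OCR engine.
ocr_text = """\
Name: Jane Doe
Filing status: Single
Total income: 52,300.00
41,200.00
"""

# Match "Label: value" lines; the label is everything before the first colon.
LABEL_VALUE = re.compile(
    r"^(?P<label>[^:\n]+):[ \t]*(?P<value>.*)$", re.MULTILINE
)


def extract_labelled_fields(text: str) -> dict[str, str]:
    """Return a {label: value} mapping for every 'Label: value' line."""
    return {
        m.group("label").strip(): m.group("value").strip()
        for m in LABEL_VALUE.finditer(text)
    }


fields = extract_labelled_fields(ocr_text)
print(fields)
```

The labelled lines are picked up, but the last line (a value with no label) is skipped, and header-less table rows would have the same problem, since the pattern has nothing to anchor on.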

Anil_G · August 13, 2023, 2:44pm

@harsha_prakash

Welcome to the community

If the Form Extractor is working, then the structure of each document is probably the same and the tables have a similar amount of data.

If that is the case, then without AI Center, the Form Extractor would be the best option.

Cheers

harsha_prakash · August 14, 2023, 8:52am

Thanks @Anil_G for the welcome and for the response. I will go ahead with the Form Extractor for now. Thanks again!
