Document Understanding - Extract data from multiple PDFs where the document structure is different for every file

vaishnavi_velayutham · April 9, 2022, 6:42pm

Hi All,

I’m working on a small project where I want a bot to extract details from bus, train, or flight tickets. I need fields such as the date, departure, arrival, and amount.

The challenge I am facing is that the ticket format differs depending on the travel operator or how the ticket was booked.
I get correct output only for the file format that I added as a template in the Form Extractor. For other files it does not work as expected: the workflow raises no error, but the output is null.

Also, in the Present Validation Station the bot does not identify the document type, even though it classifies the document correctly.

Kindly note that I’m not using invoices here; I’m using travel tickets.

Can someone help me with this?

fernando_zuluaga · April 10, 2022, 3:26am

hey

Please try the Regex Extractor or the ML Extractor.

regards

vaishnavi_velayutham · April 10, 2022, 3:44am

Hi @fernando_zuluaga ,

I have tried both. For the ML Extractor, no ML Skill URL is available for my kind of data. I used the Invoices URL, but it still didn’t work. I will try the Regex Extractor again and check.

Thanks.

fernando_zuluaga · April 10, 2022, 3:46am

Sure, please refer to this.

regards!

supermanPunch · April 10, 2022, 10:13am

Hi @vaishnavi_velayutham ,

Based on the two points mentioned:

At first glance, we cannot tell whether there is a finite number of ticket formats. If there is, then we could use the Regex Extractor with a configuration for each ticket type.

But if the ticket formats are not fixed, meaning there can be any number of them, the Regex Extractor would not be much help unless the values to extract always appear next to common keywords across all the different tickets.
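To illustrate the common-keywords case, here is a minimal sketch in plain Python (not a UiPath Regex Extractor configuration). It assumes each value sits on a single line after a label and a colon. The label variants in `FIELD_LABELS` and the function names are hypothetical and would have to come from your actual tickets.

```python
import re

# Hypothetical label variants per field; replace with labels seen on real tickets.
FIELD_LABELS = {
    "date": ["Date of Journey", "Travel Date", "Departure Date"],
    "departure": ["From", "Origin", "Boarding Point"],
    "arrival": ["To", "Destination", "Dropping Point"],
    "amount": ["Total Fare", "Amount Paid", "Total Amount"],
}


def build_pattern(labels):
    # Match any known label at the start of a line, a mandatory colon,
    # then capture the rest of that line without surrounding spaces.
    alternatives = "|".join(re.escape(label) for label in labels)
    return re.compile(
        rf"^[ \t]*(?:{alternatives})[ \t]*:[ \t]*(.+?)[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )


PATTERNS = {field: build_pattern(labels) for field, labels in FIELD_LABELS.items()}


def extract_fields(text):
    # Return one value per field, or None when no known label is present.
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result
```

A ticket whose labels are not in the lists yields `None` for every field, so this approach only scales if the set of label variants stays manageable.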

In this case, we would rely on a Document Understanding model: label the available ticket formats using the labelling feature, generate the dataset, train the model, deploy it as an ML Skill, and then use that skill for extraction in the ML Extractor.

Also, I do not think the Invoices model will be of use here. If you could provide a few more details about the ticket formats, maybe we could help you further. Mainly the following details:

Are the tickets digital or scanned?
Are the ticket keywords the same across the different formats?
A screenshot of a ticket

vaishnavi_velayutham · April 10, 2022, 12:00pm

Hi @supermanPunch ,

Yeah, the regular expression approach is not working as expected.

I’m currently using digital documents only. Maybe in the future I will try image formats.
The keywords are not the same across tickets; they differ for every format.
For example, in flight tickets the keywords differ between IndiGo and SpiceJet tickets, and there are any number of airlines and operators like this. This is the challenge I’m facing.

Can you please share some more information, or any reference, about training on a dataset and deploying the result as an ML Skill for the ML Extractor?

supermanPunch · April 10, 2022, 2:56pm

@vaishnavi_velayutham ,

As a reference, we would first suggest going through the UiPath documentation on Document Understanding in AI Center.

The posts below describe some steps to perform the same:

Let us know if you still need further assistance after going through the steps.

sharon.palawandram · April 11, 2022, 3:03am

Hi, there are multiple ways to do this extraction.

Option 1 - use the Form Extractor.
Before the Form Extractor, you need to classify the document types using the classification trainer and Classify Document Scope. Next, define a template for each type in the Form Extractor.

Option 2 - train a model on your documents in AI Center. For this, you can use the out-of-the-box (OOTB) Document Understanding model.

I think, given your scope of work, you can try Classify Document Scope together with Data Extraction Scope using the Form Extractor.
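The classify-then-extract idea behind Option 1 can be sketched in plain Python. This is only an illustration of the flow, not the UiPath activities themselves. The keyword lists in `DOC_TYPE_KEYWORDS`, the patterns in `TEMPLATES`, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical keywords that hint at each document type.
DOC_TYPE_KEYWORDS = {
    "flight": ["flight no", "airline", "boarding gate"],
    "train": ["train no", "coach", "berth"],
    "bus": ["bus operator", "boarding point", "seat no"],
}

# Hypothetical per-type extraction templates (field name -> regex).
TEMPLATES = {
    "flight": {"amount": r"Total Fare\s*:\s*(\S+)"},
    "train": {"amount": r"Ticket Fare\s*:\s*(\S+)"},
    "bus": {"amount": r"Amount Paid\s*:\s*(\S+)"},
}


def classify(text, min_hits=1):
    # Score each type by how many of its keywords occur in the text.
    lowered = text.lower()
    scores = {
        doc_type: sum(keyword in lowered for keyword in keywords)
        for doc_type, keywords in DOC_TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)  # on a tie, the first type listed wins
    return best if scores[best] >= min_hits else None


def extract(text):
    # Classify first, then apply only the template for that document type.
    doc_type = classify(text)
    if doc_type is None:
        return None, {}
    fields = {}
    for name, pattern in TEMPLATES[doc_type].items():
        match = re.search(pattern, text, re.IGNORECASE)
        fields[name] = match.group(1) if match else None
    return doc_type, fields
```

The point of the split is that extraction rules stay small and specific to one document type, while an unclassified document is reported as such instead of silently producing empty fields.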

vaishnavi_velayutham · April 11, 2022, 3:09am

Thanks @sharon.palawandram @supermanPunch @fernando_zuluaga
I will try with the suggested solutions.

sharon.palawandram · April 11, 2022, 4:42am

you’re welcome.
