Document Understanding - Extract data from multiple PDFs where the document structure is different for every file

Hi All,

I’m working on a small project where I want a bot to extract details from bus, train, or flight tickets. I need fields such as date, departure, arrival, and amount.

The challenge I am facing is that the ticket format differs depending on the travel operator or how the ticket was booked.
I’m getting correct output only for the file format that I added as a template in the Form Extractor. For other files it does not work as expected: I’m not getting any error in my workflow, but the output is null.

Also, in the Present Validation Station, the bot is not identifying the document type, even though it is able to classify the document correctly.

Please note that I’m not using invoices here; I’m using travel tickets.

Can someone help me with this?

hey

please try the Regex Extractor or the ML Extractor

regards

Hi @fernando_zuluaga ,

I have tried both. For the ML Extractor, no ML skill URL is available for my data. I used the Invoices URL, but it still didn’t work. I will try the Regex Extractor again and check.

Thanks.

sure, please refer to this

regards!

Hi @vaishnavi_velayutham ,

Based on the two points mentioned:

At first glance, we cannot tell whether there is a finite number of ticket formats. If there is, then we could perhaps use the Regex Extractor for each ticket type.

But if the ticket formats are not fixed, meaning there can be any number of them, the Regex Extractor would not be much help unless all the different tickets share common keywords for the values to extract.
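To make the common-keyword idea concrete, here is a minimal Python sketch. It is not the UiPath Regex Extractor, only an illustration of the same approach on plain text. The keyword lists, the `Label: value` line layout, and the function names are all assumptions made up for this example; real tickets would need their own lists.

```python
import re

# Hypothetical keyword variants per field (assumed for illustration only).
FIELD_KEYWORDS = {
    "date": ["Date of Journey", "Travel Date", "Departure Date"],
    "departure": ["From", "Origin", "Departure"],
    "arrival": ["To", "Destination", "Arrival"],
    "amount": ["Total Fare", "Amount Paid", "Total Amount"],
}


def build_pattern(keywords):
    """Build one regex matching a line like 'Keyword: value' or 'Keyword - value'."""
    # Longest keywords first so the more specific label is tried first.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    return re.compile(
        rf"^\s*(?:{alternatives})\s*[:\-]\s*(.+?)\s*$",
        re.IGNORECASE | re.MULTILINE,
    )


PATTERNS = {field: build_pattern(kws) for field, kws in FIELD_KEYWORDS.items()}


def extract_fields(text):
    """Return a dict of field -> extracted value, or None when no keyword matched."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result
```

A field comes back as `None` whenever none of its keywords appears in the text, which is the same kind of empty result seen when a ticket does not fit the configured format. This only scales if the set of keyword variants stays small; once every operator uses its own labels, the lists grow without bound.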

In that case, we would rely on a Document Understanding model, which we would have to train on the available ticket formats using the labelling feature: generate the dataset, train the model, deploy it as an ML Skill, and then use it for extraction in the ML Extractor.

Also, I do not think the Invoices model will be of use here. If you could provide a few more details about the ticket formats, maybe we could help you further. Mainly the following details:

  1. Are the tickets digital or scanned?
  2. Are the ticket keywords the same across the different formats?
  3. A screenshot of a ticket

Hi @supermanPunch ,

Yeah, the regular expression approach is not working as expected.

I’m currently using digital documents only. Maybe in the future I will try image formats.
The keywords are not the same in all the tickets; they are different for every format.
For example, in flight tickets the keywords differ between IndiGo and SpiceJet tickets, and we have N airlines/operators like this. This is the challenge I’m facing.

Could you please share some more information about training on a dataset and providing the model as an ML Skill in the ML Extractor? Any reference?

@vaishnavi_velayutham ,

As a reference, we would first suggest going through the documentation available from UiPath on Document Understanding in AI Center.

The posts below describe some steps to do this:

Let us know if you still need further assistance after going through the Steps.

Hi, there are multiple ways to do this extraction.

Option 1 - Use the Form Extractor.
Before the Form Extractor, you need to classify the document types using the classification trainer and the Classify Document Scope. Next, you define the templates for the document types in the Form Extractor.

Option 2 - Train on the documents in AI Center. For this you can use the Document Understanding OOTB model.

I think, given your scope of work, you can try the Classify Document Scope with the Data Extraction Scope using the Form Extractor.
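The classify-then-extract flow from Option 1 can be sketched in plain Python to show the logic. This is only a rough illustration under stated assumptions, not the UiPath activities themselves: the document types, classification keywords, per-type patterns, and function names below are hypothetical.

```python
import re

# Hypothetical keywords used to decide the document type (assumed for illustration).
CLASS_KEYWORDS = {
    "flight": ["PNR", "Flight No"],
    "bus": ["Bus Operator", "Boarding Point"],
}

# Hypothetical per-type "templates": one regex per field for each document type.
TEMPLATES = {
    "flight": {
        "departure": r"From\s*:\s*(\w+)",
        "amount": r"Total Fare\s*:\s*([\d,.]+)",
    },
    "bus": {
        "departure": r"Boarding Point\s*:\s*(\w+)",
        "amount": r"Ticket Price\s*:\s*([\d,.]+)",
    },
}


def classify(text):
    """Pick the type whose keywords appear most often; None if nothing matches."""
    lowered = text.lower()
    scores = {
        doc_type: sum(kw.lower() in lowered for kw in keywords)
        for doc_type, keywords in CLASS_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract(text):
    """Classify first, then apply only the template that belongs to that type."""
    doc_type = classify(text)
    if doc_type is None:
        return None, {}
    fields = {}
    for name, pattern in TEMPLATES[doc_type].items():
        match = re.search(pattern, text, re.IGNORECASE)
        fields[name] = match.group(1) if match else None
    return doc_type, fields
```

The point of the sketch is the dependency: extraction can only succeed for a type that has both a classifier entry and a template, so every new ticket layout needs both added before it returns anything other than empty fields.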

Thanks @sharon.palawandram @supermanPunch @fernando_zuluaga
I will try with the suggested solutions.

you’re welcome.