PDF Document Form, Data Extraction, (Intelligent) Form Extractor

Hi,

I need to extract data regarding:

  1. Specifications
  2. Typical Properties
  3. Technical Features
  4. Application

from the attached documents, for example:
ACURE-500_EN_A4.pdf (198.6 KB)
ACURE-510-100_EN_A4.pdf (194.1 KB)
ACURE-510-170_EN_A4.pdf (195.4 KB)

The middle of each document is set in two columns.
Sometimes only text needs to be extracted, sometimes a table.
The content under headers 1-4 (which could serve as anchors) varies in length.

How can I do this with UiPath? I have already tried the Intelligent Form Extractor,
but I can’t find the functionality to support it. Am I using it wrong?

Thx for any suggestions,
Vanja

@VanjaV Please create multiple taxonomies, then train your intelligent classifier and the Intelligent Form Extractor on all 3 types of PDFs. The more types/variations of documents you train on, the easier it will be for Document Understanding to extract the data correctly.
Please have a look at the video below and make sure you are using all components and following all steps.

@shetanshudhar Many thanks for the advice.

I don’t know the logic of UiPath ML Trainer. Let’s say:

  1. I define a 1st template with Technical Features and a corresponding text box (custom area) 6 wide and 2 tall.
  2. Then I define a 2nd template with Technical Features and a corresponding text box (custom area) 6 wide and 4 tall.

If the workflow receives a 3rd type of document (not defined as a template) with Technical Features and a corresponding text box (custom area) 6 wide and 3 tall, how will it respond? Will it be recognized?

@VanjaV It will only recognise the templates it has been trained on. For other documents it might recognise some content, but the extraction would not be 100% correct.

@shetanshudhar thx.

Sorry, but I don’t get it. Why do you have to train on the documents if it only works with templates?

To cover all possible combinations in the case I described, I would need to prepare around 30-50 different templates.
On top of that, the same type of document has to be handled for more than 100 companies, which multiplies the number of templates by an indefinite number of combinations.

Why can’t I just define the custom area between two paragraph headers (as anchors) either as text or a table? What is the difference between Form Extractor and Intelligent Form Extractor?
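
To make the anchor idea concrete, here is a minimal Python sketch of what I mean. It is not UiPath functionality; it only illustrates slicing text between section headers. It assumes the PDF text has already been extracted in reading order (so the two columns are linearized) and that each header stands alone on its own line. The function name split_by_anchors and the sample text are hypothetical.

```python
import re

# The four section headers from the question, used as anchors.
ANCHORS = ["Specifications", "Typical Properties", "Technical Features", "Application"]


def split_by_anchors(text, anchors=ANCHORS):
    """Return {anchor: body}, where body is the text between one anchor line
    and the next anchor line (or the end of the text)."""
    # Match an anchor only when it stands alone on a line, ignoring case.
    pattern = re.compile(
        r"^[ \t]*(" + "|".join(re.escape(a) for a in anchors) + r")[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    canonical = {a.lower(): a for a in anchors}
    matches = list(pattern.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # A section ends where the next anchor starts, whatever its length.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[canonical[m.group(1).lower()]] = text[m.end():end].strip()
    return sections


# Hypothetical sample text, not taken from the attached PDFs.
sample = """Specifications
Property A: value 1
Property B: value 2
Typical Properties
Some free text.
Technical Features
Line one
Line two
Line three
Application
Final paragraph.
"""

sections = split_by_anchors(sample)
```

With this approach the length of each section does not matter: a Technical Features block of 2, 3 or 4 lines is captured the same way, because the boundary is the next header rather than a fixed box size.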