How to get the extracted data into one single Excel file for multiple document formats, with extracted fields and tables

Dear All,

I have two different formats of documents from which I am extracting information using Document Understanding and a machine learning extractor. While validating, I can see that the flow is able to extract the information from both documents. In both documents I am also extracting information from tables.

But I am struggling with how to get the extracted data into one single Excel file. I am seeing only the data for the last document (the second document in my case) in the Excel file.

Please help me. I am attaching the workflow screenshots and the Excel file.



Please check your DataTables before the Merge Data Table activity. Just for testing, add a Write Range for each of those DataTables to make sure that they actually contain data. If they are fine, check whether the properties of your Merge Data Table activity are set correctly.
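As a quick illustration of that check, here is a minimal C# sketch (with made-up invoice rows standing in for your extracted DataTables). It shows that, when the two schemas match, a merge simply appends the second table's rows to the first, so the row counts you verify beforehand should add up afterwards:

```csharp
using System;
using System.Data;

public class MergeCheck
{
    // Helper that builds an empty line-items table with a fixed schema
    public static DataTable ItemsTable(string name)
    {
        var dt = new DataTable(name);
        dt.Columns.Add("Item", typeof(string));
        dt.Columns.Add("Amount", typeof(decimal));
        return dt;
    }

    public static void Main()
    {
        // Stand-ins for the tables extracted from each document
        var doc1 = ItemsTable("Doc1");
        doc1.Rows.Add("Widget", 10.5m);
        doc1.Rows.Add("Gadget", 4.0m);

        var doc2 = ItemsTable("Doc2");
        doc2.Rows.Add("Sprocket", 7.25m);

        // The "Write Range for testing" step boils down to confirming both counts are non-zero
        Console.WriteLine($"Doc1: {doc1.Rows.Count} rows, Doc2: {doc2.Rows.Count} rows");

        // With matching schemas (and no primary key), Merge appends the second table's rows
        doc1.Merge(doc2);
        Console.WriteLine($"After merge: {doc1.Rows.Count} rows"); // 3
    }
}
```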

@rahul.kumar3,

Could you please share the input file/document you are using? I will try to recreate the scenario.

Thanks,
Ashok :slight_smile:

Dear Ashok,

Please find the XAML file and the PDF files attached. Please also make sure you have the dependencies installed while working on the workflow. Using logs, I found that datasetResults.Tables(2).RowCount shows 5 rows in total across the two documents, so the only issue is that the 2 rows from the first document are not getting populated in the Excel file; only the 3 rows for the last document are displayed.
Main.xaml (32.1 KB)
Main.xaml.json (141 Bytes)
project.json (1.9 KB)
Report.xlsx (7.6 KB)
Invoice 1.pdf (28.8 KB)
Invoice 873101.pdf (36.5 KB)

You can also look at this thread from when I was working with a colleague here, Ave, but unfortunately I am still struggling with the logic and steps to achieve the desired output: https://forum.uipath.com/t/how-to-get-the-extracted-data-in-one-single-excel-file-for-multiple-formats-of-documents-with-extracted-fields-and-tables/729978?u=rahul.kumar3

@rahul.kumar3 I can help you with the approach to take for extraction; you can discuss it with your senior and take a call.

Use classification: during extraction, use DU classification to identify the document type and keep track of each record.
Store the data in a DB: pull the data extracted from each invoice and store it in a database together with the invoice file name and processing time.
Validate data points: I can see you are working with invoices. It is possible that, due to bad quality of an invoice, the data cannot be extracted; in that case, send the document to the Validation Station so a business user can check the data points, and once the bot receives the corrected data from the Validation Station, write it to the DB.
Use SQL joins: with a SQL query you can pull this data into one single data table and write it into Excel.
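To make the last point concrete: assuming a hypothetical schema where invoice headers and line items sit in separate tables keyed by invoice number, the SQL is a plain inner join. The sketch below shows the equivalent SQL in a comment and performs the same join in memory with LINQ over two DataTables, so you can see the shape of the consolidated result (all table and column names here are made up):

```csharp
using System;
using System.Data;
using System.Linq;

public class JoinSketch
{
    // In-memory analogue of:
    //   SELECT h.InvoiceNo, h.Vendor, l.Item
    //   FROM InvoiceHeader h
    //   JOIN InvoiceLines  l ON l.InvoiceNo = h.InvoiceNo
    public static string[] Join(DataTable headers, DataTable lines)
    {
        return (from h in headers.AsEnumerable()
                join l in lines.AsEnumerable()
                  on h.Field<string>("InvoiceNo") equals l.Field<string>("InvoiceNo")
                select $"{h.Field<string>("InvoiceNo")}\t{h.Field<string>("Vendor")}\t{l.Field<string>("Item")}")
               .ToArray();
    }

    public static void Main()
    {
        // Hypothetical header and line-item tables, as they might come back from the DB
        var headers = new DataTable("InvoiceHeader");
        headers.Columns.Add("InvoiceNo", typeof(string));
        headers.Columns.Add("Vendor", typeof(string));
        headers.Rows.Add("873101", "Acme Corp");

        var lines = new DataTable("InvoiceLines");
        lines.Columns.Add("InvoiceNo", typeof(string));
        lines.Columns.Add("Item", typeof(string));
        lines.Rows.Add("873101", "Widget");
        lines.Rows.Add("873101", "Gadget");

        // Each header row is repeated once per matching line item
        foreach (var row in Join(headers, lines))
            Console.WriteLine(row);
    }
}
```

The joined result is one flat table (header fields repeated per line item), which is exactly the shape a single Write Range expects.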

Below are some additional tips that may help you.

To consolidate extracted data from multiple formats (with fields and tables) into a single Excel file using UiPath and Document Understanding (DU), here is a refined approach that addresses the most likely issues:

Workflow Steps:

  1. Initialize Data Storage:

    • Create a single Excel file (e.g., ConsolidatedData.xlsx) where you’ll write the extracted data.
    • Consider the following DataTables to store extracted data:
      • FieldsDataTable: For storing extracted fields from each document. This will have columns for document ID (optional), field names, and respective values.
      • TablesDataTable: To store extracted tables from each document. This will have columns representing the table structure (headers) and table data rows.
  2. Iterate Through Documents:

    • Use a For Each File activity or similar looping mechanism to iterate through all your documents.
    • Use document type identification logic (if needed) to determine the document format (e.g., PDF, Word) for appropriate handling.
  3. Extract Data from Each Document:

    • Employ the appropriate DU activities based on document format:

      • For PDFs and unstructured documents: Use Extract Text with Tables.
      • For structured documents (e.g., Word with consistent layouts): Consider using pre-trained models (from Document Understanding AI Center) or training your own custom machine learning extractor.
    • Store extracted field data in the FieldsDataTable. For each document, create a new row with the document ID (optional) and extracted field name-value pairs.

    • Capture extracted tables in the TablesDataTable. Each document’s table will have a separate set of rows, including header names and table data.

    • Key Point: Ensure both DataTables have consistent column structures (field names, table header names) for seamless merging in the next step.

  4. Merge DataTables (Optional, Conditional):

    • If you need to merge extracted field data across documents before writing to Excel, use DataTable.Merge with appropriate column mappings. This is especially helpful if field names or structures might differ slightly between documents.
    • For tables, you can either merge them into a single master table with all rows (if formats allow) or write them as separate tables in the Excel file.
  5. Write Extracted Data to Excel:

    • After the loop, use the Write Range activity (or, inside the loop, Append Range, so that each document’s rows are added below the previous ones instead of overwriting them) to write the following data to separate sheets (or a single sheet with appropriate formatting):
      • Fields:
        • Write the FieldsDataTable to a sheet named “ExtractedFields”.
      • Tables:
        • If you have a single master TablesDataTable (from merging), write it to a sheet named “ExtractedTables”.
        • Alternatively, you can iterate through extracted tables per document and write each to a separate sheet within the Excel file, naming them based on the document or using a descriptive naming convention.
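The overwrite symptom from the original post (only the last document’s rows showing up) is exactly what a plain Write Range inside the loop produces, since each iteration writes to the same starting cell. Below is a minimal C# sketch of the accumulate-then-write-once pattern, with stub tables (2 rows + 3 rows, mirroring the 5 rows reported in the logs) in place of the real DU output:

```csharp
using System;
using System.Data;

public class AccumulateThenWrite
{
    // Merge all per-document tables into one master table (schemas must match)
    public static DataTable MergeAll(DataTable[] tables)
    {
        DataTable master = null;
        foreach (var dt in tables)
        {
            if (master == null)
                master = dt.Copy();   // first table defines the schema
            else
                master.Merge(dt);     // subsequent tables' rows are appended
        }
        return master;
    }

    // Helper that builds a line-items table from (Item, Qty) pairs
    public static DataTable ItemsTable(params (string Item, int Qty)[] rows)
    {
        var dt = new DataTable();
        dt.Columns.Add("Item", typeof(string));
        dt.Columns.Add("Qty", typeof(int));
        foreach (var (item, qty) in rows) dt.Rows.Add(item, qty);
        return dt;
    }

    public static void Main()
    {
        // Stand-ins for the line items extracted from the two invoices
        var doc1 = ItemsTable(("Widget", 2), ("Gadget", 1));
        var doc2 = ItemsTable(("Sprocket", 5), ("Bolt", 9), ("Nut", 9));

        var master = MergeAll(new[] { doc1, doc2 });

        // One Write Range after the loop gets all 5 rows into the sheet;
        // here we just report the count that would be written
        Console.WriteLine($"Rows to write: {master.Rows.Count}"); // 5
    }
}
```

In the UiPath workflow this corresponds to keeping a master DataTable variable, merging into it on each iteration, and placing the single Write Range after the loop.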

Additional Considerations:

  • Error Handling: Implement Try Catch blocks or validation checks to gracefully handle potential issues during extraction, merging, or writing to Excel. Ensure data integrity.
  • Data Validation and Cleaning: After writing to Excel, consider applying data validation rules (e.g., data types) or filtering if needed.
  • Excel File Management: If handling a large number of documents, you might want to consider creating a new Excel file for each set of documents (e.g., every 100 documents) to avoid memory limitations.
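For the error-handling point, one common shape is a try/catch per document so that a single bad scan is logged (and routed to the Validation Station) instead of aborting the whole batch. In the sketch below, `ExtractFields` is a hypothetical stand-in for the real DU extraction call:

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public class BatchGuard
{
    // Hypothetical stand-in for extracting one document's fields with DU;
    // it fails on purpose for paths containing "bad"
    public static DataTable ExtractFields(string filePath)
    {
        if (filePath.Contains("bad"))
            throw new InvalidOperationException("extraction failed");
        var dt = new DataTable();
        dt.Columns.Add("FieldName", typeof(string));
        dt.Columns.Add("FieldValue", typeof(string));
        dt.Rows.Add("InvoiceNo", "873101");
        return dt;
    }

    // Process every file; collect failures instead of letting one abort the batch
    public static (DataTable Master, List<string> Failed) ProcessAll(IEnumerable<string> paths)
    {
        var failed = new List<string>();
        DataTable master = null;
        foreach (var path in paths)
        {
            try
            {
                var dt = ExtractFields(path);
                if (master == null) master = dt.Copy(); else master.Merge(dt);
            }
            catch (Exception ex)
            {
                // Log and continue; in UiPath this is where you would queue
                // the document for the Validation Station
                Console.WriteLine($"Skipped {path}: {ex.Message}");
                failed.Add(path);
            }
        }
        return (master, failed);
    }

    public static void Main()
    {
        var (master, failed) =
            ProcessAll(new[] { "Invoice 1.pdf", "bad-scan.pdf", "Invoice 873101.pdf" });
        Console.WriteLine($"Consolidated rows: {master.Rows.Count}, failed: {failed.Count}");
    }
}
```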

Example Code Snippet (Illustrative):

// Initialization
DataTable fieldsDataTable = new DataTable("Fields");
// Columns for document ID (optional), field name, and field value
fieldsDataTable.Columns.Add("DocumentID", typeof(string)); // Optional
fieldsDataTable.Columns.Add("FieldName", typeof(string));
fieldsDataTable.Columns.Add("FieldValue", typeof(string));

DataTable tablesDataTable = new DataTable("Tables");
// Add columns representing the extracted tables' structure (headers)

// Loop through documents
foreach (string filePath in fileList) {
    // Extract data using the appropriate DU activities
    // ...

    // Store the extracted fields
    DataRow newRow = fieldsDataTable.NewRow();
    newRow["DocumentID"] = documentID; // Optional
    newRow["FieldName"] = fieldName;
    newRow["FieldValue"] = fieldValue;
    fieldsDataTable.Rows.Add(newRow);

    // Store extracted tables (accumulate with DataTable.Merge rather than
    // writing to Excel here: a Write Range inside the loop overwrites the
    // rows from previous documents)
    // ...
}

// Write the consolidated DataTables to Excel once, after the loop
// ...

By following these steps and adapting the code snippet to your specific DU activities and data structures, you should be able to successfully consolidate extracted data from multiple document formats into a single, well-structured Excel file.