Table/Page Classification in a PDF document

sumitd · February 5, 2024, 7:24pm

I am trying to extract 3 financial tables from annual financial reports (PDF) using the “FinancialStatements” ML model provided by UiPath (Document Understanding - Financial Statements - ML Package).

Data is mostly available in a tabular format, but the location of the table in the PDF is not fixed (could be on page 20 vs. page 230).

This is what I am currently facing:

I tried the Intelligent Classifier as well as a manual keyword classifier to classify the pages and obtain the 3 tables I want from the ~300-page financial report.
When I present the full 300-page financial statement, the process is unable to classify the correct page and therefore data extraction fails.
If I present just the correct page (manually extracted from the PDF) to the Classifier and the ML model, everything works as expected and the correct data is extracted.

Have you faced this issue? How did you get around it?
I am trying to understand the best approach to handle this type of extraction from a ~300-page PDF.

Thank you!

Anil_G · February 5, 2024, 8:03pm

@sumitd

Extract the data separately and then try to combine it.

The problem is probably not the number of pages but the size of the file…there are limitations on size.

Cheers

jose.ordonez1 · February 5, 2024, 8:23pm

Hi Sumit,

You need to train the ML extractor. For more information check the following link

Hope this helps!

sumitd · February 5, 2024, 8:27pm

Thanks @Anil_G - could you explain your solution a bit more?

sumitd · February 5, 2024, 8:29pm

Thanks @jose.ordonez1 - the ML extractor is doing its job as expected. Data is correctly extracted when only the correct pages are presented.

The classifier is unable to classify the correct page for data extraction.

Anil_G · February 5, 2024, 8:34pm

@sumitd

You would split the PDF into pages and then send the individual pages…if data is returned, then the table is present and you can extract it; if not, the table is not there, so move on to the next page.
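The page-by-page approach described above could be sketched roughly as follows. This is a minimal Python illustration, not UiPath code: `extract_table` is a hypothetical stand-in for whatever call sends one page to the extractor, and it is assumed to return `None` when no table is found.

```python
from typing import Callable, Optional


def find_table_pages(
    pages: list[str],
    extract_table: Callable[[str], Optional[dict]],
) -> dict[int, dict]:
    """Send pages one at a time; keep results only for pages that return data."""
    results: dict[int, dict] = {}
    for page_number, page in enumerate(pages, start=1):
        data = extract_table(page)  # hypothetical extractor call
        if data:  # data returned: the table is present on this page
            results[page_number] = data
        # otherwise no table was found here, so move on to the next page
    return results


# Toy stand-in extractor, for illustration only.
def fake_extractor(page_text: str) -> Optional[dict]:
    if "Balance Sheet" in page_text:
        return {"table": "Balance Sheet"}
    return None


pages = ["Chairman's letter", "Consolidated Balance Sheet ...", "Notes"]
found = find_table_pages(pages, fake_extractor)
```

Note that this sends every page to the extractor, so if each call is metered, the cost grows with the page count.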

Cheers

jose.ordonez1 · February 5, 2024, 8:58pm

Hi Sumit,
Please check the following steps:

Go to 50_Extract workflow and validate if Data Extraction Scope is configured:


NOTE: Both endpoint and API Key are available in AI Center (DataSets)

Hope this helps!

sumitd · February 5, 2024, 9:08pm

Yes, this is properly configured. Thanks!

jose.ordonez1 · February 5, 2024, 9:11pm

Hi Sumitd,
I have a quick question. Have you labeled the data in all the input documents (training and validation sets) in Document Manager? I ask because I have had this problem before.

Cheers!

sumitd · February 5, 2024, 9:14pm

@Anil_G
How I am handling it today is by running a dispatcher on the document before presenting it to the extractor.

The dispatcher extracts the correct pages from the PDF using RegEx (this is what I want to get away from; I was hoping Keyword Based Classifiers would do this instead).
The correct extracted pages are uploaded to a queue.
When the extractor runs, it gets the correct pages from the queue and extracts the data.

Now, why I am using RegEx is because I do not want to spend AI Units on pages that do not have data.
These PDFs can go up to 600 pages and we will be processing 100s of these every month.
Keyword Based Classifiers will consume 0 units to do this.
(Document Understanding - Metering and Charging Logic)
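The dispatcher step described above might look something like this in outline. This is a Python sketch rather than a UiPath workflow: the page texts are assumed to be already available as strings, the two heading patterns are illustrative guesses (a third table would be added the same way), and a `deque` stands in for the queue.

```python
import re
from collections import deque

# Hypothetical heading patterns; real reports may word these differently.
TABLE_PATTERNS = {
    "balance_sheet": re.compile(r"\bbalance\s+sheets?\b", re.IGNORECASE),
    "income_statement": re.compile(r"\bincome\s+statements?\b", re.IGNORECASE),
}


def select_pages(page_texts: list[str]) -> dict[str, list[int]]:
    """Return the 1-based page numbers whose text matches each pattern."""
    hits: dict[str, list[int]] = {name: [] for name in TABLE_PATTERNS}
    for page_number, text in enumerate(page_texts, start=1):
        for name, pattern in TABLE_PATTERNS.items():
            if pattern.search(text):
                hits[name].append(page_number)
    return hits


def dispatch(page_texts: list[str], queue: deque) -> None:
    """Add one queue item per matching page; non-matching pages are skipped."""
    for name, page_numbers in select_pages(page_texts).items():
        for page_number in page_numbers:
            queue.append({"table": name, "page": page_number})


page_texts = [
    "Table of contents",
    "Consolidated Balance Sheet\nAssets",
    "Consolidated Income Statement",
]
work_queue: deque = deque()
dispatch(page_texts, work_queue)
```

Only the pages that end up in the queue would be sent on to the extractor, which is the point of filtering before any metered call.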

sumitd · February 5, 2024, 9:14pm

Yes, this is done as well!

srinivasmarneni · February 5, 2024, 10:46pm

Hi,

Extracting specific financial tables from a large PDF document like an annual financial report can indeed be challenging, especially when the location of the tables is not fixed. However, there are several strategies you can employ to improve the accuracy and efficiency of your process using UiPath’s Document Understanding framework. Here’s a step-by-step approach:

Improve the Classification Process:

Train Your Classifier: If you have a set of similar financial reports, consider training the classifier with examples from these reports. This can help the classifier better understand the structure and format of your specific documents.
Use Regular Expressions (Regex): If the tables you’re looking for have specific titles or headers (like “Balance Sheet”, “Income Statement”), you can use Regex-based classifiers to identify the pages that contain these tables.
Custom ML Model: If the predefined models are not performing well, consider training a custom machine learning model specifically for your use case. You might need a dataset of labeled examples for this approach.

Optimize Data Extraction:

Fine-tune the ML Model: If the FinancialStatements ML model isn’t extracting data accurately, you might need to fine-tune the model with more examples, especially examples that are similar to the tables in your reports.
Use Anchors: If there are consistent elements near the tables (like specific texts or headings), use them as anchors to help locate the tables.

Process the Document in Chunks:

Split the Document: Instead of processing the entire 300-page document at once, consider splitting the document into smaller sections. This can be done based on sections or a fixed number of pages.
Parallel Processing: Process these chunks in parallel to speed up the overall processing time.
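As a rough illustration of splitting by a fixed number of pages and processing the chunks in parallel, here is a small Python sketch. `process_chunk` is a hypothetical placeholder for whatever classification or extraction runs on each chunk, and the chunk size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (start, end) ranges."""
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def process_chunk(page_range: tuple[int, int]) -> str:
    """Hypothetical placeholder for the per-chunk work."""
    start, end = page_range
    return f"processed pages {start}-{end}"


ranges = chunk_ranges(total_pages=300, chunk_size=50)
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields results in the same order as the input ranges
    results = list(pool.map(process_chunk, ranges))
```

One caveat with fixed-size chunks: a table that straddles a chunk boundary would be split across two chunks, so overlapping the ranges by a page or two may be worth considering.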

Post-Processing and Validation:

Review and Correct: Implement a review step where a human can quickly validate and correct the extracted data. This can also be used to further train your models.
Cross-reference: Use cross-referencing techniques to ensure the consistency and accuracy of the extracted data (e.g., total values matching across different tables).
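One simple cross-reference check, sketched in Python, is the accounting identity that total assets should equal total liabilities plus total equity. The parameter names and the tolerance below are assumptions for illustration; they are not field names from the FinancialStatements model.

```python
from decimal import Decimal


def values_match(a: Decimal, b: Decimal, tolerance: Decimal = Decimal("1")) -> bool:
    """True if two extracted amounts agree within a rounding tolerance."""
    return abs(a - b) <= tolerance


def balance_sheet_is_consistent(
    total_assets: Decimal,
    total_liabilities: Decimal,
    total_equity: Decimal,
) -> bool:
    """Check that assets = liabilities + equity on the extracted values."""
    return values_match(total_assets, total_liabilities + total_equity)


ok = balance_sheet_is_consistent(Decimal("1500"), Decimal("900"), Decimal("600"))
bad = balance_sheet_is_consistent(Decimal("1500"), Decimal("900"), Decimal("500"))
```

A failed check could be used to route the document to the human review step rather than to reject it outright.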

Leverage UiPath Activities and Features:

Use Document Understanding Activities: Make sure you are utilizing all relevant activities in UiPath, such as ‘Data Extraction Scope’, ‘Present Validation Station’, etc.
Experiment with Different Models: Apart from the FinancialStatements model, try other pre-trained models or custom models that might be better suited for your specific data.

Feedback Loop:

Implement a feedback mechanism where the output of the extraction process is used to continuously improve the model. This includes retraining the model with new data or tweaking the rules and parameters based on the output.

sumitd · February 5, 2024, 10:56pm

Thanks @srinivasmarneni, I am training the Classifiers, hoping this will solve it.

RegEx Classifiers → Is this available as an activity?

I just see these:

srinivasmarneni · February 5, 2024, 11:01pm

You’re welcome!

As for the RegEx Classifier, it is not explicitly listed as a standalone activity in UiPath. However, you can use regular expressions within the “Keyword Based Classifier” or even in the “Intelligent Keyword Classifier” to classify documents based on patterns that match specific keywords or phrases.

If you’re looking to implement classification based on regular expressions more directly, you may need to use the “Matches” or “Is Match” activity in UiPath, which allows you to utilize regular expressions to search within strings. You can create a workflow that uses these activities to classify documents by checking for the presence of regex patterns.

To do this:

Use the “Matches” or “Is Match” activity to apply your regular expression to the text extracted from your documents.
Based on whether a match is found, you can then assign a classification to the document.

This kind of custom classification logic can be integrated into your overall Document Understanding process.
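For illustration only, the same logic can be sketched in Python, where `re.search` plays roughly the role of an "Is Match" check. The labels and patterns are hypothetical; requiring more than one pattern per label is one possible way to avoid matching pages, such as a table of contents, that merely mention a table's title.

```python
import re
from typing import Optional

# Hypothetical rules: a label is assigned only if ALL of its patterns match.
RULES = {
    "BalanceSheet": [r"balance\s+sheet", r"total\s+assets"],
    "IncomeStatement": [r"income\s+statement", r"\brevenue"],
}


def classify(text: str) -> Optional[str]:
    """Return the first label whose patterns all match, or None."""
    for label, patterns in RULES.items():
        if all(re.search(p, text, re.IGNORECASE) for p in patterns):
            return label
    return None


label = classify("Consolidated Balance Sheet\nTotal assets 1,500")
toc_label = classify("Contents: Balance Sheet ... 45")
```

Here the table-of-contents line gets no label because it mentions the title but not the second cue.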

sumitd · February 5, 2024, 11:52pm

Yes, perfect - I am doing that already at a different point.
But, thank you for confirming the approach!
