This is my UiPath workflow, which I built by following a YouTube guru.
I ran into an issue when trying to use a Document Understanding ML extractor to extract the row content from a PDF table. I’ve set up the Template Manager as well.
Some of the columns are neat, meaning the extracted items are exactly the ones I need.
But some of the columns contain other information. How do I extract only the required information? (E.g. the Description column is not neat; it contains extra information.)
The row height may change from PDF A to PDF B. How do I address this? For example, some items have a longer Description, so that row is taller.
Previously, I tried using Regex to extract the data, but it seems difficult.
The photo below shows the area that I wish to extract from the table (especially the Internal P/N).
Could you let us know whether the PDF documents will always be digital? If not, I would suggest using Document Understanding with ML extractors (the Purchase Orders ML package/endpoint), as the Form Extractor cannot capture dynamic rows that change from one input to the next.
If they are always digital, we could try Regex or the Regex Extractor, but we would need some sample data to confirm whether it is really feasible. If there are definite patterns that identify the rows in the text extracted by the Read PDF Text activity, it should be possible with regular expressions.
If you are comfortable with code, what I normally use is the iTextSharp library in C# with Invoke Code to get all the PDF data and fields. Then you can write an algorithm to do what you need.
The PDF is a digital version. However, as you can see in the 2 photos below, the parts highlighted in yellow are the required parts. Those marked with a red arrow are not stored individually, which makes extraction with the Form Extractor difficult.
Do you think it’s possible with the help of Regex?
Thank you for your suggestion. I have no knowledge of the iTextSharp library.
Is there any other approach I can use?
E.g. something that lets the system know, for each line item, which parts are required (qty, part number, unit price), so that it can extract them intelligently even when the PDF row size changes.
If you have access to AI Center, you can train an ML model on the documents and try extracting that way, as each field has its own identifier beside it and it’s a continuous table.
Apart from that, you can go with regex, but it would be a little difficult for this. First read all the data into a text file, and then try defining each pattern that would extract the values you need.
Hi Anil, thank you. I’m currently using the UiPath Community edition, so I’m not sure if I have access to AI Center. But I do have access to the API for those extractors (the Form Extractor in this case).
Regex is very difficult even though I have a fixed line-number pattern (00010, 00020): the starting and ending words of these lines are not always the same, so it’s kind of difficult.
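One way around the varying start and end words is to use the item numbers alone as anchors and treat everything from one item number up to the next as a single line item, however many lines it spans. The Python sketch below only illustrates that idea. The sample text, the 5-digit item number format, and the function name are assumptions, not taken from the real PDF, and in UiPath the same pattern would be applied to the Read PDF Text output.

```python
import re

# Hypothetical text, loosely shaped like Read PDF Text output.
SAMPLE_TEXT = """\
Item  Description            Qty   Unit Price
00010 Bracket, steel         5     12.50
      Internal P/N: AB-1001
00020 Long description that
      wraps onto extra lines 2     7.00
      Internal P/N: CD-2002
"""

# Assumption: every line item starts a line with exactly five digits.
ITEM_START = re.compile(r"^\d{5}\b", re.MULTILINE)


def split_line_items(text):
    """Return one text block per line item, regardless of row height."""
    starts = [m.start() for m in ITEM_START.finditer(text)]
    # Each block ends where the next item starts; the last runs to the end.
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


if __name__ == "__main__":
    for block in split_line_items(SAMPLE_TEXT):
        print(block)
        print("---")
```

Note that the last block runs to the end of the text, so any page footer after the final item would need to be trimmed separately.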
To extract only the required information from a PDF table using Document Understanding ML in UiPath, you can combine several extraction methods.
Start with the following:
Use a taxonomy to define the structure of the table. This will help the ML model identify the different columns in the table and the types of data they contain.
Use anchors to identify the start and end of each row in the table. This will help the ML model extract the data from each row accurately, even if the row height varies.
Use regular expressions to extract specific pieces of information from the extracted data. For example, you can use a regular expression to extract the product name from the Description column.
A quick example:
1. Create a taxonomy to define the structure of the table. The taxonomy should define the names of the columns in the table and the type of data that each column contains.
2. Create a Document Understanding ML template and configure it to use the taxonomy that you created in step 1.
3. Use the Document Understanding ML activity to extract the data from the PDF table.
4. Use regular expressions to extract the specific pieces of information that you need from the extracted data.
For example:
(?<productName>[A-Za-z0-9 ]+)
This regular expression matches a run of alphanumeric characters and spaces and stops at the first other character, such as a comma or a slash. Applied to the Description column, the named group productName will contain the product name.
You can use a similar approach to extract other pieces of information from the extracted data.
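To see how the named group behaves, here is a minimal Python sketch. Python writes named groups as `(?P<name>...)`, whereas the `(?<name>...)` form above is the .NET syntax that UiPath uses. The sample descriptions and the helper name are made up for illustration.

```python
import re

# Same character class as the pattern above, in Python's named-group syntax.
PRODUCT_NAME = re.compile(r"(?P<productName>[A-Za-z0-9 ]+)")


def extract_product_name(description):
    """Return the first alphanumeric-and-space run, or None if there is none."""
    match = PRODUCT_NAME.search(description)
    if match is None:
        return None
    return match.group("productName").strip()


if __name__ == "__main__":
    # Hypothetical Description values.
    print(extract_product_name("Bracket steel 5mm / Internal P/N: AB-1001"))
    print(extract_product_name("Bracket, steel"))
```

The second call returns only `Bracket`, because the match stops at the comma. If product names can contain punctuation, the character class would need to be widened.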
If you are having difficulty with regular expressions, you can test them in the RegEx Builder, which opens from the Matches activity.
As already mentioned, you can rule out the Form Extractor because the rows are dynamic. But before going with Document Understanding, I would suggest checking feasibility by understanding which fields need to be captured.
If we can identify a common pattern among the fields across different samples, it should be possible with regex/string manipulation.
However, if there are different templates, I would suggest going with Document Understanding.
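A rough way to run that feasibility check is to write one pattern per field and see which fields come back empty on each sample. The Python sketch below assumes an "Internal P/N" label, a 5-digit item number at the start of the row, and a unit price with two decimals at the end of a line. All of these are guesses about the layout and would need adjusting to the real extracted text.

```python
import re

# Hypothetical per-field patterns; labels and number formats are assumptions.
FIELD_PATTERNS = {
    "item_no": re.compile(r"^(\d{5})\b", re.MULTILINE),
    "internal_pn": re.compile(r"Internal P/N\s*:?\s*(\S+)"),
    "unit_price": re.compile(r"(\d+\.\d{2})\s*$", re.MULTILINE),
}


def extract_fields(block):
    """Apply every field pattern to one line-item block; None means not found."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(block)
        result[name] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    # Hypothetical block for a single line item.
    block = (
        "00010 Bracket, steel         5     12.50\n"
        "      Internal P/N: AB-1001"
    )
    print(extract_fields(block))
```

If a field comes back as None on some samples but not others, that points to multiple templates, which is where Document Understanding becomes the better fit.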
Thanks for your advice. If the row height is fixed but the number of rows varies, meaning each page may contain 0 to a maximum of 3 rows, what is the best way to extract the values highlighted in yellow and those indicated by the red arrow (most important)?
We do see a pattern in the values to be extracted. Is it possible to share the sample PDF or the text extracted by the Read PDF Text activity (with Preserve Format enabled)?
We could then provide the regex expressions/workflow solution after testing it on our end.
I appreciate your kind help with the sample above. I tried pasting the PO above into ChatGPT to get the regex and then validating it in a regex tester/UiPath, but I was unable to get regexes that capture all the desired results (highlighted in yellow and indicated by the red arrow).
For the sample you sent, we are able to get the required expected output. You could validate it, test it with different samples, and let us know if it does not work. Regex_Extract_SpecificData2.zip (26.8 KB)
However, the other format mentioned in your initial post will not be handled, as it seems to have one field added and one field removed, so the extraction is not uniform. If the extraction also needs to handle different formats, we would need to know how many formats there are and then try to generalise if possible; otherwise we would have to use Document Understanding for the various templates of the document.