This is my UiPath workflow, which I built by following a YouTube guru.
I ran into an issue when I tried to use Document Understanding ML to extract the row content from a PDF table. I’ve set up the Template Manager as well.
Some of the columns are clean, meaning the extracted items are exactly the ones I need.
But some columns contain other information. How do I extract only the required
information? (E.g. the Description column is not clean; it contains extra information.)
The height of some rows might change from PDF A to PDF B. How can I address this?
For example, some items may have a longer Description, so that row is taller.
Previously, I tried using Regex to extract the data, but it seemed difficult.
The photo below shows the area that I wish to extract from the table (especially the Internal P/N).
Could you let us know whether the PDF documents will always be digital? If not, I would suggest using Document Understanding with ML Extractors (Purchase Orders ML Package/Endpoint), as Form Extractors cannot capture the dynamic rows that appear in the PDF for different inputs.
If they are always digital, we could try Regex or the Regex Extractor, but we would need some sample data to confirm whether it is really feasible. If there are definite patterns that identify the rows in the text extracted by the Read PDF Text activity, it should be possible with regular expressions.
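As a rough illustration of what "definite patterns identifying the rows" could look like, here is a minimal Python sketch. The sample text, the column order and the "Internal P/N:" label are all assumptions, since the real PO layout is not shown in this thread. The point is only that a row can be anchored on its leading columns and its trailing label, so a Description that wraps over several lines does not break the match.

```python
import re

# Hypothetical text, as it might come back from a Read PDF Text step.
# The real PO layout is not shown in the thread; this layout is assumed.
SAMPLE_TEXT = """\
Item Qty Unit Price Description
1 10 PCS 2.50 Hex bolt M8 x 40
Internal P/N: AB-1001
2 5 PCS 12.00 Bearing housing, cast iron,
machined both faces
Internal P/N: CD-2002
"""

# A row starts with item/qty/unit/price at the beginning of a line and ends
# at the following "Internal P/N:" line. The description is matched lazily
# and may span several lines (DOTALL), which absorbs changes in row height.
ROW_PATTERN = re.compile(
    r"^(?P<item>\d+)\s+(?P<qty>\d+)\s+(?P<unit>[A-Z]+)\s+(?P<price>\d+\.\d{2})\s+"
    r"(?P<description>.*?)\s*^Internal P/N:\s*(?P<internal_pn>\S+)",
    re.MULTILINE | re.DOTALL,
)


def extract_rows(text):
    """Return one dict per table row found in the extracted text."""
    rows = []
    for match in ROW_PATTERN.finditer(text):
        row = match.groupdict()
        # Collapse line breaks inside a wrapped description into single spaces.
        row["description"] = " ".join(row["description"].split())
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_TEXT):
        print(row)
```

If the same idea is carried over to a UiPath Matches activity, note that .NET regex writes named groups as `(?<name>...)` rather than Python's `(?P<name>...)`, so the pattern would need adjusting.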
The PDF is digital. However, as you can see in the two photos below, the parts highlighted in yellow are the required parts. The values marked with red arrows are not stored individually, which makes extraction with the Form Extractor difficult.
Do you think it’s possible with the help of Regex?
As already mentioned, you can rule out Form Extractors because of the dynamic rows. But before moving to Document Understanding, I would suggest checking feasibility by working out which fields need to be captured.
If we can identify a common pattern among the fields across different samples, then it should be possible using regex or string manipulation.
However, if there are different templates, I would suggest going for Document Understanding.
Thanks for your advice. If the row height is fixed but the number of rows varies, meaning each page may contain 0 to at most 3 rows, what is the best way to extract the values highlighted in yellow and those indicated by the red arrows (most important)?
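One possible way to cope with a varying number of rows per page is to collect every match on a page instead of expecting a fixed count. The sketch below is in Python and is not the workflow from this thread; the page texts, the "Internal P/N:" label and the part-number values are made up for illustration.

```python
import re

# Hypothetical per-page text; labels and part numbers are assumptions.
PAGES = [
    "Purchase order header only, no line items on this page",
    "1 10 PCS Hex bolt\nInternal P/N: AB-1001\n"
    "2 5 PCS Washer\nInternal P/N: AB-1002",
    "3 1 PCS Gasket\nInternal P/N: CD-2001\n"
    "4 2 PCS Seal\nInternal P/N: CD-2002\n"
    "5 8 PCS Nut\nInternal P/N: CD-2003",
]

# Capture the token that follows the label, wherever the label occurs.
PN_PATTERN = re.compile(r"Internal P/N:\s*(\S+)")


def part_numbers_per_page(pages):
    """Return a list of part numbers for each page (empty if none found)."""
    return [PN_PATTERN.findall(page) for page in pages]


if __name__ == "__main__":
    for number, found in enumerate(part_numbers_per_page(PAGES), start=1):
        print(f"page {number}: {found}")
```

Because `findall` returns every match, a page with no rows simply yields an empty list and a page with three rows yields three values, so no per-page row count has to be configured.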
I appreciate your kind help with the sample above. I tried pasting the PO above into ChatGPT to get a regex and then validated it in regex / UiPath, but I was unable to get a regex that captures all the desired results (highlighted in yellow and indicated by the red arrows).
For the sample you sent, we are able to get the expected output. You could validate it, test it with different samples, and let us know if it does not work. Regex_Extract_SpecificData2.zip (26.8 KB)
However, the other format mentioned in your initial post will not be handled; it seems that one field has been added and one removed, so the extraction is not uniform. If the extraction also needs to handle different formats, we would need to know how many formats there are and then try to generalise if possible; otherwise we would have to use Document Understanding for the various templates of the document.
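If the number of formats turns out to be small, one way to generalise is to keep one pattern per known format and try them in turn, falling back to Document Understanding when none matches. The following Python sketch assumes two invented formats (`format_a` and `format_b`, which differ by one added and one removed field); the real formats would have to be derived from actual samples.

```python
import re

# Hypothetical formats: format_b drops the quantity column and adds a due date.
# Both patterns and the part-number shape (two letters, dash, four digits)
# are assumptions made for illustration only.
FORMAT_PATTERNS = {
    "format_a": re.compile(
        r"^(?P<item>\d+)\s+(?P<qty>\d+)\s+(?P<internal_pn>[A-Z]{2}-\d{4})$",
        re.MULTILINE,
    ),
    "format_b": re.compile(
        r"^(?P<item>\d+)\s+(?P<internal_pn>[A-Z]{2}-\d{4})\s+"
        r"(?P<due_date>\d{4}-\d{2}-\d{2})$",
        re.MULTILINE,
    ),
}


def extract_with_known_formats(text):
    """Try each known format in order; return (format_name, rows).

    Returns (None, []) when no pattern matches, which a caller could treat
    as the signal to route the document to Document Understanding instead.
    """
    for name, pattern in FORMAT_PATTERNS.items():
        rows = [match.groupdict() for match in pattern.finditer(text)]
        if rows:
            return name, rows
    return None, []


if __name__ == "__main__":
    print(extract_with_known_formats("1 10 AB-1001\n2 3 AB-1002"))
    print(extract_with_known_formats("1 AB-1001 2024-01-15"))
    print(extract_with_known_formats("no table on this page"))
```

This only stays manageable while the list of formats is short and stable; with many templates, maintaining a pattern per format is likely to cost more than training Document Understanding.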