Help / Expert advice needed: PDF Table extraction (Purchase Order to Excel)


This is my UiPath workflow, which I built by following a YouTube guru.

I ran into an issue when trying to use the Document Understanding ML extractor to extract the row content from the PDF table. I’ve set up the Template Manager as well.

  1. Some of the columns are clean, meaning the extracted items are exactly the ones I need.
    But some columns contain other information. How do I extract only the required
    information? (E.g. the Description column is not clean; it contains extra information.)

  2. The row height may change from PDF A to PDF B. How do I address this?
    For example, some items may have a longer Description, so that row is
    taller.

  3. Previously, I tried using Regex to extract the data, but it seemed difficult.

The photo below shows the area that I wish to extract from the table (especially the Internal P/N).


Hi @Nilochan ,

Could you let us know whether the PDF documents will always be digital? If not, I would suggest using Document Understanding with ML Extractors (Purchase Orders ML Package/Endpoint), as Form Extractors would not be able to capture the dynamic rows that appear in the PDF at different times or for different inputs.

If they are always digital, we could try Regex or the Regex Extractor, but we would need some sample data to confirm whether it is really feasible. If there are definite patterns that identify the rows in the text extracted by the Read PDF Text activity, it should be possible using regular expressions.


If you are comfortable with it, what I normally use is the iTextSharp library in C# with Invoke Code to get all the PDF data and fields. Then you can create an algorithm to do what you need.


The PDF is a digital version. However, as you can see in the 2 photos below, the parts highlighted in yellow are the required parts. Those marked with red arrows are not stored individually, which makes extraction with the Form Extractor difficult.

Do you think it’s possible with the help of Regex?

Thank you for your suggestion. I have no knowledge of the iTextSharp library.
Is there any other approach that I can use?

E.g. something that lets the system know, for each line item, which parts are required (qty, part number, unit price), so that it can extract them intelligently even when the PDF row size changes.

@Nilochan

If you have access to AI Center, then you can train an ML model on the documents and try to extract the data, as each field has its own identifier beside it and it is a continuous table.

Apart from that, you can go with regex, but it would be a little difficult. For this, first read all the data into a text file and then try defining each pattern
that would extract the data.

Cheers


Hi Anil, thank you. I’m currently using the UiPath Community edition, so I’m not sure if I have access to AI Center. But I do have access to the API for those extractors (the Form Extractor in this case).

Regex is very difficult: although I have a fixed line pattern (00010, 00020), the starting and ending words of these lines are not always the same, so it’s kind of difficult.
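
One way to sidestep the varying start and end words is to use the item number only as a row boundary and keep everything up to the next item number as one block. Below is a minimal Python sketch of that idea, not a UiPath workflow. The sample text, the `split_line_items` name, and the assumption that each row begins on a line starting with a five-digit item number are all hypothetical.

```python
import re

# Hypothetical text standing in for the output of Read PDF Text.
SAMPLE_TEXT = """\
Purchase Order 12345
00010 Widget bracket 10 PC 2.50
Your material no: ABC-001
Extra note spanning
another line
00020 Gasket 5 PC 1.20
Your material no: XYZ-777
"""

# Assumption: a row starts at any line that begins with a 5-digit item number.
ITEM_START = re.compile(r"^\d{5}\s", re.MULTILINE)

def split_line_items(text):
    """Return one text block per line item, however many lines each spans."""
    starts = [m.start() for m in ITEM_START.finditer(text)]
    blocks = []
    for i, start in enumerate(starts):
        # A block runs until the next item number, or to the end of the text.
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        blocks.append(text[start:end].strip())
    return blocks

blocks = split_line_items(SAMPLE_TEXT)
for block in blocks:
    print(block)
    print("---")
```

Because each block ends where the next item number begins, a longer Description simply makes the block longer. Note that the last block runs to the end of the text, so any footer after the final row would need its own end marker.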

@Nilochan

If you are on the Community edition, then you won’t have access to AI Center. You need to activate an Enterprise trial.

Yes, regex is difficult looking at the pattern.

But one pattern is common: “your material no:”. You can use that as an identifier, get the two lines before it, and extract the data.
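
The anchor idea can be sketched roughly as follows in Python. This is only an illustration under assumptions: the sample lines, the function name `rows_around_anchor`, and the exact anchor spelling are hypothetical and would need checking against the real extracted text.

```python
# Hypothetical extracted text; the real layout may differ.
SAMPLE_TEXT = """\
00010 10 PC 2.50
Widget bracket, steel
Your material no: ABC-001
00020 5 PC 1.20
Gasket
Your material no: XYZ-777
"""

def rows_around_anchor(text, anchor="your material no:", lines_before=2):
    """Find each anchor line and return it with the lines just before it."""
    lines = text.splitlines()
    results = []
    for i, line in enumerate(lines):
        if anchor in line.lower():
            start = max(0, i - lines_before)
            results.append({
                # The lines immediately preceding the anchor line.
                "context": lines[start:i],
                # Everything after the first colon on the anchor line.
                "material_no": line.split(":", 1)[1].strip(),
            })
    return results

rows = rows_around_anchor(SAMPLE_TEXT)
for row in rows:
    print(row["material_no"], row["context"])
```

A fixed count of two lines only works if the row height above the anchor is constant. When the Description wraps, the number of preceding lines changes, and splitting on the item number is likely to be more robust.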

Cheers


To extract only the required information from a PDF table using Document Understanding ML in UiPath, you can use a combination of different extraction methods.

It starts with the following:

  1. Use a taxonomy to define the structure of the table. This will help the ML model to identify the different columns in the table and the types of data that they contain.

  2. Use anchors to identify the start and end of each row in the table. This will help the ML model to extract the data from each row accurately, even if the row height varies.

  3. Use regular expressions to extract specific pieces of information from the extracted data. For example, you can use a regular expression to extract the product name from the Description column.

A quick example:

  1. Create a taxonomy to define the structure of the table. The taxonomy should define the following:
  • The names of the columns in the table.
  • The types of data that each column contains.

  2. Create a Document Understanding ML template and configure it to use the taxonomy that you created in step 1.

  3. Use the Document Understanding ML activity to extract the data from the PDF table.

  4. Use regular expressions to extract the specific pieces of information that you need from the extracted data.

For example:

(?<productName>[A-Za-z0-9 ]+)

This regular expression will match any sequence of alphanumeric characters and spaces in the Description column. The captured group productName will contain the product name.

You can use a similar approach to extract other pieces of information from the extracted data.
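
To see what the pattern above captures, here is a small Python check. It is only a sketch: the description string is made up, and Python writes named groups as `(?P<name>...)`, whereas the `(?<name>...)` form shown above is the .NET syntax that UiPath uses.

```python
import re

# Same character class as above; Python needs (?P<name>...) for named groups.
PRODUCT = re.compile(r"(?P<productName>[A-Za-z0-9 ]+)")

# Hypothetical Description value containing extra information.
description = "Steel bracket 40mm; Rev. B / drawing 7781"

match = PRODUCT.search(description)
product_name = match.group("productName").strip() if match else None
print(product_name)
```

The match stops at the first character outside the class (here the semicolon), so this pattern would also cut off product names that contain hyphens, commas, or slashes. The character class usually needs adjusting to the real data.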

If you are having difficulty with regular expressions, you can test them in the RegEx Builder wizard, which opens from the Matches or Is Match activity.

Let us know if you need further clarification.

Cheers @Nilochan


@Nilochan ,

As already mentioned, you can rule out the Form Extractors, since there are dynamic rows. But before moving to Document Understanding, I would suggest checking feasibility by understanding which fields need to be captured.

If we can identify a common pattern among the fields across different samples, then it should be possible using regex or string manipulation.

However, if there are different templates, then I would suggest going for Document Understanding.



Thanks for your advice. If I have a fixed row height but a varying number of rows, meaning each page may contain from 0 to a maximum of 3 rows, what is the best way to extract the values highlighted in yellow and those indicated by the red arrows (most important)?

@Nilochan ,

We do see a pattern in the highlighted values. Is it possible to share the sample PDF or the text extracted by the Read PDF Text activity (with Preserve Format enabled)?

We could then test it from our end and provide you with the regex expressions or a workflow solution.


PO_8210159911_20230704_030809.pdf (103.2 KB)

I’d appreciate your kind help with the sample above. I tried pasting the PO above into ChatGPT, then took the regex and validated it in regex / UiPath, but I was unable to get a regex that captures all the desired results (highlighted in yellow and indicated by the red arrows).

@Nilochan ,

For the sample you sent, we are able to get the expected output. You can validate it, test it with different samples, and let us know if it does not work.
Regex_Extract_SpecificData2.zip (26.8 KB)

However, the other format mentioned in your initial post will not be handled: it seems that one field has been added and one removed, so the extraction is not uniform. If the extraction also needs to handle different formats, we would need to know how many formats there are and then try to generalise it if possible; otherwise we would have to use Document Understanding for the various templates of the document.


Hi supermanPunch, thank you very much.

Would you mind sharing your packages?
I ran into some difficulties when trying to load the files.

My packages are as below, so I need to download the ones that are missing.

Hi supermanPunch, it’s working now!!!

The regex that you provided works perfectly!!!
I’m just wondering how you came up with this regex? I tried with ChatGPT, but was unable to get the “quoted” result haha.

image


@supermanPunch, I think this is why the regex approach you used works perfectly!!!

Your method: (i) Use the regex to get the part of the content inside the row
(ii) Use the method below to split it

image

image

I think this is awesome. I think I can apply this to other formats as long as the format is the same (despite having a different row height).
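
For anyone reading later without the screenshots, the two-step idea (capture the row content with a regex, then split it) might look roughly like this in Python. This is a guess at the general shape, not the workflow from the attached zip: the sample text, the pattern, and the choice to split on runs of two or more spaces are all assumptions.

```python
import re

# Hypothetical text as it might look with Preserve Format enabled,
# where columns are separated by two or more spaces.
text = """\
Item   Internal P/N   Description      Qty   Price
00010  ABC-001        Widget bracket   10    2.50
"""

# Step (i): capture the content of one row (everything after the item number).
row_match = re.search(r"^\d{5}\s+(?P<row>.+)$", text, re.MULTILINE)

# Step (ii): split the captured content into fields on runs of 2+ spaces,
# so that single spaces inside a value such as "Widget bracket" survive.
fields = re.split(r"\s{2,}", row_match.group("row").strip()) if row_match else []
print(fields)
```

Splitting after the capture keeps the regex itself simple, which is probably why the approach carries over to rows of different heights as long as the column order stays the same.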

