Help / Expert advice needed: PDF Table extraction (Purchase Order to Excel)

Nilochan · October 13, 2023, 3:40pm

This is my UiPath workflow, which I built by referring to a YouTube guru.

I ran into an issue when trying to use Document Understanding ML to extract the row content from the PDF table. I’ve set up the Template Manager as well.

Some of the columns are clean, meaning the extracted items are exactly the ones I need.
But some of the columns contain other information. How do I extract only the required
information? (E.g. the Description column is not clean; it contains extra information.)
The row height might change from PDF A to PDF B. How do I address this issue?
For example, some items may have a longer Description, so the height of that row is
bigger.
Previously, I tried using regex to extract the data, but it seemed difficult.

The photo below shows the area that I wish to extract from the table (especially the Internal P/N).

supermanPunch · October 13, 2023, 4:39pm

Hi @Nilochan ,

Could you let us know whether the PDF documents will always be digital? If not, then I would suggest using Document Understanding with ML Extractors (Purchase Orders ML Package/Endpoint), as Form Extractors would not be able to capture dynamic rows, whose number and position change from one input to the next.

If it is always digital, we could try regex or the Regex Extractor, but we would need some sample data to confirm whether that is really feasible. If there are definite patterns that identify the rows in the text extracted by the Read PDF Text activity, it should be possible using regular expressions.

Rodrigo_Silva · October 13, 2023, 6:02pm

If you are comfortable with it, what I normally use is the iTextSharp library in C# with Invoke Code to get all the PDF data and fields. Then you can write an algorithm to do what you need.

Nilochan · October 14, 2023, 2:03am

The PDF is a digital version. However, as you can see in the 2 photos below, the parts highlighted in yellow are the required parts. Those marked with a red arrow are not stored individually, which makes extraction with the Form Extractor difficult.

Do you think it’s possible with the help of regex?

Nilochan · October 14, 2023, 2:07am

Thank you for your suggestion. I have no knowledge of the iTextSharp library.
Is there any other approach that I can use?

E.g. something that lets the system know, for each line item, which parts are required (qty, part number, unit price), so that it can extract them intelligently even when the PDF row size changes.

Anil_G · October 14, 2023, 4:30am

@Nilochan

If you have access to AI Center, then you can train an ML model on the documents and try to extract the data, as each field has its own identifier beside it and it’s a continuous table.

Apart from that, you can go with regex, but it would be a little difficult. For this, first read all the data into a text file and then try defining each pattern that would extract the data.

Cheers

Nilochan · October 14, 2023, 6:37am

Hi Anil, thank you. I’m currently using the UiPath Community edition, so I’m not sure if I have access to AI Center. But I do have access to the API for those extractors (the Form Extractor in this case).

Regex is very difficult even though I have a fixed line pattern (00010, 00020), because the starting and ending words of these lines are not always the same.

Anil_G · October 14, 2023, 6:39am

@Nilochan

If you are on the Community edition, then you won’t have access to AI Center. You need to activate an Enterprise trial.

Yes, regex is difficult looking at the pattern.

But one thing is common to every item: the text "your material no:". You can use that as an identifier, take the two lines before it, and extract the data.

Cheers
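
A minimal Python sketch of this anchor idea, for illustration only; in UiPath the same logic would sit in an Assign or Invoke Code step. The sample text, the helper name `blocks_around_anchor`, and the exact spelling of the label are assumptions and are not taken from the actual purchase order.

```python
# Label quoted in the thread; its exact casing and spacing in the PDF text is an assumption.
ANCHOR = "your material no:"

def blocks_around_anchor(text, lines_before=2):
    """Find each anchor line and return its value plus the lines just above it."""
    # Drop blank lines so "two lines before" means two lines of actual content.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    results = []
    for i, ln in enumerate(lines):
        if ln.lower().startswith(ANCHOR):
            value = ln[len(ANCHOR):].strip()
            context = lines[max(0, i - lines_before):i]
            results.append({"material_no": value, "context": context})
    return results

# Hypothetical extracted text, standing in for the output of Read PDF Text.
SAMPLE = """\
00010 Widget bracket 5 PC 12.50
steel, zinc plated
your material no: ABC-123
00020 Spacer ring 10 PC 0.80
long description that wraps
your material no: XYZ-9
"""

found = blocks_around_anchor(SAMPLE)
for item in found:
    print(item["material_no"], "|", item["context"])
```

Because the lookup is relative to the anchor line rather than to a fixed position on the page, it only works while the label stays within the chosen number of lines after the data it describes.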

Palaniyappan · October 14, 2023, 6:53am

To extract only the required information from a PDF table using Document Understanding ML in UiPath, you can use a combination of different extraction methods.

It goes like this:

Use a taxonomy to define the structure of the table. This will help the ML model to identify the different columns in the table and the types of data that they contain.
Use anchors to identify the start and end of each row in the table. This will help the ML model to extract the data from each row accurately, even if the row height varies.
Use regular expressions to extract specific pieces of information from the extracted data. For example, you can use a regular expression to extract the product name from the Description column.

A quick example:

1. Create a taxonomy to define the structure of the table. The taxonomy should define the following:
   - The names of the columns in the table.
   - The types of data that each column contains.
2. Create a Document Understanding ML template and configure it to use the taxonomy that you created in step 1.
3. Use the Document Understanding ML activity to extract the data from the PDF table.
4. Use regular expressions to extract the specific pieces of information that you need from the extracted data.

For example

(?<productName>[A-Za-z0-9 ]+)

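This regular expression matches a run of alphanumeric characters and spaces in the Description column; the first match stops at the first character outside that set, such as a comma. The named group productName will contain the matched text, which is the product name when the name comes first.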
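
To see what this pattern captures, here is a small Python sketch. It is only an illustration: the description string is made up, and Python’s `re` module writes named groups as `(?P<name>...)`, whereas the .NET syntax used in UiPath is `(?<name>...)` as shown above.

```python
import re

# Same character class as the pattern above, in Python's named-group syntax.
PATTERN = re.compile(r"(?P<productName>[A-Za-z0-9 ]+)")

# Hypothetical Description cell; the real PO text is not shown in the thread.
description = "Hex bolt M8, stainless; ref 4471"

# search() returns only the first run of allowed characters.
first = PATTERN.search(description)
print(first.group("productName"))

# finditer() shows every run; punctuation such as "," and ";" breaks the runs apart.
all_runs = [m.group("productName") for m in PATTERN.finditer(description)]
print(all_runs)
```

In this made-up case the first match ends at the comma, so the pattern is only enough when the product name comes first and contains no punctuation.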

You can use a similar approach to extract other pieces of information from the extracted data.

If you are having difficulty using regular expressions, you can test them in the RegEx Builder that opens from the Matches, Is Match, and Replace activities.

Let us know if you need further clarification.

Cheers @Nilochan

supermanPunch · October 14, 2023, 12:22pm

@Nilochan ,

As already mentioned, you can rule out the Form Extractors because there are dynamic rows. But before moving to Document Understanding, I would suggest checking feasibility by understanding which fields need to be captured.

If we can identify a common pattern among the fields across different samples, then it should be possible using regex/string manipulation.

However, if there are different templates, then I would suggest going for Document Understanding.

Nilochan · October 14, 2023, 12:43pm

Thanks for your advice. If I have a fixed row height but a varying number of rows, meaning each page may contain 0 to at most 3 rows, what is the best way to extract the values highlighted in yellow and the ones indicated by the red arrow (most important)?

supermanPunch · October 14, 2023, 12:59pm

@Nilochan ,

We do see a pattern in the values that are highlighted. Is it possible to share the sample PDF or the text extracted by the Read PDF Text activity (with Preserve Format enabled)?

We could then test it on our end and provide you with the regular expressions or a workflow solution.

Nilochan · October 14, 2023, 1:11pm

PO_8210159911_20230704_030809.pdf (103.2 KB)

I appreciate your kind help with the above sample. I tried pasting the PO into ChatGPT to get a regex and then validating it in a regex tester and in UiPath, but I was unable to get a regex that captures all the desired results (highlighted in yellow and indicated by the red arrow).

supermanPunch · October 14, 2023, 1:47pm

@Nilochan ,

For the sample you sent, we are able to get the expected output. You could validate it and test it with different samples, and let us know if it does not work.
Regex_Extract_SpecificData2.zip (26.8 KB)

However, the other format mentioned in your initial post will not be handled: it appears to have one field added and one field removed, so the extraction is not uniform. If the extraction also needs to cover different formats, we would need to know how many formats there are and then try to generalise the solution if possible; otherwise we would have to use Document Understanding for the various templates of the document.

Nilochan · October 14, 2023, 2:09pm

Hi supermanPunch, thank you very much.

Would you mind sharing your packages?
I ran into some difficulties when trying to load the files.

My packages are shown below, so I need to download the ones that are missing.

Nilochan · October 14, 2023, 2:21pm

Hi supermanPunch, it’s working now!!!

The regex that you provided works perfectly!!!
I’m just wondering how you arrived at this regex. I tried with ChatGPT, but was unable to get the “quoted” result haha.

Nilochan · October 14, 2023, 2:36pm

@supermanPunch, I think this is the reason why the regex approach that you use works perfectly!!!

Your method: (i) use the regex to get the part of the content inside the row,
(ii) use the method below to split it.

I think this is awesome. I think I can apply this to other PDFs as long as the format is the same (despite having a different row height).
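
The two-step idea (isolate each row with a regex, then split the row into fields) can be sketched in Python as below. This is not the regex from the attached workflow, which is not shown in the thread. It assumes that each line item starts with a 5-digit position number such as 00010 at the beginning of a line, and the column order, helper names, and sample text are all hypothetical.

```python
import re

# Assumption: every line item begins with a 5-digit position number (00010, 00020, ...)
# at the start of a line, as mentioned earlier in the thread.
ROW_START = re.compile(r"^\d{5}\b", re.MULTILINE)

def split_rows(text):
    """Step (i): cut the text into one block per line item, whatever its height."""
    starts = [m.start() for m in ROW_START.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]

def parse_row(block):
    """Step (ii): split a block into fields (hypothetical column order)."""
    first_line, *rest = block.splitlines()
    fields = first_line.split()
    return {
        "item": fields[0],
        "part_no": fields[1],
        "qty": fields[2],
        "unit_price": fields[-1],
        # Any extra lines belong to the description, however many there are.
        "description": " ".join(ln.strip() for ln in rest),
    }

# Hypothetical text with rows of different height.
SAMPLE = """\
PURCHASE ORDER (header text)
00010 PN-1001 5 PC 12.50
Bracket, steel
zinc plated, extra long text
00020 PN-2002 10 PC 0.80
Spacer ring
"""

rows = [parse_row(block) for block in split_rows(SAMPLE)]
for row in rows:
    print(row)
```

Since each block runs from one position number to the next, a longer description only makes a block longer and does not shift the fields on the first line, which is why a change in row height does not break the extraction.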

system · October 17, 2023, 2:37pm

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
