Extract data from unstructured PDFs and make it structured

Hi all, I just came across this software, which seems amazing, but before looking at it in detail I want to ask whether it can accomplish my task.

I have a series of PDF documents (flyers from different stores). Each page of each flyer can contain a different number of products, and the layout of the listed products also changes. I want to extract info such as Title / Description / Image / Price for each product and create a single record (Excel/CSV/py/XML/JSON) for each of these sets.

Is this possible with UiPath? And above all, is it scalable? Are there any real examples I can see?

Many thanks in advance.
Marco.

Hello @Marco2,
you can try the Document Understanding Framework to extract data from the flyers, but I don’t think it will work across different templates; you would need to build most of the templates in that framework yourself.

If you only need to capture something that is in table format, you could also try building a Regex for your solution.
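
To illustrate the Regex idea, here is a minimal sketch in Python rather than a UiPath workflow. It assumes the flyer text has already been extracted from the PDF, and that each product sits on one line as a title, a gap of two or more spaces, and a price. The sample text, the pattern, and the function names are all hypothetical; real flyers will likely need a different pattern per layout.

```python
import csv
import io
import re

# Hypothetical sample of text already extracted from one flyer page.
SAMPLE_TEXT = """\
Organic Bananas 1kg        1.99
Whole Milk 1L              0.89
Special offer this week only
Olive Oil Extra Virgin     12.50
"""

# Assumed row shape: a title, at least two spaces, then a price
# written like 1.99 or 12,50 at the end of the line.
ROW_PATTERN = re.compile(r"^(?P<title>.+?)\s{2,}(?P<price>\d+[.,]\d{2})\s*$")


def extract_rows(text):
    """Return one dict per line that matches the assumed row shape."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line)
        if match:
            rows.append({
                "title": match.group("title").strip(),
                # Normalise a decimal comma to a decimal point.
                "price": float(match.group("price").replace(",", ".")),
            })
    return rows


def rows_to_csv(rows):
    """Serialise the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_TEXT)
print(rows_to_csv(rows))
```

Lines that do not fit the pattern (such as the promotional sentence in the sample) are simply skipped, which is also why this approach breaks down once the layout changes from flyer to flyer.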

I don’t know if you have examples of those flyers, but UiPath recommends against using unstructured documents that constantly change; you can check that in UiPath Academy.

Cheers,
Dino

Hello Marco,
In this video, I cover 17 use cases for extracting tables from PDFs and writing the data to Excel:

2:00 GitHub free code for all the files
2:20 Logic of general workflow
4:40 File 1 simple PDF
9:50 File 2 PDF with a column with multiple lines
20:10 File 3 PDF with multiple words in the last column
27:00 File 5 PDF with multiple words in an inside column (2 columns)
31:40 File 6 PDF with a column with multiple lines
39:10 File 8 simple PDF
42:15 File 9 PDF with multiple spaces that need to be corrected
45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
55:50 File 11 simple PDF with protection for empty cells
58:35 File 12 big PDF with an empty line, empty columns and a partial total
1:02:25 File 13 PDF with multiple columns that have multiple words and hard to define a rule
1:10:15 File 15 PDF with multiple columns that have multiple lines
1:12:50 File 17 simple PDF, removing spaces from headers and from data
1:16:05 File 18 simple PDF
1:17:10 File 19 PDF with multiple pages and columns with multiple lines
1:22:10 File 20 PDF with multiple columns that have multiple lines
1:25:00 File 21 PDF with empty columns and subtotal

Code:

Thanks,
Cristian Negulescu