How to extract unstructured data from PDF and convert it into a data table, with a loop that reads all kinds of data tables

aarun · November 15, 2018, 3:32am

The loop I create should also read all kinds of PDFs and convert them into an Excel table.

aarun · November 15, 2018, 3:48am

indra · November 15, 2018, 3:55am

@aarun Can you share a sample PDF?

aarun · November 15, 2018, 4:29am

Unfortunately I can't share the documents; they are confidential.
Use any document that contains unstructured data in a table format, similar to my case.

Rishi1 · November 15, 2018, 4:59am

@aarun Share a dummy PDF so that we can see what data you want from it. Don't share your confidential one, just a dummy one.

megharajky · November 16, 2018, 7:38pm

I was recently working on a task where we had to extract a table from PDF to Excel. I did this by calling an API and getting a CSV/HTML result back from it.
Does your query sound similar?

Thanks,
Meg

ClaytonM · November 16, 2018, 8:56pm

@aarun

You might consider some third-party tools from UiPath partners, like ABBYY FlexiCapture. The tool is very powerful but has a steep learning curve. Essentially, it does what you would otherwise do by writing some complicated string manipulations.

By that I mean, each PDF is different, so you need to use key words, and for each key word you need to store the direction in which the value you are looking for lies. You also need to store the column headers so the workflow can detect how many columns there are. The trickiest part is handling columns that contain multiple words, which FlexiCapture handles more easily.
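To make the key word and direction idea concrete, here is a minimal sketch in Python rather than a UiPath workflow. The function name `find_value`, the two directions ("right" and "below"), and the sample text are all hypothetical, and it assumes the key word is a single whitespace-separated token.

```python
def find_value(text, keyword, direction="right"):
    """Return the token next to `keyword` in the given direction, or None.

    Hypothetical helper: "right" means the next token on the same line,
    "below" means the token at the same position on the following line.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        tokens = line.split()
        if keyword not in tokens:
            continue
        pos = tokens.index(keyword)
        if direction == "right" and pos + 1 < len(tokens):
            return tokens[pos + 1]
        if direction == "below" and i + 1 < len(lines):
            # Assumes the next line has the same token layout as this one.
            below = lines[i + 1].split()
            if pos < len(below):
                return below[pos]
    return None


sample = "Invoice: 12345\nDate Total\n2018-11-15 99.50"
invoice_number = find_value(sample, "Invoice:", "right")
total = find_value(sample, "Total", "below")
```

A real document would probably need more directions and multi-word key words, which is where this starts to get complicated.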

It definitely can be done, though, using string manipulation after a Read PDF Text activity. You would need to create a workflow that knows what format pattern each value should follow and how many words it has. For example, if you know a column with multiple words sits between a column that holds a long integer and one that holds a decimal amount, then you can easily join those words into that column. That is why it can get complicated.
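As a rough illustration of the integer / multi-word / decimal example, here is a sketch in Python using a regular expression. The row layout, the pattern, and the name `split_row` are assumptions made up for this example, not part of any UiPath activity.

```python
import re

# Assumed row layout (hypothetical):
#   <integer id> <description of one or more words> <decimal amount>
ROW_PATTERN = re.compile(r"^\s*(\d+)\s+(.+?)\s+(-?\d[\d,]*\.\d+)\s*$")


def split_row(line):
    """Split one text line into [id, description, amount].

    Returns None when the line does not follow the assumed layout,
    for example a heading or a blank line.
    """
    match = ROW_PATTERN.match(line)
    if match is None:
        return None
    return [match.group(1), match.group(2), match.group(3)]


rows = [
    split_row("10045 Blue widget large 1,250.00"),
    split_row("10046 Gasket 3.75"),
]
```

The integer at the start and the decimal at the end anchor the row, so everything between them is joined into the middle column no matter how many words it has.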

If your PDFs only have one word in each column, then the manipulation is usually pretty simple. For example, you could extract the block of data you need by taking the text between the column headers (stored in an array variable) and a key word that identifies the end of the data, then convert all spaces in that block to commas. You now have a comma-delimited data set that can be written to CSV using the Write Text File activity.
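The single-word case could look something like this in Python. This is only a sketch: `block_to_csv`, the sample headers, and the "Total" end key word are hypothetical, and it assumes no value contains a space or a comma.

```python
import re


def block_to_csv(text, headers, end_keyword):
    """Return CSV text for the rows between the header line and end_keyword."""
    lines = [line.strip() for line in text.splitlines()]
    # The header line is the first line that contains every column header.
    start = next(
        (i for i, line in enumerate(lines)
         if all(h in line.split() for h in headers)),
        None,
    )
    if start is None:
        raise ValueError("header line not found")
    csv_lines = [",".join(headers)]
    for line in lines[start + 1:]:
        if end_keyword in line:
            break  # reached the key word that marks the end of the data
        if line:
            # Collapse each run of whitespace into a single comma.
            csv_lines.append(re.sub(r"\s+", ",", line))
    return "\n".join(csv_lines) + "\n"


pdf_text = "Invoice 123\nItem Qty Price\nBolt 4 2.50\nNut 10 0.10\nTotal 11.00\n"
csv_text = block_to_csv(pdf_text, ["Item", "Qty", "Price"], "Total")
```

The resulting string can then be written to a .csv file, which is the equivalent of the Write Text File step.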

So I guess what I’m saying is, you can create a workflow that does this and even handles multiple words, but it gets very complicated when you try to make it work for all PDFs. I do have a workflow component I wrote ages ago, but I don’t feel I can share it since it was written for a project within my company, and it isn’t even being used yet.

Alternatively, you might be able to see the data as an element by opening the PDF with Assistive Technology (Ctrl+Shift+5, I think), but I can’t say I’ve done this successfully.

Also, if the PDF is an image, don’t bother with the OCR tools provided in UiPath. Getting the scale and accuracy set right would be very difficult, since each PDF could be shifted slightly because it is an image. FlexiCapture, however, handles OCR really well, and you can even tell it which characters to expect, for example only numbers or a specific set of characters.

So those are my thoughts and tips on the topic.

The reason everyone was asking for an example is that each PDF is different, and you would need to create a different extraction for every PDF model. But, as I explained in this post, you can create something very robust by using key words and directions, and by passing the column headers as an argument so the workflow knows how to form all the columns and where the header row is in the file. Most of this is how FlexiCapture works, though there is some impressive coding behind it that makes it more robust. (FYI, we only went through a trial of FlexiCapture and haven’t actually been using it.) Most of our PDF extractions use some string manipulation to form the data we need.

Regards!

niteshbutola5 · May 9, 2019, 7:18am

Hi @ClaytonM

Does FlexiCapture work well for extracting unstructured data from email?
I have three scenarios for extracting unstructured data from email.
When I install FlexiCapture, I get this error.
Should I ignore this error?

thanks

Nitesh

priyankavivek · May 9, 2019, 8:48am

You need to buy ABBYY FlexiCapture to get a licensed product. Refer to this link to learn more about the product:
https://abbyy.technology/en:products:fc:start

niteshbutola5 · May 9, 2019, 10:11am

Hey @priyankavivek

Thanks for the advice,
but I want to know whether there is any other way to get it done for free,
I mean with regex, string manipulation, or any other technique.
Please let me know.

thanks again

ClaytonM · May 9, 2019, 2:20pm

For email bodies, it might be easier to take the HTML of the body and then convert that HTML to a table, either with your own workflow or with the community activity package:

https://forum.uipath.com/search?q=get%20body%20in%20html

Regards.

ClaytonM · May 9, 2019, 2:22pm

If you take the body as HTML, you can replace the td and tr tags with delimiters to form comma-delimited CSV. This is how you would do it if you created your own workflow; otherwise, use the activity I mentioned in my previous post.
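For anyone who wants to see the td/tr idea spelled out, here is a small sketch in Python. The function name `html_table_to_csv` and the sample HTML are made up for illustration, and the regular expressions assume a simple, well-formed table with no nested tables; a proper HTML parser would be more robust for messy email bodies.

```python
import csv
import html
import io
import re


def html_table_to_csv(body_html):
    """Turn every <tr> into a CSV line and every <td>/<th> into a field."""
    flags = re.IGNORECASE | re.DOTALL
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for row in re.findall(r"<tr\b[^>]*>(.*?)</tr>", body_html, flags):
        cells = re.findall(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", row, flags)
        cleaned = []
        for cell in cells:
            text = re.sub(r"<[^>]+>", "", cell)  # drop tags nested in the cell
            text = html.unescape(text)           # e.g. &amp; becomes &
            cleaned.append(" ".join(text.split()))
        # csv.writer quotes any field that itself contains a comma.
        writer.writerow(cleaned)
    return output.getvalue()


body = (
    "<table><tr><th>Name</th><th>Amount</th></tr>"
    "<tr><td>Bolt</td><td><b>1,250.00</b></td></tr></table>"
)
csv_text = html_table_to_csv(body)
```

Using the csv module instead of a plain string replace keeps values that contain commas in one field.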
