Need help on extracting data from native PDF

Debartha_Mitra_DE · January 5, 2021, 10:02am

Hi,

Happy new year everyone.

I have a native PDF and am trying to extract data from it.
The content is in table format. I tried the anchor and text activities, but every time I try to indicate a piece of text, it selects the whole table.

Please check the screenshot of the PDF content.

My scenario is this: I have a portal where I need to enter these details. The portal has different text fields, such as Digital solution, Offering version, etc. If the portal has a field such as Offering version, I need to go to this PDF and extract the offering version text, which in this case is 1.1.

Any suggestions on how to extract data from it?

Thanks in advance,

Pradeep_Shiv · January 5, 2021, 10:29am

Hello @Debartha_Mitra_DE,

Use the Read PDF Text activity and the Matches activity.

To get the version number, use this pattern: (?<=Offering Version of the digital solution).*
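For anyone who wants to try the pattern outside Studio first, here is a minimal Python sketch of the same lookbehind idea. The sample text is hypothetical (the real output of Read PDF Text will differ), and the pattern is the one above, unchanged.

```python
import re

# Hypothetical sample of what Read PDF Text might return; the real text will differ.
pdf_text = (
    "Digital solution name                      Sample Solution\n"
    "Offering Version of the digital solution   1.1\n"
)

# Lookbehind: match everything that follows the label on the same line.
pattern = r"(?<=Offering Version of the digital solution).*"

match = re.search(pattern, pdf_text)

# Trim the padding between the label and the value.
version = match.group(0).strip() if match else None
print(version)
```

The match includes the spaces between the label and the value, so trimming the result is needed in either tool.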

Cheers
@Debartha_Mitra_DE

Debartha_Mitra_DE · January 5, 2021, 12:49pm

Thanks for the answer. Is there any way I can create a datatable similar to the content you can see in the PDF screenshot?

Pradeep_Shiv · January 5, 2021, 12:50pm

Yes, you can extract all the text, then use Generate Data Table and split on whitespace.

Cheers

prasath17 · January 5, 2021, 1:01pm

Hi @Debartha_Mitra_DE, if you are allowed to use CV activities, then please try CV Extract Table inside a CV Screen Scope. You need an API key for CV from your cloud account.

Debartha_Mitra_DE · January 5, 2021, 1:21pm

Thanks for the reply. As you can see below, this is how my values come out if I just print the output of the Read PDF Text activity.

Now, how do I create a data table where one column holds all the questions and the second column holds the corresponding answers?

It's like key-value pairs, so I could use a dictionary instead of a datatable as well.

Debartha_Mitra_DE · January 5, 2021, 1:23pm

Thanks for your reply. My solution is still in POC mode, so my manager won't approve the use of cloud vision activities for now.
Also, there are some restrictions on using third-party APIs in our organisation.

Pradeep_Shiv · January 5, 2021, 1:23pm

Can you try checking PreserveFormatting in the Properties panel and show us the output again?

@Debartha_Mitra_DE

Debartha_Mitra_DE · January 5, 2021, 1:34pm

Yes, you can see the output below.

NIVED_NAMBIAR · January 5, 2021, 1:38pm

Hi @Debartha_Mitra_DE

Use this activity to extract a table from a PDF:

https://epsilonai.com/how-to-extract-table-from-pdf-in-uipath

Then you can use a datatable activity, such as Lookup Data Table, to extract the field values.

Hope it helps you

Regards

Nived N

Happy Automation

Pradeep_Shiv · January 5, 2021, 1:45pm

Hi @Debartha_Mitra_DE,

Well, you can do this: use the Replace activity to replace runs of more than two whitespace characters with a special character, then use that character as the delimiter to split each line into columns.
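A rough Python sketch of that replace-then-split idea, for illustration only. The sample text, the `|` delimiter, the `split_into_columns` helper, and the threshold of two or more spaces are all assumptions; adjust them to whatever your PDF text actually looks like.

```python
import re

# Hypothetical sample text with wide gaps between the two columns.
sample_text = (
    "Digital solution name                      Sample Solution\n"
    "Offering Version of the digital solution   1.1\n"
)

# Assumed delimiter: any character that never occurs in the PDF text.
DELIMITER = "|"


def split_into_columns(text, delimiter=DELIMITER):
    """Replace wide whitespace gaps with a delimiter, then split each line on it."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: two or more consecutive spaces mark a column boundary.
        marked = re.sub(r" {2,}", delimiter, line)
        rows.append([cell.strip() for cell in marked.split(delimiter)])
    return rows


rows = split_into_columns(sample_text)
for row in rows:
    print(row)
```

Single spaces inside a label are left alone, which is why the gap threshold matters: if a label and its value are separated by only one space, this approach cannot tell them apart.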

Cheers
@Debartha_Mitra_DE

Debartha_Mitra_DE · January 5, 2021, 1:51pm

Thanks. Can you please show me a sample of the code, including the splitting into columns?

Pradeep_Shiv · January 5, 2021, 1:58pm

Sure,

try the code below and modify it accordingly:

Replace.xaml (5.5 KB)

Cheers
@Debartha_Mitra_DE

Debartha_Mitra_DE · January 5, 2021, 4:05pm

Thanks a lot for the code, it worked. But I think a dictionary will be a better option than a datatable in my case, as I need to put the extracted values into fields on a portal.

How should I modify my code so that I can store the values in a dictionary?

jrose · January 5, 2021, 10:00pm

Is the Offering Version text always the second row in the data table? If so, you can use the Read PDF Text activity and then split the text so that each line of the table becomes an element of an array. Then use Matches or string manipulation to pull the version out.

<yourPdfTextVariable>.Split(Environment.NewLine.ToCharArray, StringSplitOptions.RemoveEmptyEntries) in the Assign activity. (RemoveEmptyEntries drops the empty entries that otherwise appear between the \r and \n of each line break.)

The same technique could be used for assigning to a dictionary (key-value version) if you can find the delimiter between the cells of the table.
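As an illustration of the line-split-then-dictionary idea, here is a small Python sketch. It assumes the cells are separated by a gap of two or more whitespace characters, which may not hold for your PDF; the sample text is hypothetical.

```python
import re

# Hypothetical sample text with Windows-style line breaks.
pdf_text = (
    "Digital solution name                      Sample Solution\r\n"
    "Offering Version of the digital solution   1.1\r\n"
)

# One array element per non-empty line of the table.
lines = [ln for ln in pdf_text.splitlines() if ln.strip()]

fields = {}
for line in lines:
    # Assumed cell delimiter: the first run of two or more whitespace characters.
    parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
    if len(parts) == 2:
        key, value = parts
        fields[key] = value

print(fields["Offering Version of the digital solution"])
```

Lines that do not split into exactly two parts are skipped rather than guessed at, so it is worth checking which rows end up missing from the dictionary.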

Debartha_Mitra_DE · January 6, 2021, 8:33am

Thanks for your response. I created an Excel sheet from the data table (using the code shared by @Pradeep_Shiv), but everything was populated into one row in Excel. Check below.

I have also shown the data table in a message box for your reference, please see below.

I think a dictionary will be better in my case. Can you please share the code for the dictionary?

Thanks in advance.

Debartha_Mitra_DE · January 6, 2021, 9:57am

Please let me know if I can use a dictionary to store the extracted data rather than a datatable.

Pradeep_Shiv · January 6, 2021, 10:04am

You can do it by extracting each individual string with the Matches activity, as I mentioned earlier, and using the Add Dictionary activity

or

create a variable of type System.Collections.Generic.Dictionary(Of String, String) with the default value New Dictionary(Of String, String),

then use an Assign activity like this: dict_Var(yourKeyVariable) = Matchedregex(0).ToString

Use this method to extract all the required details: Need help on extracting data from native PDF - #2 by Pradeep_Shiv
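The same flow, sketched in Python for illustration: one lookbehind match per label, with the trimmed result stored under that label. The sample text and the `extract_fields` helper are hypothetical; in Studio the equivalent would be one Matches activity plus one Assign per field.

```python
import re

# Hypothetical sample text; the labels come from the screenshot discussion.
pdf_text = (
    "Digital solution name                      Sample Solution\n"
    "Offering Version of the digital solution   1.1\n"
)

labels = [
    "Digital solution name",
    "Offering Version of the digital solution",
]


def extract_fields(text, labels):
    """Return {label: value} for every label found in the text."""
    result = {}
    for label in labels:
        # Same lookbehind idea as before, built per label; re.escape guards
        # against labels that contain regex metacharacters.
        match = re.search(r"(?<=" + re.escape(label) + r").*", text)
        if match:
            result[label] = match.group(0).strip()
    return result


portal_values = extract_fields(pdf_text, labels)
print(portal_values)
```

Labels that are not found are simply left out of the dictionary, so check for the key before typing a value into the portal.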

Cheers
@Debartha_Mitra_DE

Debartha_Mitra_DE · January 6, 2021, 1:00pm

Thanks. So I suppose I need to use a For Each loop over Matchedregex that assigns every matched value to the dictionary?

Pradeep_Shiv · January 6, 2021, 1:01pm

Yes, you can do that!

Cheers
