Need help on extracting data from native PDF

Hi,

Happy new year everyone.

I have a native PDF and am trying to extract data from it.
The content is in table format. I tried using anchor and text activities to extract the data, but every time I try to indicate a piece of text, it selects the whole table.

Please check the screenshot of the pdf content.

My scenario is this: I have a portal where I need to enter these details. The portal has different text fields, such as Digital solution, Offering version, etc. If the portal has a field such as Offering version, I need to go to this PDF and extract the offering version text, which in this case is 1.1.

Any suggestions on how to extract this data?

Thanks in advance,

Hello @Debartha_Mitra_DE,

Use the Read PDF Text activity and the Matches activity.

To get the version number, use this pattern: (?<=Offering Version of the digital solution).*
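For illustration, here is a minimal Python sketch of what that lookbehind pattern does. The sample text is a hypothetical stand-in for the Read PDF Text output (only the "Offering Version" label and the value 1.1 come from this thread), and the trimming step is an assumption about how the padding between label and value should be handled.

```python
import re

# Hypothetical stand-in for the Read PDF Text output; the first row is made up.
pdf_text = (
    "Name of the digital solution        Sample Solution\n"
    "Offering Version of the digital solution        1.1\n"
)

# Same pattern as above: everything after the label, up to the end of the line.
pattern = r"(?<=Offering Version of the digital solution).*"

match = re.search(pattern, pdf_text)
# The match still contains the spaces between label and value, so trim it.
offering_version = match.group(0).strip() if match else None
print(offering_version)
```

In UiPath the equivalent trimming would likely be a `.Trim` on the first match, since the raw match keeps the leading spaces.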

Cheers
@Debartha_Mitra_DE

Thanks for the answer. Is there any way I can create a datatable similar to the content you can see in the PDF screenshot?

Yes. You can extract all the text, then use the Generate Data Table activity and split on whitespace.

Cheers

Hi @Debartha_Mitra_DE, if you are allowed to use CV activities, then please try CV Extract Table inside a CV scope. You will need an API key for CV from your cloud account.

Thanks for the reply. As you can see below, this is how my values come out if I just print the output of the Read PDF Text activity.

Now, how do I create a datatable where one column holds all the questions and the second column holds the corresponding answers?

It is like a key-value pair, so I could use a dictionary instead of a datatable as well.

Thanks for your reply. My solution is currently still in POC mode, so my manager won't approve the use of cloud vision activities for now.
Also, there are some restrictions on using third-party APIs in our organisation.

Can you try checking Preserve Format in the Properties panel and show us the output again?

@Debartha_Mitra_DE

Yes, you can see the output below

Hi @Debartha_Mitra_DE

Use this activity to extract a table from a PDF:

How to extract tables from PDF with UiPath - EpsilonAI

Then you can use datatable activities, such as the Lookup Data Table activity, to extract the field values.

Hope it helps you

Regards

Nived N :robot:

Happy Automation :relaxed::relaxed::relaxed::relaxed:

Hi @Debartha_Mitra_DE,

You can do this: use the Replace activity to replace runs of more than two whitespace characters with a special character, then use that character as the delimiter to split each line into columns.
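The replace-and-split idea can be sketched in Python as follows. This is only an illustration under assumptions: the sample lines are hypothetical, the `|` delimiter is assumed never to appear in the PDF text, and the threshold of three or more spaces should be tuned to the actual Preserve Format output.

```python
import re

# Hypothetical lines as they might look with Preserve Format enabled.
lines = [
    "Name of the digital solution        Sample Solution",
    "Offering Version of the digital solution        1.1",
]

DELIMITER = "|"  # assumed not to occur anywhere in the PDF text

rows = []
for line in lines:
    # Collapse each run of more than two spaces/tabs into one delimiter.
    marked = re.sub(r"[ \t]{3,}", DELIMITER, line.strip())
    # Split on the delimiter to get the cells of this row.
    rows.append(marked.split(DELIMITER))

for row in rows:
    print(row)
```

Single spaces inside a cell are left alone, which is why multi-word labels stay in one column.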

Cheers
@Debartha_Mitra_DE

Thanks. Can you please show me a sample of the code, including the splitting into columns?

Sure,

Try the code below and modify it as needed:

Replace.xaml (5.5 KB)

Cheers
@Debartha_Mitra_DE

Thanks a lot for the code, it worked. However, I think a dictionary will be a better option than a datatable in my case, as I need to put the extracted values into fields on a portal.

How should I modify the code so that the values are stored in a dictionary?

Is the Offering Version text always the second row of the table? If so, you can read the text with the Read PDF Text activity and use Split to assign each line of the table to an index of an array. Then use Matches or string manipulation to pull the version out.

Use <yourPdfTextVariable>.Split(Environment.NewLine.ToCharArray) in an Assign activity.

The same technique could be used to fill a dictionary (the key-value version) if you can find the delimiter between the cells of the table.
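A rough Python sketch of the line-splitting approach, extended to a dictionary. The sample text is hypothetical, and the cell delimiter is assumed to be a run of two or more whitespace characters, which may not match the real PDF output.

```python
import re

# Hypothetical Read PDF Text output with Windows-style line endings.
pdf_text = (
    "Name of the digital solution        Sample Solution\r\n"
    "Offering Version of the digital solution        1.1\r\n"
)

# One entry per table row; blank entries are dropped.
lines = [line for line in pdf_text.splitlines() if line.strip()]

fields = {}
for line in lines:
    # Assumed delimiter: two or more whitespace characters between the cells.
    parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
    if len(parts) == 2:
        key, value = parts
        fields[key] = value

print(fields["Offering Version of the digital solution"])
```

Note that in .NET, splitting on `Environment.NewLine.ToCharArray` treats the carriage return and line feed as separate delimiters, so empty entries appear between lines and would need to be skipped in the same way.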

Thanks for your response. I have created an Excel sheet from the datatable (using the code shared by @Pradeep_Shiv), but everything has been populated in one row in the Excel file. Check below.

I have also shown the datatable in a message box for your reference, please see below.

I think a dictionary will be better to use in my case. Can you please share the code for the dictionary?

Thanks in advance.

Please let me know if I can use a dictionary to store the extracted data rather than a datatable.

You can do this by extracting each individual string with the Matches activity, as I mentioned earlier, and then using the Add Dictionary activity,

or

create a dictionary variable of type System.Collections.Generic.Dictionary(Of String, String) with the default value New Dictionary(Of String, String),

then use an Assign activity like this: dict_Var(yourKeyVariable) = Matchedregex(0).ToString

Use the method from my earlier reply to extract all the required details: Need help on extracting data from native PDF - #2 by Pradeep_Shiv
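As an illustration only, the regex-per-field idea could look like this in Python. The field labels other than "Offering Version of the digital solution" and the sample text are hypothetical; in UiPath the loop body would correspond to a Matches activity followed by the Assign shown above.

```python
import re

# Hypothetical Read PDF Text output; only the second label and 1.1 are from the thread.
pdf_text = (
    "Name of the digital solution        Sample Solution\n"
    "Offering Version of the digital solution        1.1\n"
)

# Labels to look up, one per portal field (the first one is made up).
labels = [
    "Name of the digital solution",
    "Offering Version of the digital solution",
]

fields = {}
for label in labels:
    # Build the lookbehind pattern for this label; escape it in case it
    # contains characters that are special in a regex.
    pattern = r"(?<=" + re.escape(label) + r").*"
    match = re.search(pattern, pdf_text)
    if match:
        # Keep the first match, trimmed of the padding after the label.
        fields[label] = match.group(0).strip()

print(fields)
```

Keying the dictionary by label makes it straightforward to look up the value for whichever field the portal shows.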

Cheers
@Debartha_Mitra_DE

Thanks. So I suppose I need to use a For Each loop over Matchedregex, which will go through the matches and assign each matched value to the dictionary?

Yes, you can do that!

Cheers