Let’s say I have some native PDFs from which I have to extract a field value based on a keyword. The problem is that the incoming PDF documents sometimes have the same key but different layouts, and the key name itself can vary. For example, if I have to extract the invoice number field, the key can be present as “invoice number” or “invoice no.” and so on. Can any Invoke Code method solve it?
Let’s say I have to extract the value of invoice name, but invoice name could be present anywhere (dynamic) in the page. I also have to extract the description field under the “description” key, which is a header. The problem is that when the next PDF comes, the position of the description key could have changed, so how can I extract the values based on keywords?
Hi @Siddharth_Kumar1 ,
Interesting problem, I can think of 2 solutions for this -
Create regular expressions for each of the different variations and use the most suitable expression for the invoice at hand (which can be identified either from the file-naming convention or, if you receive the files by email, from the sender).
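As a rough sketch of the regex idea, one pattern with alternation can cover several label variants and find the key wherever it sits in the extracted text. This assumes the PDF text has already been read into a plain string; the label variants, the value format, and the function name `find_invoice_number` are illustrative assumptions, not something from the original posts. It is written in Python, but the same pattern should carry over to other regex engines with little change.

```python
import re

# Assumed label variants: "invoice number", "invoice no", "invoice no.", "invoice #".
# Extend the alternation as new layouts show up.
INVOICE_NO = re.compile(
    r"invoice\s*(?:number|no\b\.?|#)\s*[:\-]?\s*([A-Za-z0-9][A-Za-z0-9\-/]*)",
    re.IGNORECASE,
)

def find_invoice_number(text):
    """Return the first value that follows an invoice-number label, or None."""
    match = INVOICE_NO.search(text)
    return match.group(1) if match else None

sample = "Acme Ltd\nDate: 2024-01-01\nInvoice No. INV-1001\nTotal: 10"
print(find_invoice_number(sample))
```

Because `search` scans the whole string, the position of the label on the page does not matter, only that the value follows the label in the extracted text.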
The client doesn’t have an Orchestrator or AI Center license; they run the bot manually. Is there any way I can extract a value based on keyword searching? Regex and substring won’t work when the key could be placed anywhere in the document. How do I extract the value based on the key? And how can I extract the description value when the position of the header changes on the page? Really confused, bro.
Is there any way to extract a value based on a keyword, so that if I search for the keyword I can extract the value wherever it is present? Any solution like that, bro? Through a programming language such as Python, C# or Java, or any other way?
Bro, the same file can come with a different layout, and the positions of the keywords would change (a keyword could be on any page, or after a string that was not present previously). The file-naming convention would stay the same, so how do I extract then? Is there any method to extract based on searching for keywords in the PDF, so that whichever file comes and wherever the keyword is, it will extract the value?
As suggested by @Nishant_Banka1, it is better to go with the regex option if you cannot include Document Understanding in the process.
Building the process with regex is not easy. You have to get a number of PDF files from the client for one customer or vendor, observe how the PDF template changes between them, and then build the regex to get the data from all of those PDFs.
You have to do the same for every vendor / customer: create a template for each one, and invoke that vendor’s xaml file based on the file name or the sender’s email ID.
As it’s not a simple process, it will need a lot of future enhancement. A new type of PDF can arrive for a vendor / customer that is already in your list, so you have to add some extra regex for that template. If a new vendor / customer comes along, you have to add it to your list, build the regex for it and use it.
For all of this you can use the Switch activity, and in each case invoke that vendor’s template xaml and get the data from it.
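To illustrate the per-vendor dispatch described above, here is a minimal Python sketch in which a dictionary lookup plays the role of the Switch activity and each entry stands in for one vendor’s template. The vendor names (`acme`, `globex`), the field names, and the patterns are all hypothetical placeholders; real ones would come from studying each vendor’s sample PDFs.

```python
import re

# Hypothetical per-vendor templates: field name -> regex with one capture group.
VENDOR_PATTERNS = {
    "acme": {
        "invoice_number": r"invoice\s+no\.?\s*:?\s*(\S+)",
        "total": r"total\s*:?\s*([\d.,]+)",
    },
    "globex": {
        "invoice_number": r"inv\s*#\s*(\S+)",
        "total": r"amount\s+due\s*:?\s*([\d.,]+)",
    },
}

def extract_fields(vendor, text):
    """Pick the vendor's template and apply each of its patterns to the text."""
    patterns = VENDOR_PATTERNS.get(vendor.lower())
    if patterns is None:
        # Unknown vendor: a new template has to be added first.
        raise KeyError(f"no template registered for vendor {vendor!r}")
    result = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text, re.IGNORECASE)
        result[field] = match.group(1).strip() if match else None
    return result

print(extract_fields("Acme", "Invoice No: A-17\nTotal: 120.50"))
```

Adding a new vendor, or a new layout for an existing one, then means adding or extending an entry in the table rather than touching the extraction logic.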
You can use regex to extract data from the PDF. To check whether you are selecting the right data, test your pattern on Regex101.
Then store the extracted data in a variable using the Assign activity.
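For the “description” header that moves around between PDFs, one possible approach is to locate the header line by keyword and then collect the lines beneath it until a stop keyword appears. The sketch below is in Python and assumes the PDF text is already available as a string with one table row per line; the stop words and the function name `extract_description_lines` are assumptions made for illustration.

```python
import re

def extract_description_lines(text, stop_words=("subtotal", "total")):
    """Collect the rows under the first line that starts with 'Description'."""
    lines = text.splitlines()
    start = None
    for index, line in enumerate(lines):
        # The header can sit on any line; only its keyword matters.
        if re.match(r"\s*description\b", line, re.IGNORECASE):
            start = index + 1
            break
    if start is None:
        return []  # header keyword not found
    rows = []
    for line in lines[start:]:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines inside the table
        if stripped.lower().startswith(stop_words):
            break  # assumed end-of-table marker
        rows.append(stripped)
    return rows

sample = "Invoice No: 1\nDescription  Qty  Price\nWidget  2  5.00\nGadget  1  3.00\nTotal: 13.00"
rows = extract_description_lines(sample)
print(rows)
```

The result is held in the variable `rows`, which is the equivalent of assigning the match to a variable in the workflow. If the extracted text does not keep one row per line, this approach would need adjusting.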