I have multiple PDF files from which I have to extract specific text. The text is not always in the same format, but there is a specific keyword that identifies what to extract. Since the formats vary, how can we achieve this?
Here "Milestone" is the keyword. Sometimes the text is in tabular format, sometimes it is not. It is mostly under the Performance and Milestone section, but sometimes it is just a text description.
Hi @dutta.marina, if the keyword is fixed, you can use string manipulation and pass the fixed keyword as a reference.
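For example (passing the keyword inside {...} makes Split treat it as one whole-string separator):
pdfOutput.Split({"ReferenceKeyword"}, StringSplitOptions.None)(1)
If you have to extract the data between two references:
pdfOutput.Split({"ReferenceKeyword"}, StringSplitOptions.None)(1).Split({"ReferenceSecond"}, StringSplitOptions.None)(0).Trim
Change the references and index positions as needed. Note that index (1) throws an error if the keyword is not present in the text.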
cheers
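The split-between-two-keywords idea above can be sketched outside UiPath as a small Python function that also handles a missing keyword, which matters here because not every PDF has the same layout. This is only an illustration under assumptions: the function name `extract_between` and the sample text are hypothetical and not taken from the actual PDFs.

```python
from typing import Optional


def extract_between(text: str, start_key: str, end_key: Optional[str] = None) -> Optional[str]:
    """Return the text after start_key, cut at end_key when it is given and found.

    Returns None when start_key is absent, instead of raising an index error.
    """
    start = text.find(start_key)
    if start == -1:
        return None  # keyword not present in this document
    start += len(start_key)

    if end_key is None:
        return text[start:].strip()

    end = text.find(end_key, start)
    if end == -1:
        # Second reference missing: fall back to everything after the first one.
        return text[start:].strip()
    return text[start:end].strip()


# Hypothetical sample text, for illustration only.
sample = "Intro text Milestone Deliver report by June Payment Terms Net 30"
print(extract_between(sample, "Milestone", "Payment Terms"))
```

Returning None for a missing keyword lets the workflow route such documents to a different extraction path rather than failing.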
I have PDF files from which I have to extract specific information (milestone details) from the milestone table in the Performance and Milestone section. The milestone details come in two different formats, shown below. How can I achieve this using Document Understanding or the Regex Extractor?
I need to capture the Brief description, Amount, and Due Date.
First format of PDF files
Since this is a properly segregated table, you can try a DU model and train it to extract the table data. You can also classify the documents before extraction so that each format is handled separately.
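If a regex route is preferred for rows that come through as plain text, a pattern with named groups can pull out the three fields. The sketch below is in Python and rests on a loud assumption: each milestone row is a single line of the form "description amount dd/mm/yyyy". That layout, the sample text, and the names `ROW` and `parse_milestones` are hypothetical, since the real PDF layouts are not shown in this thread, so the pattern would need adjusting to the actual text.

```python
import re

# Assumed (hypothetical) row layout: "<description> <amount> <dd/mm/yyyy>" on one line.
ROW = re.compile(
    r"^(?P<description>.+?)[ \t]+"                     # free text, as short as possible
    r"(?P<amount>[$€£]?\d[\d,]*(?:\.\d{2})?)[ \t]+"    # e.g. 10,000.00 with optional currency sign
    r"(?P<due_date>\d{1,2}/\d{1,2}/\d{4})[ \t]*$",     # e.g. 15/03/2024
    re.MULTILINE,
)


def parse_milestones(section_text: str) -> list:
    """Return one dict per line that matches the assumed row layout."""
    return [m.groupdict() for m in ROW.finditer(section_text)]


# Hypothetical sample section, for illustration only.
sample = """Performance and Milestone
Brief description Amount Due Date
Design sign-off 10,000.00 15/03/2024
Final delivery 25,000.00 30/06/2024
"""

for row in parse_milestones(sample):
    print(row["description"], "|", row["amount"], "|", row["due_date"])
```

Header lines are skipped because they contain no amount or date. In .NET regex syntax, named groups are written `(?<name>...)` rather than `(?P<name>...)`.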