PDF Data Extraction using Regular Expression

NiranjanKN · August 4, 2019, 5:37am

Hey all,
I need to extract the GSTIN of SpiceJet Limited, the Invoice No, the GSTIN / UIN of Customer and the Total Invoice Value from a PDF using regular expressions.
I've built the attached XAML file, but I couldn't remove the colon from the extracted data: Sequence7.xaml (9.3 KB)

Thanks in advance.

Manjuts90 · August 4, 2019, 6:19am

@NiranjanKN Try using the Replace method of the string to strip the colon.
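As a rough illustration of that suggestion, here is the same idea in Python rather than in a UiPath/VB.NET expression. The input value is made up for the example and is not taken from the actual PDF.

```python
# Hypothetical extracted value that still carries the leading colon.
raw_value = ": 06AAAAA0000A1Z5"  # made-up GSTIN-like string, for illustration only

# Remove the colon, then trim the leftover whitespace.
cleaned = raw_value.replace(":", "").strip()
print(cleaned)
```

In a UiPath Assign activity the equivalent would presumably be a chained Replace and Trim call on the extracted string.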

Manjuts90 · August 4, 2019, 6:33am

@NiranjanKN Check the XAML file below.

Sequence7.xaml (12.6 KB)

Output file:

Out_Sample.xlsx (7.2 KB)

NiranjanKN · August 4, 2019, 6:35am

@Manjuts90 @lakshman @ClaytonM @n I used (?>UIN of Customer:) (.*)(?=Place of Supply) and got the value of GSTIN / UIN of Customer. But if I use (?>Invoice No:)(.*), it fetches the entire row.
I need only the Invoice Number, not the Original Invoice Number.
How can that be extracted?

NiranjanKN · August 4, 2019, 6:48am

@Manjuts90 Can you please explain this expression that you used?
ReadPDF.Split({"GSTIN of SpiceJet Limited : "},StringSplitOptions.RemoveEmptyEntries)(1).Split({Environment.NewLine},StringSplitOptions.RemoveEmptyEntries)(0).Trim

Manjuts90 · August 4, 2019, 7:13am

@NiranjanKN

The first Split call splits the whole PDF text into an array of two elements: the first element contains the text before "GSTIN of SpiceJet Limited : " and the second element contains the text after it.
I took the second element, index (1), because it contains the text I need for further processing. That element spans multiple lines, so I split it again on the newline, which gives a new array with one line per element.
The value is in the first element of that new array, so I took index (0). Trim removes any extra spaces at the start and end of the string.

If you still have any doubts, let me know.
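The steps above can be sketched in Python as follows. This is only an illustration of the split, index, split, trim chain, not the original VB.NET expression; the sample text and the GSTIN value are invented, and the list comprehensions stand in for StringSplitOptions.RemoveEmptyEntries.

```python
# Made-up stand-in for the text returned by the Read PDF activity.
read_pdf = (
    "Tax Invoice\n"
    "GSTIN of SpiceJet Limited : 06AAAAA0000A1Z5\n"
    "Place of Supply : Delhi\n"
)

label = "GSTIN of SpiceJet Limited : "

# Step 1: split on the label and drop empty pieces.
parts = [p for p in read_pdf.split(label) if p]

# Step 2: element 1 is everything after the label.
after_label = parts[1]

# Step 3: split that text into lines, dropping empty ones.
lines = [ln for ln in after_label.splitlines() if ln]

# Step 4: the value is on the first line; strip surrounding spaces.
gstin = lines[0].strip()
print(gstin)
```

One thing to keep in mind with this approach: if the label happens to be the very first text in the document, removing empty entries leaves only one element, and index 1 would then be out of range.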

NiranjanKN · August 4, 2019, 11:18am

@Manjuts90 Can you explain how this works?
System.Text.RegularExpressions.Regex.Match(ReadPDF,"(?<=Invoice No: ).+").ToString
I tried it as well, but since "Invoice No:" appears twice in the text, how did it select the second "Invoice No:" and not the first?

Manjuts90 · August 4, 2019, 1:28pm

@NiranjanKN I wrote the pattern as shown below, with a space after "No:". In the first occurrence there is no space after the ":", so the pattern does not match there and only the second occurrence is picked up.

"(?<=Invoice No: ).+"
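To see the effect of that trailing space, here is a small Python sketch using the same lookbehind pattern. The sample text and both invoice numbers are hypothetical; the only assumption carried over from the thread is that the first "Invoice No:" has no space after the colon and the second one does.

```python
import re

# Invented sample: the first label has no space after the colon, the second does.
read_pdf = (
    "Original Invoice No:OLD999\n"
    "Invoice No: DL1234567\n"
)

# Lookbehind without the space matches at the first occurrence.
without_space = re.search(r"(?<=Invoice No:).+", read_pdf).group(0)

# Lookbehind with the space skips the first occurrence and matches the second.
with_space = re.search(r"(?<=Invoice No: ).+", read_pdf).group(0)

print(without_space)
print(with_space)
```

Because "." does not match a newline by default, ".+" stops at the end of the line, which is why only the invoice number on that line is returned.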

