Regex: table extraction

Is there any way to capture table values from a scanned PDF document? After digitization, the table may come out in the format shown below. The challenge is that most of the documents are of poor quality, so digitization does not give good results: the table values may contain special characters or may not come out in table format at all.

image

Digitized text in text mode (the alignment would differ in the forum, hence the image above):

Qty. Model Code Instrument/department/model/Vision Make&Model,etc. T&A Charge
82 34ARE Guitar bag $154
22 A4314 play Bag and zipper $409.80

I have to check whether the table and its values are present. If values are present, I have to extract the quantity and the model code from this table. Form Extractor doesn't work because the scanned positions may change. Is there any way to capture the table values with a regex? The table may have no line items, or up to 7 (max). @Yoichi @prasath17 @Lahiru.Fernando

Hi @Pradeep.Robot - Please check if this meets your requirements…

Groups(1) = Qty
Groups(2) = ModelCode
Groups(3) = Model details
Groups(4) = T&A Charge
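
For readers without the attachment, here is a rough sketch of a per-row pattern that would yield four numbered groups in that order. It is written in Python rather than as a UiPath expression, and the pattern `ROW` is a hypothetical stand-in, not the one from the post; it assumes every row starts with a numeric quantity and ends with a `$` amount.

```python
import re

# Hypothetical row pattern (NOT the one from the post). Numbered groups:
# 1 = Qty, 2 = Model Code, 3 = Model details, 4 = T&A Charge.
# Assumes the quantity is numeric and the charge starts with "$".
ROW = re.compile(
    r"^(\d+)\s+(\S+)\s+(.+?)\s+(\$[\d,]+(?:\.\d+)?)\s*$",
    re.MULTILINE,
)

sample = (
    "Qty. Model Code Instrument/department/model/Vision Make&Model,etc. T&A Charge\n"
    "82 34ARE Guitar bag $154\n"
    "22 A4314 play Bag and zipper $409.80\n"
)

# The header line is skipped because it does not start with a digit.
rows = [m.groups() for m in ROW.finditer(sample)]
for qty, model_code, details, charge in rows:
    print(qty, model_code, details, charge)
```

An empty `rows` list would mean no line items were found, which covers the "table may be empty" case.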


Hi,

Another solution for your sample:

System.Text.RegularExpressions.Regex.Matches(yourString,"(?<=Qty. Model Code Instrument/department/model/Vision Make&Model,etc. T&A Charge\r?\n|\G)(?<Qty>\w+)\s+(?<ModelCode>\w+)\s+(?<Content>\P{Sc}+)(?<Charge>\S+)(\r?\n|$)")
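
To illustrate the idea behind this expression outside UiPath, here is a minimal Python sketch. Python's `re` module has neither `\G` nor `\P{Sc}`, so this swaps in a different technique: find the header line first, then read the following lines one by one and stop at the first line that does not look like a row, which is roughly the contiguity that `\G` enforces. The names `HEADER`, `ROW`, `MAX_ROWS` and `extract_rows` are made up for this example, the header check is deliberately loose, and only `$` is treated as the currency symbol.

```python
import re

# Loose header check (assumption): OCR may garble the middle of the header.
HEADER = re.compile(r"Qty\.\s+Model Code\b.*T&A Charge")

# One table row. Assumption: the charge starts with "$" (the .NET pattern
# accepts any currency symbol via \P{Sc}).
ROW = re.compile(
    r"(?P<Qty>\w+)\s+(?P<ModelCode>\w+)\s+(?P<Content>[^$]+?)\s*(?P<Charge>\$\S+)\s*"
)

MAX_ROWS = 7  # the question states at most 7 line items


def extract_rows(text):
    """Return (Qty, ModelCode) pairs, or [] if the table or its rows are missing."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if HEADER.search(line):
            break
    else:
        return []  # header not found: no table in this document

    rows = []
    for line in lines[i + 1 : i + 1 + MAX_ROWS]:
        m = ROW.fullmatch(line.strip())
        if m is None:
            break  # first non-row line ends the table
        rows.append((m["Qty"], m["ModelCode"]))
    return rows


sample = (
    "Some scanned text before the table\n"
    "Qty. Model Code Instrument/department/model/Vision Make&Model,etc. T&A Charge\n"
    "82 34ARE Guitar bag $154\n"
    "22 A4314 play Bag and zipper $409.80\n"
    "Total due on receipt\n"
)
print(extract_rows(sample))
```

An empty result covers both "no table" and "table with no line items", so the presence check and the extraction come from the same call.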

Main.xaml (7.7 KB)

Regards,


@prasath17, @Yoichi: Thanks to both of you for the solutions. I will try them and let you know how it goes. Thanks very much for the quick turnaround.