I am trying to extract tabular data from a PDF using ABBYY Cloud OCR. I need the positions of the field names in order to iterate through the table. The Result output of ABBYY Cloud OCR is of type IEnumerable<KeyValuePair<Rectangle, String>>. Please let me know how to access a variable of this type.
@Florent_Salendres might be a good bet for this one.
I’m not sure I understand what you are looking for.
If you want to create a variable for the output of ABBYY Cloud OCR, you can press Ctrl + K in the “Result” field of the activity; this will auto-generate the variable.
If you want to access its values, you can do as follows (abbyResult being the collection of KeyValuePair results).
Here we will take the first item (index 0):
abbyResult(0).Key should give you the rectangle (the Key of the KeyValuePair), which is a System.Drawing.Rectangle
abbyResult(0).Value will give you the text (the Value of the KeyValuePair), which is a String
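To make the access pattern above concrete, here is a minimal sketch in Python rather than a UiPath workflow. It models each result as a (rectangle, text) pair and looks up the position of a field name. The Rect type, the find_field helper and the sample data are all assumptions made up for illustration; in UiPath the same idea would be a For Each over the result, reading .Key and .Value.

```python
from collections import namedtuple

# Stand-in for System.Drawing.Rectangle (X, Y, Width, Height); an assumption for this sketch.
Rect = namedtuple("Rect", ["X", "Y", "Width", "Height"])


def find_field(results, field_name):
    """Return the rectangle of the first entry whose text equals field_name, or None."""
    wanted = field_name.strip().lower()
    for rect, text in results:  # each item plays the role of one KeyValuePair
        if text.strip().lower() == wanted:
            return rect
    return None


# Made-up sample data, not taken from any real OCR output.
sample = [
    (Rect(40, 100, 60, 12), "Item"),
    (Rect(120, 100, 110, 12), "Description"),
    (Rect(400, 100, 50, 12), "Amount"),
]

first_rect, first_text = sample[0]  # comparable to abbyResult(0).Key and abbyResult(0).Value
print(first_text, first_rect.X, first_rect.Y)
print(find_field(sample, "description"))
```

The lookup is case-insensitive because OCR output often differs in capitalisation from the printed header; whether that is appropriate depends on the documents.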
Hope this helps. If you have any other questions, let me know; you can also share your workflow and what exactly you would like to achieve.
Thank you for your response. I have attached the sample documents here. I need to extract the tabular data column by column and feed it into Excel. The format varies for each customer, so extracting by column would help us handle the formatting for thousands of customers. Sample.pdf (178.0 KB)
Sample1.pdf (181.6 KB)
Your PDFs seem to be native (not scanned). I would say there are easier solutions than using ABBYY.
I would personally go for regex, or split each line on spaces, with the data you have.
It will be difficult to put together an example of this quickly, but I’m sure you can find examples in existing posts on the forum.
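As a rough illustration of the regex idea, here is a minimal Python sketch. The line layout (item code, free-text description, quantity, amount) and the sample line are assumptions invented for this example, not taken from the attached PDFs. The pattern anchors the numeric columns at the end of the line, so the description may contain spaces and digits, but it only works when the trailing columns are always present.

```python
import re

# Hypothetical layout: <code> <description ...> <quantity> <amount with two decimals>
LINE_PATTERN = re.compile(
    r"^(?P<code>\S+)\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<quantity>\d+)\s+"
    r"(?P<amount>\d[\d,]*\.\d{2})$"
)


def parse_line(line):
    """Return a dict of named fields, or None when the line does not fit the layout."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# Made-up sample line, for illustration only.
parsed = parse_line("A100 Steel bolt 10 mm 25 1,250.00")
print(parsed)
```

A line with a missing quantity or amount returns None here, so such lines would need separate handling.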
The actual documents are scanned. These samples were created just to show what the formats look like. Regex and splitting on spaces will not be effective here, since the description field contains several words separated by spaces and may also contain numbers. In some cases, some fields can be empty.
Is there an option to read the tabular data column-wise based on the ABBYY result?
Technically yes, you could work something out by ordering the extracted data by rectangle values, starting from a specific area.
The challenge here, however, would be the percentage of correct results you get out of the parsing, which will depend on the quality of the scan and on how structured the documents are.
What kind of string output do you get from the scanned document? Does it match the real document values?
If the output you get is reliable, I’m pretty sure that a good regex could work it out as well, but once again, this will depend on the correctness of the extracted structure of the documents you will be working on.
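The idea of ordering the extracted data by rectangle values can be sketched as follows, in Python rather than as a UiPath workflow. Everything here is an assumption for illustration: the Rect type stands in for System.Drawing.Rectangle, the header positions are taken as already known, each word is assigned to the column whose header centre is horizontally closest, and words are grouped into rows by Y position within a tolerance. Real scans may need a different assignment rule.

```python
from collections import namedtuple

# Stand-in for System.Drawing.Rectangle; an assumption for this sketch.
Rect = namedtuple("Rect", ["X", "Y", "Width", "Height"])


def center_x(rect):
    return rect.X + rect.Width / 2


def group_into_columns(headers, words, row_tolerance=5):
    """headers: {column name: Rect}; words: list of (Rect, text) below the header row.
    Returns one dict per row mapping column name to text ('' when the cell is empty)."""
    rows = []  # each entry: [row_y, {column name: [(x, text), ...]}]
    for rect, text in sorted(words, key=lambda w: w[0].Y):
        # Pick the column whose header centre is horizontally closest to the word.
        column = min(headers, key=lambda name: abs(center_x(headers[name]) - center_x(rect)))
        # Start a new row when the word sits clearly below the current row.
        if not rows or abs(rect.Y - rows[-1][0]) > row_tolerance:
            rows.append([rect.Y, {name: [] for name in headers}])
        rows[-1][1][column].append((rect.X, text))
    # Join the words of each cell from left to right.
    return [
        {name: " ".join(t for _, t in sorted(cells)) for name, cells in row.items()}
        for _, row in rows
    ]


# Made-up header and word positions, for illustration only.
headers = {
    "Item": Rect(40, 100, 60, 12),
    "Description": Rect(120, 100, 110, 12),
    "Amount": Rect(400, 100, 50, 12),
}
words = [
    (Rect(40, 120, 30, 12), "A100"),
    (Rect(165, 120, 30, 12), "bolt"),
    (Rect(200, 120, 20, 12), "10"),
    (Rect(120, 121, 40, 12), "Steel"),
    (Rect(410, 119, 40, 12), "12.50"),
    (Rect(40, 140, 30, 12), "B200"),
    (Rect(405, 141, 45, 12), "99.00"),
]

rows = group_into_columns(headers, words)
for row in rows:
    print(row)
```

Because assignment is by position rather than by splitting text, a description containing spaces or digits stays in one cell, and an empty cell simply produces an empty string.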
I’m trying to get the exact position of a word, and I’m not able to create a variable for the output of Microsoft OCR’s Result field. If I use Ctrl + K in the field, it sets up the variable, but then a compile error pops up that says “Type rectangle not defined”. What am I doing wrong?
I also tried to extract something useful from my OCR read with AbbyyCloudOCR->Output->Result
But this just prints the whole text that was read; it isn’t divided up as I expected.
Are there other ways to get output from the Result variable?