Every day we receive emails with client data in a table. The table is located in the email’s body.
What we are doing: extracting particular data and sending it to robots for further processing.
The problem: the table's length, width and number of rows are dynamic. Every time we get slightly different tables, but the patterns are quite stable.
I opened Outlook, pressed Ctrl+A and Ctrl+C, then pasted with Ctrl+V into Excel (later I will try to write a macro or something similar to do this in the background without opening Outlook and Excel).
That is what I got in Excel.
I marked in green the cells that should be extracted into the data table for further processing.
My hypothesis about what to do next:
- Split the table at the empty lines, so that I get 8 separate data tables.
- Some of the text is stable, so I could use it as an anchor. For example, to extract “My Banana” I should take the text that lies under “Company Name”.
To extract “Harry Potter” and his birthdate I should use “Orgnr” as an anchor, plus some splitting and regex.
To extract the rest of the data I should use “Reg.nr”, “End date” and “Insured object” as anchors too.
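
To make the idea concrete, here is a minimal Python sketch of the two steps: splitting the pasted sheet at empty rows, then reading the cell under an anchor label. It is only an illustration under assumptions, not a finished solution. The function names (`split_blocks`, `value_below`, `split_name_and_date`), the sample rows, the position of the name relative to “Orgnr”, and the `dd.mm.yyyy` date format are all hypothetical; the real sheet will need its own offsets and patterns.

```python
import re

def is_empty(row):
    """A row counts as empty when every cell is blank."""
    return all(str(cell).strip() == "" for cell in row)

def split_blocks(rows):
    """Split the pasted sheet into blocks separated by empty rows."""
    blocks, current = [], []
    for row in rows:
        if is_empty(row):
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(row)
    if current:
        blocks.append(current)
    return blocks

def value_below(block, anchor):
    """Return the cell directly under the anchor label, or None if absent."""
    for r, row in enumerate(block[:-1]):
        for c, cell in enumerate(row):
            if str(cell).strip().lower() == anchor.lower():
                below = block[r + 1]
                return str(below[c]).strip() if c < len(below) else None
    return None

# Assumed date format (dd.mm.yyyy); adjust to whatever the emails really use.
DATE_RE = re.compile(r"\b(\d{2}\.\d{2}\.\d{4})\b")

def split_name_and_date(text):
    """Split a cell like 'Name dd.mm.yyyy' into (name, date)."""
    match = DATE_RE.search(text)
    if not match:
        return text.strip(), None
    return text[:match.start()].strip(" ,;-"), match.group(1)

# Hypothetical sample data; the real layout is different.
sheet = [
    ["Company Name", "Orgnr"],
    ["My Banana", "Harry Potter 31.07.1980"],
    ["", ""],
    ["Reg.nr", "End date"],
    ["AB12345", "01.01.2025"],
]
blocks = split_blocks(sheet)
company = value_below(blocks[0], "Company Name")
name, birthdate = split_name_and_date(value_below(blocks[0], "Orgnr"))
```

Looking up anchors by label instead of by fixed cell address is what should keep this working when the number of rows or columns changes between emails. If a value sits to the right of its anchor instead of below it, the same lookup can return `row[c + 1]` instead.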
Please confirm or refute my theory. If you could suggest the best way to handle this mess, that would be amazing!