I’m looking to extract a specific table from a PDF document with RegEx, as I’m unable to scrape the data with the UiPath data scraping activity due to the version of Adobe Reader we use in my organisation.
The PDF I’m working with is a 7-page document with several tables, but I only need one specific table. For confidentiality reasons I can’t share the PDF itself, but the table I’m looking to extract looks something like this:
Any pointers please?
FYI: the values in the table are dynamic; some fields may or may not contain data on a case-by-case basis.
Hi, thanks for your recommendation. However, it looks like the proposed solution is a third-party package. Unfortunately, my company operates a strict no-third-party-package policy, so I won’t be able to use it.
When some cells in the table may be empty, regex or string manipulation may not be able to get the extraction right. We could try the Generate Data Table activity, working through all its combinations of options. If that is not successful, we can try the Word Interop methods in C#. You could check the post below for the example workflow.
However, even this method will not always work for some types of PDF. We would need to check it thoroughly against the types of PDF that you receive and then confirm whether it works in all cases.
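To make the regex concern concrete, here is a minimal sketch of the one case where regex tends to cope with empty cells: a label/value table where the label is always present and only the value may be missing. Everything in it is an assumption for illustration (the sample text, the "Applicant details" start and end markers, the field labels), since the real PDF was not shared. It is written in Python only to keep it short and runnable; the two patterns use plain numbered groups and should carry over to .NET's Regex with the equivalent Singleline and Multiline options.

```python
import re

# Hypothetical text, as it might come out of a PDF text-extraction step.
# The section markers and field labels are invented for illustration.
SAMPLE = (
    "Section 2: Summary\n"
    "Some unrelated paragraph.\n"
    "Applicant details\n"
    "Name            Jane Doe\n"
    "Reference       \n"
    "Start date      01/02/2023\n"
    "Amount          1,250.00\n"
    "End of applicant details\n"
    "Section 3: Other table\n"
    "Name            Not this one\n"
)

# Step 1: isolate the one table between two markers assumed to be fixed.
BLOCK = re.compile(
    r"Applicant details\r?\n(.*?)End of applicant details", re.DOTALL
)

# Step 2: one row = a known label followed by an optional value.
# [ \t]* (rather than \s*) keeps the match on a single line, so an
# empty value cannot swallow the next row.
ROW = re.compile(
    r"^(Name|Reference|Start date|Amount)[ \t]*(.*?)[ \t\r]*$", re.MULTILINE
)


def extract_table(text):
    """Return {label: value} for the target table; empty cells become ''."""
    block = BLOCK.search(text)
    if block is None:
        return {}
    return {label: value for label, value in ROW.findall(block.group(1))}


table = extract_table(SAMPLE)
print(table)
```

This only holds while every row starts with a known label. For a grid-style table where an empty cell simply disappears from the extracted text, the columns can no longer be told apart by pattern alone, which is the limitation described above.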
It says "object reference not set to an instance of an object".
I also got this:
“message”: “Microsoft Word Cannot access individual rows in this collection because the table has vertically merged cells. Microsoft.Office.Interop.Word.Row get_First()”,
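For context on this second error: Word raises it when a table's Rows collection is accessed and the table contains vertically merged cells. A commonly used workaround is to skip Rows and walk `table.Range.Cells` instead, reading each cell's `RowIndex`, `ColumnIndex` and `Range.Text`. The sketch below shows only the regrouping step, in plain Python with no Word dependency. The input triples are hypothetical stand-ins for what would be read from those three properties, and it is an assumption here that column indices stay aligned under a vertical merge (horizontal merges can shift them).

```python
def rebuild_grid(cells):
    """Build a rectangular list of rows from (row_index, column_index, text) triples.

    Indices are 1-based, as Word reports them. Positions with no cell of
    their own (for example the lower part of a vertically merged cell)
    are left as empty strings.
    """
    cells = list(cells)
    if not cells:
        return []
    n_rows = max(r for r, _, _ in cells)
    n_cols = max(c for _, c, _ in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for r, c, text in cells:
        # Word ends each cell's text with an end-of-cell marker ("\r\x07").
        grid[r - 1][c - 1] = text.rstrip("\r\x07").strip()
    return grid


# Hypothetical cells: "Region A" is merged vertically across rows 2 and 3,
# so row 3 has no cell of its own in column 1.
cells = [
    (1, 1, "Region\r\x07"), (1, 2, "Item\r\x07"), (1, 3, "Qty\r\x07"),
    (2, 1, "Region A\r\x07"), (2, 2, "Widget\r\x07"), (2, 3, "4\r\x07"),
    (3, 2, "Gadget\r\x07"), (3, 3, "\r\x07"),
]
grid = rebuild_grid(cells)
for row in grid:
    print(row)
```

In a UiPath workflow the same loop would sit in an Invoke Code activity (VB.NET or C#) and add rows to a DataTable rather than to a list of lists.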