I need the PDF table to be extracted to Excel as it is.
The text in the PDF is selectable. However, there are many problems with the PDF:
Messy order of selection - The last line may be selected after the second line and so forth… Tried fixing the Read order in Adobe Reader settings without any success.
Formats are not known for the columns except the first two which are surely numeric…I was successful in extracting values for the first 2 numeric columns and the 3rd assuming it starts with a number, using the “environment.newline” method.
Also, there is a possibility of multi-line entries or no entries at all in each cell especially for the last column in the PDF table.
OCR is a fail as there are special characters or symbols expected in most of the cells.
In order to select the column values in order, I tried the Alt+Mouse button down-Hover image-Mouse button Up, however without any luck.
Office 365 Excel has the Image to table feature which works pretty well, but I found out that’s only android based and not available for desktop.
Can any one guide me in cracking this problem? Thanks very much.
I had the exact same problem in a robot I built 2 weeks ago for a client, except the data in the Pdf was an invoice instead of a parts list.
I used the Pdf Export function to Excel Workbook that comes in the paid version of Adobe Acrobat DC. It is about $15 / month. This function is not in Adobe Acrobat Reader.
It was not a perfect solution, as my tables came out slightly different from document to document. To use your data as an example, in one Pdf document, Pos and Qty come out as expected in columns A and B of the workbook, but in other Pdf’s I would get both values in Column A, and then Partnumber either in column B or C.
My solution was to write parsing logic on my excel workbooks that could extract the columns from a single datarow at a time. I looked at a dozen exported documents and figured out all the various combinations I was getting in a single row. Then wrote Regular Expressions to determine when two columns were combined, and split them as they should be. I also built a utility .xaml that would take a row and return the next value so that when Pdf → Excel inserted empty columns they would get skipped.
My problem was even more complex because I had a situation where two rows in the invoice actually went together, and I needed logic to combine them in certain situations, and I needed to handle the invoice header (Bill To, Invoice DAae, Invoice #, etc).
I know this all sounds a little complex, but the end solution worked pretty damned well, and has processed almost 1,000 pdf invoices.
Thanks mjdeale,
Seems this is a very common problem.
I had done a few improvisations, trying to extract the PDF table without loss of formats and order.
These might help or amuse you.
For some weird reason, selecting and highlighting text in the table, puts some order in selecting the text in the haphazardly made tables.
In Adobe reader settings, replace colors of the text and avoid line-art to be coloured. This makes the cell grid lines seemingly “go away” and makes OCR more reliable. Otherwise the grid-lines would be recognized as “l” or “!”, leading to a filtering nightmare.
Manually, Alt-click-drag selects a column values in a table. However, I am unsuccessful, yet to emulate this automatically in UIPATH. (Alt key+Click image with left mouse button down+Hover image+Click image with offset values and left mouse button up).
My unreliable experiences notwithstanding, these might be useful in many situations I believe.
Still searching for a rock-solid way to convert the PDF tables into Excel tables,whatever may be the text order.
Waiting for the experts to chime in with their suggestions.