Data Extraction from PDF tables

Please suggest the best way to extract row data from tables in PDFs. The issue is that the system generates multiple PDFs, and the number of rows varies from one PDF to the next, while the columns stay the same. The bot has to extract the row data and write it to an Excel file.
I would appreciate any support here…

We have already tried string manipulation and Document Understanding…


You can try using regex if every column always contains data…
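
To illustrate the regex suggestion, here is a minimal Python sketch. The column layout (ID, description, quantity, amount), the pattern, and the sample text are all assumptions made up for illustration, since the real PDF was not shared; the pattern would need to be adapted to the actual columns. Because only lines that look like a full row are matched, the number of rows does not need to be known in advance.

```python
import re

# Hypothetical row layout (an assumption): ID, description, quantity, amount.
# Every column must be present for a line to match.
ROW_PATTERN = re.compile(
    r"^(?P<id>\d+)\s+(?P<description>.+?)\s+(?P<qty>\d+)\s+(?P<amount>\d+\.\d{2})$"
)

def extract_rows(text):
    """Return one dict per line that matches the row pattern."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows

# Made-up text standing in for what a PDF read would return.
sample_text = """Invoice summary
ID  Description   Qty  Amount
1   Widget A      2    10.50
2   Large gadget  10   120.00
Total 130.50
"""

rows = extract_rows(sample_text)
for row in rows:
    print(row)
```

Lines such as the header or the total are skipped because they do not fit the pattern, so a variable row count is handled without a maximum row limit.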

If some columns can be empty, you can read the PDF with the format preserved and count the characters for each column. Once you know the column widths, you can use a fixed-length approach to separate the data for each column, even when some columns have no data.
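
The fixed-length idea could look roughly like the Python sketch below. The column names and character positions are assumptions for illustration only; in practice they would come from counting characters in the layout-preserved text of the real PDF. The sample lines are built with padding to stand in for that text.

```python
# Hypothetical column boundaries as (start, end) character positions.
# These are assumptions; measure them from the layout-preserved PDF text.
COLUMNS = {
    "id": (0, 4),
    "description": (4, 18),
    "qty": (18, 23),
    "amount": (23, 31),
}

def parse_fixed_width(line):
    """Slice one line into columns by position; empty columns become ''."""
    return {
        name: line[start:end].strip()
        for name, (start, end) in COLUMNS.items()
    }

# Padded sample lines standing in for layout-preserved text.
# The second row has no quantity.
sample_lines = [
    f"{'1':<4}{'Widget A':<14}{'2':<5}{'10.50':<8}",
    f"{'2':<4}{'Large gadget':<14}{'':<5}{'120.00':<8}",
]

parsed = [parse_fixed_width(line) for line in sample_lines]
for row in parsed:
    print(row)
```

Because each value is taken by position rather than by splitting on spaces, a blank column stays blank instead of shifting the later values to the left.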


The issue is with the table rows, and all rows carry data. The number of rows is not fixed in each PDF, so if we set a maximum row limit, the bot also extracts the lines below the table (I mean the data in the box below it).


If you are using string manipulation…

Then if there is any specific text that identifies the end of the table, we can first split on it to get only the table data.

If there is a specific pattern, we can use regex to identify the end of the table.
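
As a rough Python sketch of cutting the text at the end of the table: the marker used here (a line starting with "Total") and the sample text are assumptions, since the actual end-of-table text in the PDFs is not known. Everything from the marker onward, including the box below the table, is dropped before the rows are parsed.

```python
import re

# Hypothetical end-of-table marker (an assumption): a line starting with "Total".
END_OF_TABLE = re.compile(r"^\s*Total\b", re.MULTILINE)

def trim_after_table(text):
    """Keep only the text before the end-of-table marker, if one is found."""
    match = END_OF_TABLE.search(text)
    return text[:match.start()] if match else text

# Made-up page text: a table followed by unrelated box content.
page_text = (
    "ID  Description  Qty\n"
    "1   Widget A     2\n"
    "2   Gadget       5\n"
    "Total 7\n"
    "Box: shipping notes\n"
)

trimmed = trim_after_table(page_text)
table_lines = trimmed.splitlines()
print(table_lines)
```

If the marker is plain fixed text rather than a pattern, a simple split on that text and keeping the first part gives the same result.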


You can refer to the link below for data extraction from PDF.

Hi @Rajesh_N ,

Could you share a sample PDF file that follows the same pattern as the PDF inputs you get from the system (not your confidential PDF)?

Also, do take a look at the post below on table extraction from PDF:

There is an NDA in place that restricts sharing the PDF, but let me try and get back to you.