Please explain the best way to extract row data from PDFs. The issue is that the system generates multiple PDFs, and the number of rows is different in each one, while the columns stay the same. The bot has to extract the row data and write it to an Excel file.
I'd appreciate any support here.
We have already tried string manipulation and Document Understanding.
You can try using regex if every column always contains data.
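A minimal sketch of the regex idea, in Python for illustration. The column layout (numeric ID, description, quantity, amount) and the sample text are assumptions, not taken from the actual PDFs. Because only lines that match the full row pattern are kept, the number of rows does not need to be known in advance.

```python
import re

# Hypothetical row layout: numeric ID, free-text description, integer quantity, decimal amount.
ROW_PATTERN = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<description>.+?)\s+(?P<qty>\d+)\s+(?P<amount>\d+\.\d{2})\s*$"
)

def extract_rows(text):
    """Return one dict per line that matches the full row pattern."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line)
        if match:
            rows.append(match.groupdict())
    return rows

# Made-up sample of the text read from a PDF.
sample = """Item  Description     Qty  Amount
1     Widget small    10   25.00
2     Widget large    3    40.50
Total amount due: 65.50
"""

rows = extract_rows(sample)
for row in rows:
    print(row)
```

The header and the total line are skipped because they do not fit the pattern. This only works when every column is filled in every row.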
If some columns can be empty, try reading the PDF with the formatting preserved and then count the characters each column occupies. Once the column widths are known, a fixed-length approach can separate the data for each column, even when some columns are empty.
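A minimal sketch of the fixed-length approach, in Python for illustration. The column names and character offsets are assumptions. In practice they would come from counting characters in the format-preserved text of a real PDF.

```python
# Hypothetical column boundaries as (start, end) character offsets.
COLUMNS = {
    "id": (0, 6),
    "description": (6, 22),
    "qty": (22, 27),
    "amount": (27, 35),
}

def split_fixed_width(line):
    """Slice one line at the fixed offsets and trim the padding."""
    return {name: line[start:end].strip() for name, (start, end) in COLUMNS.items()}

# Made-up lines, padded to the assumed widths; the second has an empty qty column.
line_full = "1".ljust(6) + "Widget small".ljust(16) + "10".ljust(5) + "25.00"
line_blank = "2".ljust(6) + "Widget large".ljust(16) + "".ljust(5) + "40.50"

full = split_fixed_width(line_full)
blank = split_fixed_width(line_blank)
print(full)
print(blank)
```

An empty column comes back as an empty string instead of shifting the later values to the left, which is the main advantage over splitting on whitespace.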
The issue is with the table rows, and every row carries data. The number of rows in each PDF is not fixed, so if we set a maximum row limit, the bot also extracts the lines below the table (I mean the data in the box below it).
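One possible way to avoid a fixed row limit is to start reading after the table header and stop at the first line that does not look like a row. The sketch below is in Python for illustration. The header text, the rule that a row starts with a numeric ID, and the sample content are all assumptions.

```python
import re

# Hypothetical markers: the header line lists these column names,
# and every table row starts with a numeric ID.
HEADER_PATTERN = re.compile(r"^\s*Item\s+Description\s+Qty\s+Amount\s*$")
ROW_START = re.compile(r"^\s*\d+\s+\S")

def table_lines(text):
    """Collect lines after the header until the first non-row line."""
    lines = []
    in_table = False
    for line in text.splitlines():
        if not in_table:
            # Skip everything up to and including the header line.
            in_table = bool(HEADER_PATTERN.match(line))
            continue
        if not ROW_START.match(line):
            break  # end of table: ignore everything below it
        lines.append(line)
    return lines

# Made-up sample; the last line sits in a box below the table.
sample = """Invoice 1001
Item  Description   Qty  Amount
1     Widget small  10   25.00
2     Widget large  3    40.50
Total amount due: 65.50
3 remarks printed in the box below
"""

body = table_lines(sample)
print(body)
```

The last line of the sample starts with a digit, but it is not picked up because reading stops at the first line that breaks the row pattern.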
Could you share a sample PDF that follows the same pattern as the PDFs you get from the system (not your confidential PDF)?
Also, do take a look at the post below on table extraction from PDFs: