PDF table extraction from multiple PDF files

Hi All,

I am trying to extract tabular data from 10-20 PDF files. The number of pages in each file may vary.

Using Form Extractor, I am unable to extract the data.

The output I am getting is missing information.

@dutta.marina

Regex or string manipulation may be a better approach, as the data looks fairly standard.
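
To illustrate the regex idea, here is a minimal sketch in Python (in UiPath the same pattern could go into a regex activity). The sample text, the row layout, and the names `SAMPLE_TEXT`, `ROW_PATTERN` and `extract_rows` are all assumptions for illustration; the real files will need their own pattern.

```python
import re

# Hypothetical sample of text read from a PDF; the real layout will differ.
SAMPLE_TEXT = """Project summary
Milestone 1  Design complete  15/01/2023
Milestone 2  Build complete  20/03/2023
Footer text
"""

# Assumed row shape: "Milestone <number>  <description>  <dd/mm/yyyy>"
ROW_PATTERN = re.compile(
    r"^Milestone\s+(?P<number>\d+)\s+(?P<description>.+?)\s+"
    r"(?P<date>\d{2}/\d{2}/\d{4})\s*$",
    re.MULTILINE,
)


def extract_rows(text):
    """Return one dict per line that matches the assumed row shape."""
    return [match.groupdict() for match in ROW_PATTERN.finditer(text)]


rows = extract_rows(SAMPLE_TEXT)
for row in rows:
    print(row)
```

Lines that do not match the pattern (headers, footers) are simply skipped, which is why this tends to hold up better than a fixed template when the number of rows changes.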

cheers

Hi @dutta.marina ,

We would need to understand a bit more about the type of documents you’ll be receiving.

  • Are the documents always digital, or are some scanned?
  • Is it only tabular data extraction, and does the format/template remain the same for all samples or inputs?
  • Have you tried the Regex Extractor or regex string manipulation separately?

Using Form Extractor would mean that the data is always in a specific format; even the number of rows should be the same for all the documents you receive. If that is the case, we would need to look further into how Form Extractor is configured.

@Anil_G

Will something like this work? I read the PDF, converted it to text, and then used it as below.

@dutta.marina

That should work.

The split and the other steps may also need a little adjusting to get the exact data.

cheers

@Anil_G

This is pulling in whatever information is above the milestone table.

@dutta.marina

Read the PDF without the formatting option.

Also, this works only for the one type of file where the milestones are sequential.

Also, as mentioned, some other cleanup is needed. The text above the table is the first item of the split; the items after it, up to the last but one, are the milestone info.
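
A rough sketch of that split in Python, assuming the text is split on the word "Milestone" (the marker, the sample text, and the `split_milestones` helper are hypothetical, not taken from the actual workflow):

```python
# Hypothetical text as it might come out of a PDF read without formatting.
SAMPLE_TEXT = (
    "Contract ABC header text "
    "Milestone 1 Design complete "
    "Milestone 2 Build complete "
    "Milestone 3 Handover Totals and signatures"
)


def split_milestones(text, marker="Milestone"):
    """Split on the marker and drop the first and last pieces."""
    pieces = [piece.strip() for piece in text.split(marker)]
    # pieces[0] is whatever sits above the table. The last piece is left
    # out because it can carry trailing text and needs its own cleanup.
    return pieces[1:-1]


milestones = split_milestones(SAMPLE_TEXT)
for item in milestones:
    print(item)
```

With this sample the last piece is "3 Handover Totals and signatures", so it mixes a milestone with trailing text and would have to be trimmed separately.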

cheers