Hello, I am trying to extract a datatable from a PDF using Citrix Scraping Relative with Google OCR, but so far I have not been able to extract the data from the table. The data comes out unorganized when I try to get it with Data Scraping. If somebody knows another way to do it, it would be very helpful. Thank you so much.
Here is my suggestion. Only use OCR if the PDF is an image. If the PDF is text (where you can highlight each word), then use Read text from PDF or a similar method.
After you read the text, you will need to use string manipulation to get it into the correct format, with comma delimiters and so on. How you do that depends on the text, so there’s no good general answer I can give, other than to learn lambda or LINQ expressions in vb.net so you can go through each line of the text and make edits.
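To give a rough idea of what that line-by-line string manipulation looks like, here is a minimal sketch. It is written in Python rather than vb.net purely for illustration, and it assumes the columns in the extracted text are separated by two or more spaces or a tab, which will not hold for every PDF. The function names and the sample text are made up for the example.

```python
import re

def line_to_fields(line):
    # Assumption: columns are separated by a tab or by runs of 2+ whitespace
    # characters, so single spaces inside a value (e.g. "Blue widget") survive.
    parts = re.split(r"\s{2,}|\t", line.strip())
    return [p.strip() for p in parts if p.strip()]

def text_to_rows(text):
    # Go through the extracted text line by line and skip blank lines.
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        rows.append(line_to_fields(line))
    return rows

# Hypothetical text as it might come out of a text-based PDF.
sample_text = """Item        Qty    Price
Blue widget   2      3.50
Gasket        10     0.75
"""

rows = text_to_rows(sample_text)
for row in rows:
    print(row)
```

If the real text uses single spaces between columns, or wraps cells across lines, the split rule would need to change (for example, a regular expression matched against the known column layout).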
If you can share an example PDF, then someone might be able to assist.
Once you have the format right for a comma-delimited CSV file, just use Write Text File with a .csv extension, then use Read CSV to load it back into a datatable.
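The same write-then-read round trip can be sketched outside the workflow designer as well. This is only an illustration in Python using its standard csv module, not the Write Text File or Read CSV activities themselves. The file name and the example rows are made up. One thing it shows is that a value containing a comma has to be quoted, which a CSV writer handles for you but plain string joining does not.

```python
import csv
import os
import tempfile

def write_rows_to_csv(rows, path):
    # The csv writer quotes any field that contains a comma or a quote.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

def read_csv_as_table(path):
    # The first row is treated as the header; each later row becomes a dict.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Hypothetical rows: a header followed by one data row.
example_rows = [
    ["Item", "Qty", "Price"],
    ["Widget, blue", "2", "3.50"],
]

csv_path = os.path.join(tempfile.gettempdir(), "pdf_table_example.csv")
write_rows_to_csv(example_rows, csv_path)
table = read_csv_as_table(csv_path)
print(table[0]["Item"])
```

All values come back as strings, so numeric columns would still need converting afterwards.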
If the PDF is an image and you are required to use OCR, then it will take some playing around with the zoom and scale settings to get the most accurate results. I’m not sure I can help much with that.
Hope this helps in some way…
Additionally, you can try Extract Structured Data, but in my experience it’s more difficult to get working (depending on the PDF), since it relies on elements and you might need to loop through each page. Also, some of the columns are sometimes missed when extracting with this method.