Extract PDF native text from grid table

NotFranmax · February 20, 2020, 3:01pm

I have a PDF file that contains numbers. Luckily they are native text, but each number sits inside a cell border drawn in the background, and the Read PDF Text activity does not recognize those borders. When a cell is empty, the gap is easy to see visually, but the extracted text gives no indication that it is there. For example, if you have a table like this:
|214||231| |233|
The extracted text looks like 214 231 233

How can I “catch” these empty spaces, so that I know the value in that cell is empty?

Anthony_Humphries · February 20, 2020, 3:04pm

Split the data on the | character and remove any elements of the array equal to String.Empty. After that, trim each of the cells.
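
A minimal Python sketch of the split-and-trim idea, assuming the extracted row still contained the | borders as in the example above. The function name `split_cells` is hypothetical, and in a UiPath workflow the same steps would be written with VB.NET string methods instead. The order matters: zero-length pieces are dropped first and trimming happens afterwards, so a piece that held only whitespace survives as an empty string and marks the empty cell.

```python
def split_cells(row: str) -> list:
    """Split one extracted table row on '|' and return its cells."""
    parts = row.split("|")
    # Drop zero-length pieces produced by adjacent, leading or trailing bars.
    non_empty = [p for p in parts if p != ""]
    # Trim afterwards, so a whitespace-only piece becomes an empty cell.
    return [p.strip() for p in non_empty]


print(split_cells("|214||231| |233|"))  # ['214', '231', '', '233']
```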

NotFranmax · February 20, 2020, 3:08pm

Unfortunately there is no | character. When the data is extracted, only the numbers come through. The border lines are not recognized at all.

Anthony_Humphries · February 20, 2020, 3:09pm

Is there additional space in the string when the cell is empty?

UiJack · February 20, 2020, 3:12pm

Looking at the screenshot, it appears to be part of a table. Try opening the PDF file in the Chrome browser and using data scraping on it. It should work.

If it isn’t client-related data, please upload a sample PDF file so that other devs can try different approaches and let you know the solution.

Cheers!

NotFranmax · February 20, 2020, 3:21pm

Thank you for the prompt replies. Unfortunately there is no additional space either. I will try the Chrome solution and come back in a few days. I am unable to upload the file here because it contains sensitive info.

NotFranmax · February 21, 2020, 6:20am

Book1.pdf (36.5 KB)

I have tried the Chrome method, but it gave nothing useful either.
Please find a sample document attached. In this document it is clearly visible that a value is missing in column B, but when using Read PDF Text it is not clear whether the missing value is in B or in C.
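
Since the extracted text carries no marker for the empty cell, one possible approach is to work from word positions rather than plain text. The sketch below is Python, not a UiPath workflow, and it assumes some PDF library has already supplied each word of a row together with its horizontal position on the page. The `Word` tuple, the `assign_to_columns` helper and the edge values are hypothetical, chosen for illustration and not measured from Book1.pdf.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Word(NamedTuple):
    text: str
    x0: float  # left edge of the word on the page (made-up units)


def assign_to_columns(words: List[Word],
                      column_edges: List[float]) -> List[Optional[str]]:
    """Place each word of one row in the column whose x-range contains it.

    column_edges holds the left edge of every column plus the right edge
    of the last one, in ascending order. Columns with no word stay None.
    """
    cells: List[Optional[str]] = [None] * (len(column_edges) - 1)
    for word in words:
        # Index of the last edge that is <= the word's left edge.
        idx = bisect_right(column_edges, word.x0) - 1
        if 0 <= idx < len(cells):
            if cells[idx] is None:
                cells[idx] = word.text
            else:
                cells[idx] += " " + word.text
    return cells


# Made-up layout: columns A, B, C start at x = 50, 150, 250; C ends at 350.
edges = [50.0, 150.0, 250.0, 350.0]
row = [Word("214", 60.0), Word("233", 262.0)]
print(assign_to_columns(row, edges))  # ['214', None, '233']
```

The `None` in the second position shows that the value is missing in column B rather than in C. Words that fall outside all of the given edges are ignored.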

Jyotika_Halai · February 21, 2020, 7:37am

Hi @NotFranmax,
Kindly check the attached workflow: Main.xaml (4.9 KB)

NotFranmax · February 24, 2020, 6:26pm

That solves it for the sample, but it does not work with my original file (which I cannot distribute). Anyway, thank you.

