Extract unstructured data (table) in a PDF

kamalakanann · April 20, 2018, 11:05am

Hi, I’m trying to extract unstructured data from a PDF based on keywords. The PDF contains different types of tables spread across multiple pages, starting from the second page. The tables are in image format, and the keyword/data is not at the same position on every page. I have tried activities such as Get OCR Text and Anchor Base, as well as scraping, indexing, Substring(), etc. Is there any way to extract the same data from multiple pages based on the occurrence of the keyword rather than its position?
Please help me if there is a solution to this problem, ideally with a simple example.

janick1535 · April 20, 2018, 12:43pm

Just scrape the whole text with OCR or with other activities and use a “Matches” activity, which can run a regex against it.
You can learn Regex at https://regexone.com/
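
To illustrate the idea of matching on a keyword rather than a position, here is a minimal sketch in Python (not UiPath itself). The page texts, the "Invoice No:" keyword, and the function name extract_by_keyword are all made up for the example; the point is only that the regex finds the value wherever the keyword appears on a page.

```python
import re

# Hypothetical OCR output: one string per page. The keyword sits at a
# different position on each page, and page 1 has no match at all.
pages = [
    "Cover page\nNo tables here",
    "Header text\nInvoice No: INV-1001\nTotal Amount: 250.00\nFooter",
    "Total Amount: 99.50\nSome other text\nInvoice No: INV-1002",
]

# Keyword, optional whitespace, then capture the next run of non-space characters.
pattern = re.compile(r"Invoice No:\s*(\S+)")


def extract_by_keyword(pages, pattern):
    """Return (page_number, captured_value) for every match on every page."""
    results = []
    for page_number, text in enumerate(pages, start=1):
        for match in pattern.finditer(text):
            results.append((page_number, match.group(1)))
    return results


matches = extract_by_keyword(pages, pattern)
print(matches)  # [(2, 'INV-1001'), (3, 'INV-1002')]
```

The pattern is the reusable part: a pattern of this shape should carry over to a regex-based activity with little or no change, but that is an assumption to verify against your own OCR output.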

kamalakanann · April 23, 2018, 8:13am

Thanks for your response Janick,

I have already tried scraping with OCR, and I can find each table using its unique values. However, I am looking for a generic solution that I can reuse across several PDFs. If there is any solution such as position-based extraction/scraping, please let me know.
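
One possible way to make this generic, sketched in Python under strong assumptions: treat a keyword in the table's header line as the anchor, then collect the lines that follow until the first blank line. The sample text, the "Item" keyword, and the helper name rows_after_keyword are hypothetical, and the sketch assumes the OCR output keeps at least two spaces between columns, which real OCR text often does not.

```python
import re


def rows_after_keyword(text, keyword):
    """Collect table rows that follow the first line containing `keyword`.

    Collection stops at the first blank line. Each row is split into cells
    wherever two or more whitespace characters occur in a row.
    """
    rows = []
    collecting = False
    for line in text.splitlines():
        if not collecting:
            # Skip everything up to and including the header line.
            if keyword in line:
                collecting = True
            continue
        if not line.strip():
            break  # a blank line ends the table
        rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


# Hypothetical OCR text for one page.
sample = (
    "Some intro text\n"
    "Item        Qty   Price\n"
    "Widget      2     10.00\n"
    "Gadget      1     25.50\n"
    "\n"
    "Footer text\n"
)

table = rows_after_keyword(sample, "Item")
print(table)  # [['Widget', '2', '10.00'], ['Gadget', '1', '25.50']]
```

Running the same function over each page's text with the same keyword gives one result per page, regardless of where the table sits on the page.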

kamalakanann · April 24, 2018, 7:44am

Hi all,
Please check and suggest a solution for my issue.

@janick1535 @vvaidya @sreekanth @ddpadil @Dominic

tmartin · April 25, 2018, 4:25pm

I believe I’m having a similar issue and was wondering if anyone could help. I have used Get OCR Text and it pulls the information. Now I’m trying to assign a value, and I use this: txt.Substring(txt.IndexOf("EMAIL ADDRESS (Please page image from Soure if available):")+"EMAIL ADDRESS (Please page image from Soure if available):".Length, txt.IndexOf("*COMMENTS:")-(txt.IndexOf("EMAIL ADDRESS (Please page image from Soure if available):")+"EMAIL ADDRESS (Please page image from Soure if available):".Length))

My issue is that I’m trying to pull the email, but the email is on the next line, not right after the colon.

Do you know how I can tweak my Assign to work?

Thanks!
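
The between-two-markers logic described above can be sketched in Python as follows. The sample text and the helper name value_between are hypothetical; the label and end marker are copied from the post. The text between the markers already contains the line break and the email, so stripping the surrounding whitespace is enough to get the value on the next line.

```python
import re

LABEL = "EMAIL ADDRESS (Please page image from Soure if available):"
END_MARKER = "*COMMENTS:"


def value_between(text, start_label, end_marker):
    """Return the trimmed text between start_label and end_marker, or None."""
    start = text.find(start_label)
    if start == -1:
        return None  # label not found in the OCR text
    start += len(start_label)
    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)  # no end marker: take everything after the label
    # strip() drops the line breaks and spaces around the value.
    return text[start:end].strip()


# Hypothetical OCR text with the email on the line after the label.
sample = (
    "NAME: Jane Doe\n"
    + LABEL + "\n"
    + "jane.doe@example.com\n"
    + "*COMMENTS: none"
)

email = value_between(sample, LABEL, END_MARKER)
print(email)  # jane.doe@example.com

# Alternative: a regex where \s* also skips the line break after the label.
regex_match = re.search(re.escape(LABEL) + r"\s*(\S+@\S+)", sample)
regex_email = regex_match.group(1) if regex_match else None
```

In .NET, the counterpart of strip() is the Trim() method on strings, so appending .Trim() to the Substring(...) expression is one thing to try, assuming the OCR text really does place the email between the label and *COMMENTS:.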
