Extract unstructured data (table) from a PDF

studio

#1

Hi, I’m trying to extract unstructured data from a PDF based on keywords. The PDF contains different types of tables spread across multiple pages, all in image format, starting from the second page, and the keyword/data is not at the same position on every page. I have tried activities and methods such as Get OCR Text, Anchor Base, scraping, indexing, and Substring(). Is there any way to extract the same data from multiple pages based on the occurrence of the keyword rather than its position?
Please let me know if there’s a solution to this problem, ideally with a simple example.


#2

Just scrape the whole text with OCR or another activity, then use a “Matches” activity, which can apply a regex to it.
You can learn regex at https://regexone.com/
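
To make the idea concrete, here is a minimal sketch in Python rather than in a UiPath workflow. The keyword "Invoice No", the sample text, and the function name are all made up for illustration. The pattern looks for each occurrence of the keyword and takes whatever follows it, on the same line or on the next one, so the position on the page does not matter.

```python
import re

def values_after_keyword(text, keyword):
    """Return the value that follows every occurrence of `keyword`.

    The value may sit on the same line as the keyword or on the next
    line; where the keyword appears on the page does not matter.
    """
    # Escape the keyword so characters like "(" or "*" match literally.
    # \s* skips spaces and line breaks, [:\-]? allows an optional separator,
    # and ([^\r\n]+) captures the rest of the line where the value starts.
    pattern = re.escape(keyword) + r"\s*[:\-]?\s*([^\r\n]+)"
    return [m.group(1).strip()
            for m in re.finditer(pattern, text, re.IGNORECASE)]

# Hypothetical OCR output for two pages; the keyword is at a different
# position on each page, and on page 3 the value is on the next line.
ocr_text = (
    "Page 2\nSome header\nInvoice No: 1001\nTotal 50\n"
    "Page 3\nInvoice No:\n1002\nOther rows\n"
)

print(values_after_keyword(ocr_text, "Invoice No"))  # ['1001', '1002']
```

The same pattern should work in the Matches activity with little or no change, since it only uses basic syntax, but I have not verified it there. You would then loop over the matches and read the first capture group of each one.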


#3

Thanks for your response Janick,

I have already tried scraping with OCR, and I can find each table using its unique values. However, I am looking for a generic solution that I can use across several PDFs. If there is any solution such as position-based extraction/scraping, please let me know.


#4

Hi All,
Please check and suggest a solution for my issue.

@janick1535 @vvaidya @sreekanth @ddpadil @Dominic


#5

I believe I’m having a similar issue and was wondering if anyone could help. I have used Get OCR Text and it pulls the information. Now I’m trying to assign a value, and I use this: txt.Substring(txt.IndexOf("EMAIL ADDRESS (Please page image from Soure if available):")+"EMAIL ADDRESS (Please page image from Soure if available):".Length, txt.IndexOf("*COMMENTS:")-(txt.IndexOf("EMAIL ADDRESS (Please page image from Soure if available):")+"EMAIL ADDRESS (Please page image from Soure if available):".Length))

My issue is that I’m trying to pull the email, but the email is on the next line, not right after the colon.

Do you know how I can tweak my Assign to make this work?

Thanks!