Unable to extract unstructured-data from PDF file

sudeshna214 · September 9, 2020, 8:24pm

Hello,

I want to extract data from a PDF file in which the “[x-value] followed by a heading” part keeps changing.

In the picture below, only the red-bordered text needs to be extracted, not the paragraph.

But I’m unable to do so. Please help.

Thanks.

sudeshna214 · September 9, 2020, 8:30pm

Sample PDF: File on MEGA

JosephNehl · September 10, 2020, 3:39pm

You could extract the text of the entire document and use regex with a Matches activity to do this. A very basic solution would be the following regex. However, if the titles ever include special symbols, it would need to be reworked a little. Let me know if there are any special symbols that will show up and I can help you rework the expression.

Once you have run the Matches activity, you can use a For Each loop over the extracted results, convert each match to a string, and place it in an array, a DataTable, or however you wish to store it.
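For readers who want to see the extract, match, and loop flow outside of Studio, here is a minimal sketch in Python rather than UiPath activities. The sample text, the `extract_headings` helper, and the heading shape (an 8-digit value in square brackets followed by plain letters, digits, and spaces) are assumptions for illustration, not the pattern originally posted.

```python
import re

# Hypothetical sample standing in for the text read from the PDF.
text = (
    "[12345678] First Heading\n"
    "Paragraph text that should be ignored.\n"
    "[87654321] Second Heading\n"
    "More paragraph text.\n"
)

# Assumed heading shape: "[", eight digits, "]", then letters, digits, spaces.
HEADING = re.compile(r"\[\d{8}\][A-Za-z0-9 ]*")

def extract_headings(document_text):
    """Loop over every match and store each one as a plain string."""
    results = []
    for match in HEADING.finditer(document_text):
        results.append(match.group(0).strip())
    return results

headings = extract_headings(text)
for heading in headings:
    print(heading)
```

The list plays the role of the array or DataTable mentioned above. A basic character class like this one stops at the first symbol it does not list, which is why special characters need extra handling.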

sudeshna214 · September 11, 2020, 12:58am

Hi @JosephNehl ,

I really appreciate the regex you provided. I’ll try it and let you know if I run into any issues with it.

Thanks.

sudeshna214 · September 12, 2020, 10:46am

Hi @JosephNehl, I can’t get it to work when the line contains special characters.
Have a look at this regex101: build, test, and debug regex

For normal characters it works fine. Please help.

sudeshna214 · September 12, 2020, 10:53am

Hi, I think I’ve got the solution:
Regex should be: (\[\d{8}\][a-zA-Z0-9@#$%&*+\-_(),+':;?.,!"\[\] ]*)

Let me know if there’s any other solution.
Thanks.

supermanPunch · September 12, 2020, 1:38pm

@sudeshna214 Maybe you can shorten it to this:

\[\d{8}\].*
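To illustrate how the shortened pattern differs from the explicit character list, here is a small Python sketch. The sample text is made up, and Python's `re` module stands in for the .NET engine that UiPath uses, so treat it as an approximation. In both engines `.` matches any character except a newline by default, so `.*` runs to the end of the heading line, while the explicit class stops at the first character it does not list (here `/`).

```python
import re

# Hypothetical sample text; the first heading contains "/" which the
# explicit character class does not list.
text = (
    "[12345678] Cost & Revenue (2020): Q1/Q2?\n"
    "Body paragraph that should not be captured.\n"
    "[00000001] Plain Heading\n"
    "Another paragraph.\n"
)

# Explicit list of allowed characters, as posted earlier in the thread.
explicit = re.compile(r"""\[\d{8}\][a-zA-Z0-9@#$%&*+\-_(),+':;?.,!"\[\] ]*""")

# Shortened version: everything up to the end of the line.
short = re.compile(r"\[\d{8}\].*")

explicit_matches = [m.group(0) for m in explicit.finditer(text)]
short_matches = [m.group(0) for m in short.finditer(text)]

print(explicit_matches)
print(short_matches)
```

The trade-off is that `.*` also captures anything else that happens to share the line with the heading, so it fits best when each heading sits on a line of its own.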

JosephNehl · September 12, 2020, 2:22pm

Hello,

Try the following regex pattern and see if it works for you:

(\[\d{8}\][a-zA-Z0-9\S ]*)

This should include all of the special characters. Let me know if you need any further help!
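One detail worth checking when copying patterns like this: the square brackets around the eight digits need a backslash in front of them. Without the backslashes, `[\d{8}]` is read as a character class that matches a single digit, `{`, `8`, or `}`. The Python sketch below, using made-up sample text, shows the difference. Python's `re` is used here only for illustration and behaves the same as .NET for these constructs, as far as I know.

```python
import re

# Hypothetical sample text with one heading and one paragraph line.
text = (
    "[12345678] Price: $5 (approx.)\n"
    "Paragraph with 3 items.\n"
)

# Brackets escaped: matches a literal "[", eight digits, a literal "]".
escaped = re.findall(r"\[\d{8}\][a-zA-Z0-9\S ]*", text)

# Brackets not escaped: the first part is a one-character class,
# so a match can start at any digit, including inside the paragraph.
unescaped = re.findall(r"[\d{8}][a-zA-Z0-9\S ]*", text)

print(escaped)
print(unescaped)
```

With the escaped form only the heading line is returned. The unescaped form drops the opening bracket and also picks up text from the paragraph.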
