Unable to extract unstructured-data from PDF file

Hello,

I want to extract data from a PDF file in which the “[x-value] followed by a heading” lines keep changing.

In the picture below, only the red-bordered line needs to be extracted, not the paragraph.
image

But I’m unable to do so. Please help.

Thanks.

Sample PDF: File on MEGA

You could extract the text of the entire document and then use a regex with a Matches activity. A very basic solution would be the following regex. However, if the titles ever include special symbols, it would need to be reworked a little. Let me know if any special symbols will show up and I can help you rework the expression.

Once the Matches activity has run, you can use a For Each loop over the result, convert each match to a string, and store it in an array, a DataTable, or however you wish.
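The extract, match, and loop steps described above can be sketched outside UiPath as well. This is only an illustration in Python, not the workflow itself: the sample text, the `extract_headings` helper, and the pattern are assumptions (the headings are assumed to be an 8-digit number in square brackets followed by a plain title, as the patterns later in this thread suggest).

```python
import re

# Hypothetical text, standing in for what a PDF text extraction might return.
extracted_text = """[12345678] Introduction
This paragraph should not be captured.
[87654321] Scope of Work
Another paragraph that should be skipped.
"""

# Assumed heading format: 8 digits in square brackets, then letters, digits, spaces.
HEADING_PATTERN = re.compile(r"\[\d{8}\][A-Za-z0-9 ]*")


def extract_headings(text):
    """Return every heading match as a string, in document order."""
    return [m.group(0).strip() for m in HEADING_PATTERN.finditer(text)]


# Equivalent of the For Each loop: each match becomes a plain string in a list.
headings = extract_headings(extracted_text)
for heading in headings:
    print(heading)
```

Because the character class contains no newline, each match stops at the end of the heading line and the paragraph below it is left out.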

Hi @JosephNehl,

I really appreciate the regex you provided. I’ll try it and let you know if I run into any issues.

Thanks.

Hi @JosephNehl, I’m unable to do the same when the line contains special characters.
Have a look at this regex101: build, test, and debug regex

For normal characters it works fine. Please help.

Hi, I think I have found the solution:
Regex should be: (\[\d{8}\][a-zA-Z0-9@#$%&*+\-_(),+':;?.,!"\[\] ]*)

Let me know if there’s any other solution.
Thanks.

@sudeshna214 Maybe you can shorten it to this:

\[\d{8}\].*
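To see how the shortened pattern differs from the explicit character list, here is a small comparison sketch in Python. The sample text is made up for illustration and is not taken from the attached PDF.

```python
import re

# Hypothetical text: two headings with special characters, one body line.
sample = (
    "[12345678] Terms & Conditions (Part-1): Q&A?\n"
    "Body paragraph text.\n"
    "[87654321] Fees/Charges @ 5%\n"
)

# Explicit list of allowed characters, copied from the earlier reply.
EXPLICIT = re.compile(r"""(\[\d{8}\][a-zA-Z0-9@#$%&*+\-_(),+':;?.,!"\[\] ]*)""")

# Shortened form: bracketed 8-digit number, then everything up to the line end.
SHORT = re.compile(r"\[\d{8}\].*")

explicit_matches = EXPLICIT.findall(sample)
short_matches = SHORT.findall(sample)

print(explicit_matches)
print(short_matches)
```

The explicit list stops at the first character it does not name (here the `/`), while `.*` runs to the end of the line. In .NET, which UiPath uses, `.` excludes only `\n`, so with Windows line endings a match may end in `\r` and might need trimming.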

Hello,

Try the following regex pattern and see if it works for you:

(\[\d{8}\][a-zA-Z0-9\S ]*)

This should include all of the special characters. Let me know if you need any further help!
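One thing worth checking with a pattern of this shape is that the square brackets around the digits are escaped. The sketch below, in Python with a made-up sample line, compares an unescaped and an escaped variant. Unescaped, `[\d{8}]` is a character class that matches a single character, so the match starts at the first digit rather than at the opening bracket.

```python
import re

# Hypothetical heading line, for illustration only.
line = "[12345678] Fees/Charges @ 5%"

# Unescaped: [\d{8}] is a character class (one digit, "{", "8" or "}").
unescaped = re.compile(r"([\d{8}][a-zA-Z0-9\S ]*)")

# Escaped: a literal "[", exactly eight digits, a literal "]".
escaped = re.compile(r"(\[\d{8}\][a-zA-Z0-9\S ]*)")

unescaped_match = unescaped.search(line).group(1)
escaped_match = escaped.search(line).group(1)

print(unescaped_match)
print(escaped_match)
```

Since `\S` already covers letters and digits, the tail `[a-zA-Z0-9\S ]*` behaves the same as `[\S ]*`: any run of non-whitespace characters and spaces.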