Hello,
I want to extract data from a PDF file in which each heading has the form “[x-value]” followed by the heading text, and both keep changing.
In the picture below, only the red-bordered part needs to be extracted, not the paragraph.
I have not been able to do this. Please help.
Thanks.
Sample PDF: File on MEGA
You could extract the text of the entire document and then use regex with a Matches activity to do this. A very basic solution would be the following regex. However, if the titles ever include special symbols, it would need to be reworked a little. Let me know if any special symbols will show up and I can help you rework the expression.
Once you have run the Matches activity, you can use a For Each loop over the result, convert each match to a string, and store it in an array, a DataTable, or however you wish.
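The match-then-loop idea above can be sketched outside UiPath as well. The Python snippet below is only an illustration under stated assumptions: `SAMPLE_TEXT` is made-up text standing in for the extracted PDF content, and `HEADING_PATTERN` is an assumed basic pattern (eight digits in square brackets followed by letters, digits, and spaces), not necessarily the expression from the reply above.

```python
import re

# Hypothetical text standing in for the output of a PDF text-extraction step.
SAMPLE_TEXT = """[12345678] Introduction
This paragraph should not be captured.
[87654321] Scope and Limits
Another paragraph that should be skipped.
"""

# Assumed basic pattern: 8 digits in square brackets, then letters, digits, spaces.
# The character class does not include newlines, so a match stops at the line end.
HEADING_PATTERN = re.compile(r"\[\d{8}\][A-Za-z0-9 ]*")


def extract_headings(text: str) -> list[str]:
    """Loop over every match and keep each one as a plain string."""
    return [match.group(0) for match in HEADING_PATTERN.finditer(text)]


headings = extract_headings(SAMPLE_TEXT)
for heading in headings:
    print(heading)
```

The list comprehension plays the role of the For Each loop: each match object is converted to a string and collected, and the resulting list could then be written to a table or file.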
Hi @JosephNehl,
I really appreciate the regex you provided. I’ll try it and let you know if I run into any issues.
Thanks.
Hi @JosephNehl, the regex does not capture the full heading when the line contains special characters.
Have a look at this example on regex101.
It works fine for normal characters. Please help.
Hi, I think I found the solution:
Regex should be: (\[\d{8}\][a-zA-Z0-9@#$%&*+\-_(),+':;?.,!"\[\] ]*)
Let me know if there’s any other solution.
Thanks.
@sudeshna214 Maybe you can shorten it to this:
\[\d{8}\].*
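To see how the explicit character list and the shorter `.*` version differ, here is a small Python comparison. The sample text is invented for illustration; the two patterns are the ones posted in this thread. Python's `re` is used only as a stand-in, so behaviour in the Matches activity (which uses .NET regex) may differ in details such as how line endings are handled.

```python
import re

# Hypothetical extracted text; the last heading contains "/", which is not
# in the explicit character list.
SAMPLE_TEXT = """[12345678] Introduction
Paragraph text to skip.
[87654321] Scope & Limits (v2): Q&A?
More paragraph text.
[11112222] Input/Output
"""

# Explicit list of allowed characters, as posted earlier in the thread.
EXPLICIT = re.compile(r"""(\[\d{8}\][a-zA-Z0-9@#$%&*+\-_(),+':;?.,!"\[\] ]*)""")

# Shortened version: "." matches any character except a newline by default,
# so the match runs to the end of the heading line.
SHORT = re.compile(r"\[\d{8}\].*")

explicit_matches = EXPLICIT.findall(SAMPLE_TEXT)
short_matches = SHORT.findall(SAMPLE_TEXT)

print(explicit_matches)
print(short_matches)
```

With this sample, the explicit list stops at the first character it does not know (the `/`), while `.*` keeps the whole line. The trade-off is that `.*` also captures anything else that follows the heading on the same line.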
Hello,
Try the following regex pattern and see if it works for you:
(\[\d{8}\][a-zA-Z0-9\S ]*)
This should include all of the special characters. Let me know if you need any further help!