PDF Scraping Field Value Data Problem

Hello,

I want to scrape data from a W-2 form (PDF) so that I can save it to a database, but I am not able to get field-wise data.
- I have tried “Read PDF Text”, which reads the whole document and fetches all the text, but I want field-wise values such as:
Employee’s social security number => 1234 56 7890
Employer identification number => 11-22334455
- I have tried “Screen Scraping” and “Data Scraping”, but I am not able to get any specific element.
- I have tried the “Anchor Base” activity with “Find Image” and “Get Text”, but I am not able to select a specific element.
Please find the attached PDF document for your reference.
PR2010FormW2.pdf (84.8 KB)

Any help will be appreciated.

Thanks.

Hi @Sangram_Kokate,
Welcome to the Community!
You can use the Read PDF Text activity and store its output in a String variable. Then use the Matches activity with a suitable regular expression to filter that string, so that the Matches output contains the number you need.

Example:

Thanks, Pablito, for your valuable suggestion, but I have to parse all the fields from that form, and there is no exact (distinct/unique) pattern for every field.
So I am looking at the Anchor Base activity to find text or an image (the field name) as the anchor, with an action that gets the text (the field value). That way we can get the exact field name and its corresponding value. Unfortunately, it is not working on my end.
Can you suggest a way to get it working?

Please check this:

With regex you can also point to the section from which you want to get the value :wink:
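To make the "point to a section" idea concrete, here is a minimal sketch in Python rather than UiPath. It anchors the regex on the field label and captures whatever value pattern follows it. The sample text and the helper name `value_after` are assumptions for illustration only; the text that Read PDF Text returns for the real form may place labels and values differently.

```python
import re

# Hypothetical extracted text, assuming each value directly follows its label.
text = (
    "a Employee's social security number\n"
    "1234 56 7890\n"
    "b Employer identification number\n"
    "11-22334455\n"
)

def value_after(label, value_pattern, text):
    """Return the first match of value_pattern that directly follows label, or None."""
    # re.escape makes the label literal; \s* allows a line break before the value.
    pattern = re.escape(label) + r"\s*(" + value_pattern + r")"
    match = re.search(pattern, text)
    return match.group(1) if match else None

ssn = value_after("Employee's social security number", r"\d+ \d+ \d+", text)
ein = value_after("Employer identification number", r"\d+-\d+", text)
print(ssn)
print(ein)
```

Because the label is part of the pattern, two fields with the same numeric shape can still be told apart, as long as the label and value stay adjacent in the extracted text.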

Thanks, @Pablito, I really appreciate your suggestion, but it is only the right solution when the input string is standard and fixed. In our case it is parsed PDF data, so it is going to vary a lot.
Can you shed some light on the “Anchor Base” activity?

Please have a look here:


There is also an example of how to use this activity.

Hello @Pablito, I have already gone through it and know its basic functionality, but in our case the specific element is not selected. That is my problem.
When I select the PDF using the “Screen Scraping” option, it says “method failed to scrape this UI Element”.
I have gone through the example video that covers scraping a specific element, but it is not working in my case.

This may mean that the text in your PDF is “hard-coded” or is represented as an image. In that case, only OCR/image-based activities can handle it.

But the strange thing is that if I use “Read PDF with OCR”, it returns an empty string.
Anyway, if I use the Anchor Base activity with Find Image to locate a particular field name, what should I use as the action instead of Get Text? (Get Text does not select the element, so it does not work as the action.)

Hi,
Please find the attached flow. BlankProcess.zip (91.2 KB)

Hello @kirti.iyer, thanks for your valuable help; I really appreciate it. However, this is the same pattern-matching technique, which will not be an exact answer, since our input is parsed PDF data that is going to vary. SSN and EIN are the two fields that have a fixed, standard pattern, but suppose both appear in the parsed data as plain 9-digit numbers (without any separators or spacing): how can we tell them apart? Also, fields 1 to 12 in the PDF, and the remaining fields too, contain only numbers. We cannot reliably detect them by regex pattern matching; we need a sound, foolproof technique that works with a specific element/anchor (Anchor Base, as discussed in the shared video).
Can you please suggest a way to use Get Text as the action for the Anchor Base activity?

Hi @Sangram_Kokate
Maybe this is helpful: BlankProcess.zip (91.4 KB)

This approach looks promising, but I notice there is no “Indicate region” option to select a particular area from the screen. How can we select a particular region?

It is based on the x-axis and y-axis position, plus the height and width, of a region of the PDF page. You can get data from a particular region of the PDF depending on the coordinates you give. You may need many attempts, adjusting the coordinates each time, to get the specific content.
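As a rough illustration of what region-based extraction does, the Python sketch below keeps only the words whose position falls inside a rectangle given by x, y, width and height. This is not how the attached workflow is implemented; the `Word` type, the `words_in_region` helper and all coordinates are made-up assumptions, and it presumes you already have word positions from some PDF or OCR tool.

```python
from typing import NamedTuple

class Word(NamedTuple):
    text: str
    x: float  # left edge of the word on the page (hypothetical units)
    y: float  # top edge of the word on the page (hypothetical units)

def words_in_region(words, left, top, width, height):
    """Join the words whose (x, y) position lies inside the given rectangle."""
    inside = [
        w for w in words
        if left <= w.x < left + width and top <= w.y < top + height
    ]
    # Read top-to-bottom, then left-to-right.
    inside.sort(key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in inside)

# Made-up positions for illustration; real coordinates must be found by trial.
words = [
    Word("1234", 40, 60), Word("56", 75, 60), Word("7890", 95, 60),
    Word("11-22334455", 40, 120),
    Word("48500.00", 300, 120),
    Word("6835.00", 450, 120),
]

ssn = words_in_region(words, left=30, top=50, width=200, height=20)
ein = words_in_region(words, left=30, top=110, width=200, height=20)
print(ssn)
print(ein)
```

The rectangle for the EIN excludes the two amounts on the same row because their x positions fall outside its width, which is why getting the coordinates right takes some experimentation.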

You can change and try the below regex for the first workflow.

^\d.\s\s\d.

^\d.-\d.?\s

I tried a lot with “Read text from specific region” as suggested, but I was not able to get the particular region to parse.
I think we need to locate it with Find Relative Image and Get OCR Text, but that is not working in my case. Can you tell me why it is not working?

How many types of PDFs do you have to extract data from?
Can you access the elements in the PDFs?
If possible, can you share a sample PDF with us? It would really help in finding a solution quickly.

Hello @Vamsi_Nori, I already shared the PDF in my question; the others have a similar format. I am not able to get the elements with the text scraping methods: “Native”, “Full text” and “OCR” all show empty results while recording. Only Read PDF Text works, and it returns the text of the whole page, from which we cannot get all the column values by pattern matching.
Can you check why I am not able to get elements from the PDF?

Hi @Sangram_Kokate,
You can extract the value by using the constant text near it.

For example:
Say you want to extract “Employee’s social security number” and its value is 1234 56 7890.

First, save the output of the Read PDF Text activity into a string.
Split the string into a string array on newlines and remove the empty entries.
Then find the position of the value relative to the constant text and extract it.

Let us assume that a is the array.
Say the label (Employee’s social security number) is the second element of the array, a[1], and the value (1234 56 7890) is the fourth element, a[3], so the value is 2 positions after the label.

Now loop over the array with index i:
if a[i].Contains("Employee’s social security number")
then
value = a[i+2]
end if

in this way you can extract the values that are required
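The steps above can be sketched in Python as follows. This is only an illustration of the relative-position idea, not UiPath code; the helper name `value_at_offset` and the sample text are assumptions, and the offset of 2 only holds for this made-up layout.

```python
def value_at_offset(text, label, offset):
    """Return the non-empty line `offset` positions after the line containing `label`."""
    # Split on newlines and drop empty entries, as described above.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for i, line in enumerate(lines):
        if label in line:
            target = i + offset
            # Guard against running past either end of the array.
            if 0 <= target < len(lines):
                return lines[target]
            return None
    return None

# Hypothetical Read PDF Text output: label at index 1, value at index 3.
sample = (
    "Form W-2\n"
    "a Employee's social security number\n"
    "\n"
    "For Official Use Only\n"
    "1234 56 7890\n"
)

ssn = value_at_offset(sample, "Employee's social security number", 2)
print(ssn)
```

The bounds check matters in practice: if the label appears near the end of the text, `a[i+2]` without a guard would raise an index error.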

regards,
Vamsi Nori

I really appreciate the logic, but in our current example, splitting the string on newlines does not put each expected value on its own line. For example, a snippet of the text read from the PDF looks like this:

b Employer identification number
1
Wages, tips, other compensation
2
Federal income tax withheld
11-22334455 48500.00 6835.00

All three field values are displayed on one line, and it is also doubtful that data entered by users will follow the same pattern when the PDF document is created.
So I think this approach will not lead to an exact solution.
If we could get the elements from the PDF, relative scraping would be the proper approach.
Are you able to get elements from the PDF?

Thanks,
Sangram

Hi,

I cannot find elements in the PDF.

In your case, you can extract “Employer identification number” by line position. Find the line that contains “Employer identification number”; the line holding the values is 5 positions after it. Split that line on spaces, so (11-22334455 48500.00 6835.00) becomes [“11-22334455”, “48500.00”, “6835.00”]. Then take the first element, b[0], of the split line a[i+5], where i is the position of the line that contains “Employer identification number”.
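Here is a small Python sketch of this idea, using the text snippet quoted earlier in the thread. In that snippet the values line sits five lines below the “Employer identification number” label, so the sketch uses an offset of 5; the helper name `values_after_label` is made up for illustration, and the offset is an assumption that may not hold for other PDFs.

```python
# Text snippet as quoted earlier in this thread.
snapshot = """b Employer identification number
1
Wages, tips, other compensation
2
Federal income tax withheld
11-22334455 48500.00 6835.00"""

def values_after_label(text, label, offset):
    """Split the line `offset` positions after `label` into whitespace-separated values."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for i, line in enumerate(lines):
        if label in line and i + offset < len(lines):
            return lines[i + offset].split()
    return []

b = values_after_label(snapshot, "Employer identification number", 5)
ein = b[0]    # first value on the shared line
wages = b[1]  # box 1: Wages, tips, other compensation
tax = b[2]    # box 2: Federal income tax withheld
print(ein, wages, tax)
```

This relies on every value being present and containing no spaces. If a box is left empty or a value contains a space, the positions within the split line shift, which is the weakness raised above.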