What is the best way to select an anchor when extracting a value from a PDF with the form-based extractor in the DU framework?
There are scenarios where I need to extract a year. When the ‘year’ label is used as the anchor, the required year is extracted, but so is some additional, irrelevant data from the PDF document.
It might be because of the range of the region you are selecting. However, DU has a post-processing step, and you can manipulate the extracted data there to obtain the result you want.
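As a rough illustration of that kind of post-processing, here is a minimal sketch in Python. It assumes the extractor returned the year mixed with surrounding text, and that a year is a four-digit number between 1900 and 2099. The helper name `extract_year` is hypothetical and not part of the DU framework; in a real workflow the same idea would be written with whatever string or regex activities you have available.

```python
import re
from typing import Optional

# Hypothetical helper, not part of the DU framework.
# Matches a standalone 4-digit year (1900-2099), i.e. one that is not
# embedded inside a longer run of digits such as an ID or phone number.
YEAR_PATTERN = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def extract_year(raw_value: str) -> Optional[str]:
    """Return the first standalone 4-digit year in raw_value, or None."""
    match = YEAR_PATTERN.search(raw_value)
    return match.group(0) if match else None


# Example: the field came back with extra text around the year.
print(extract_year("Year 2021 Ref No 45"))
```

This only cleans up the value after extraction. It does not fix the cause, which is more likely the region or anchor selection discussed in the rest of the thread.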
Thank you for responding.
Yes, I get that post-processing can be used to manipulate the data. But the data gets extracted as separate line items (screenshot below). I also have the following questions.
- Generally, what rule of thumb should we follow when choosing an anchor? Maybe I am missing something basic.
- Is there any best practice for choosing an anchor in the DU form extractor?
- What exactly do you mean by the range of the region? The form (PDF) contains a table, and a value can be entered in one of its cells. So I have selected the whole table cell where the value will be found.
Screenshot of the extraction below.
Please check this step-by-step guide.
Also, when indicating a region: if the cell sizes change or there are no fixed boundaries, the table data may come out differently. You might indicate a region in one file, but in another file there might be two rows in that same region instead of one. There is no hard and fast rule as such, but it is advised to use as precise a region as possible rather than a larger area.
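To make the point about region size concrete, here is a small Python sketch. It is only a geometric illustration, not how the form extractor works internally: the `Box` and `Word` types, the coordinates, and the "word center inside region" rule are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float

    def contains_center_of(self, other: "Box") -> bool:
        # A word counts as inside the region if its center point is inside.
        cx = (other.left + other.right) / 2
        cy = (other.top + other.bottom) / 2
        return self.left <= cx <= self.right and self.top <= cy <= self.bottom


@dataclass(frozen=True)
class Word:
    text: str
    box: Box


def words_in_region(words: List[Word], region: Box) -> List[str]:
    """Return the text of every word whose center falls inside region."""
    return [w.text for w in words if region.contains_center_of(w.box)]


# Made-up page: this file has two text rows where the template had one.
words = [
    Word("2021", Box(100, 50, 140, 60)),
    Word("Remarks", Box(100, 65, 160, 75)),
]

wide_region = Box(90, 45, 200, 80)   # the whole cell as drawn on the template
tight_region = Box(90, 45, 200, 62)  # only the line where the value sits

print(words_in_region(words, wide_region))   # picks up both rows
print(words_in_region(words, tight_region))  # picks up only the year
```

Under these assumptions, the wide region returns the second row along with the year, while the tight region returns only the year, which is the behavior described above.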
Thanks for sharing the step-by-step guide on extraction.
I get that the area should be precise. Can you confirm one thing: let's say there is a portion where the first name can be displayed. Is it OK to select only 50% of that area when creating the template, or do I always need to select the whole 100%?
If the first name box is fixed, then select the whole area.
Otherwise, select only the area where the first name would appear. If any other identifiers might be found nearby, make sure you don't include that area.
Q1: Generally I select the whole area where data can be placed (option 1 in the screenshot). Or should I mark the region as shown in option 2 when creating the template and marking fields?
Q2: My main question was about the anchor. I can select a precise area based on your advice and test it. However, what is the best way to select the anchor to support that selection of a precise area?
The first way (option 1) is advised.
Also, are these scanned PDFs?
You can select ‘City’ as the anchor.
I have selected the whole possible region (as shown above) for the city to be extracted, and I also used ‘City’ as the anchor.
That still does not work, even after I added a few other labels in the PDF as anchors. For some fields I had to select 5-6 anchors to get the value extracted.
My question: is it advisable to have that many anchors, or is there a best practice we should generally follow?
The behavior is the same in both cases: i) scanned, ii) native.
It is advised to select multiple anchors if one is not reliable or not giving results, so selecting multiple is fine.
In your case, it looks like the form extractor is not able to extract the whole information. Please try a machine learning extractor: use a pretrained model or train a new one.