I have a scanned PDF from which I need to extract chemical composition values for elements such as S, Si, Mn, C, Ti, Ni, Mo, and B. The order of these fields may vary between documents, so I want to use Form Extractor with anchor-based extraction in Document Understanding. How can I set this up?
Hello @Vaishnavi_RP
The easiest way to extract chemical composition from a scanned PDF using UiPath Form Extractor is:
Steps to Implement:
- Load and Digitize the Document
Define the chemical fields (S, Si, Mn, C, Ti, Ni, Mo, B) in Taxonomy Manager, then use “Load Taxonomy” to load them into the workflow.
Use “Digitize Document” with an OCR engine such as OmniPage OCR, which tends to handle scanned PDFs well.
- Extract Data Using Form Extractor
Add a “Data Extraction Scope” and place the “Form Extractor” inside it.
Inside Form Extractor:
Upload a sample PDF.
Select a field (e.g., Sulfur (S)).
Click “Add Anchor” → Select label text like “Sulfur” or “S:”
Mark the corresponding value as the data field.
Repeat for other elements (Si, Mn, C, etc.).
- Validate and Export
Use “Present Validation Station” (optional) for user review.
Use “Export Extraction Results” to convert the results into a DataSet, which you can then write to Excel, JSON, or a database.
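To show why anchors cope with fields appearing in a different order, here is a minimal Python sketch of the idea. It is not how Form Extractor is implemented; the Word class, the value_right_of function, and the y_tol tolerance are hypothetical names made up for illustration. It assumes the OCR step gives you each word with its position, and it simply picks the nearest number to the right of the label on the same line.

```python
import re
from dataclasses import dataclass

@dataclass
class Word:
    text: str   # OCR text of the word
    x: float    # left edge of the word (hypothetical units)
    y: float    # vertical centre of the word

NUMBER = re.compile(r"^\d+(?:\.\d+)?$")

def value_right_of(words, label, y_tol=5.0):
    """Return the nearest numeric word to the right of `label` on the same line."""
    # The anchor is the first word equal to the label, ignoring a trailing colon.
    anchors = [w for w in words if w.text.rstrip(":") == label]
    if not anchors:
        return None
    anchor = anchors[0]
    # Candidates are numbers on roughly the same line, to the right of the anchor.
    candidates = [
        w for w in words
        if NUMBER.match(w.text)
        and abs(w.y - anchor.y) <= y_tol
        and w.x > anchor.x
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda w: w.x - anchor.x).text

# Made-up OCR output: the field order does not matter, only the label position.
words = [
    Word("Mn", 10, 100), Word("1.20", 60, 101),
    Word("S", 120, 100), Word("0.015", 160, 99),
    Word("Si:", 10, 130), Word("0.30", 60, 130),
]
```

Calling value_right_of(words, "S") looks up the value next to the "S" label wherever it sits on the page, which is the same principle the anchor in the template relies on.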
Alternative (If Layout Changes Too Much)
If the chemical values change position frequently, use the Regex Based Extractor instead of Form Extractor:
Add the Regex Based Extractor to the Data Extraction Scope.
Define patterns like:
\bS:\s*(\d+\.?\d*)
\bSi:\s*(\d+\.?\d*)
\bMn:\s*(\d+\.?\d*)
This works well when values are written in a consistent format.
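If you want to try the patterns outside UiPath before configuring the extractor, a small Python sketch like the one below can help. The function name extract_composition and the sample text are made up for illustration, and the pattern is a slightly stricter variant of the ones above: it escapes the decimal point, uses a word boundary so a label only matches as a whole word, and allows a space before the colon. UiPath uses .NET regular expressions, but this particular syntax should behave the same there.

```python
import re

# Element symbols from the question.
ELEMENTS = ["S", "Si", "Mn", "C", "Ti", "Ni", "Mo", "B"]

def extract_composition(text):
    """Return {symbol: value} for every 'Symbol: number' pair found in text."""
    results = {}
    for symbol in ELEMENTS:
        # \b keeps "S" from matching inside a longer word; \. is a literal dot.
        pattern = rf"\b{re.escape(symbol)}\s*:\s*(\d+(?:\.\d+)?)"
        match = re.search(pattern, text)
        if match:
            results[symbol] = float(match.group(1))
    return results

# Made-up OCR text with the fields in an arbitrary order.
sample = "Mn: 1.20  S: 0.015 Si:0.30\nC : 0.08 B: 0.002"
composition = extract_composition(sample)
```

Elements that are not present in the text (Ti, Ni, and Mo in this sample) are simply left out of the result, so you can check for missing fields afterwards.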
Would you like a sample workflow or regex patterns for different formats?
Thank you for your prompt response and for the anchor-based extractor guidance; it helped a lot.
Is there any way to handle scanned copies where a tick mark has been drawn over a chemical name? The tick mark overlaps the word, which reduces extraction accuracy.