Document Understanding - XML Extraction?

Hi all,

With the Document Understanding activities, when extracting data (in the Data Extraction Scope), there is a ‘Form Extractor’, a ‘Regex Based Extractor’, etc. Is there an equivalent activity for XML-based extraction that can be installed?

Currently the documentation does not list an XML extractor.
Let us know your details. Maybe we can help set up a custom XML extraction approach that can be integrated into the flow.

Is it something that we could do in a custom code block (or similar)?

Currently, we have a process which extracts fields from an XML document by supplying 2 elements of an XML path (a top-level ‘parent’ field and a lower-level ‘child’ field, for example “Customer” and then “FirstName”).
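
To make the parent/child lookup concrete, here is a minimal sketch in Python (for illustration only; inside a UiPath workflow the same idea would go through the XML activities or an Invoke Code block). The sample XML, the function name `extract_field` and the assumption that the documents use no XML namespaces are all mine, not taken from the actual process.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real client files will look different.
SAMPLE_XML = """<Order>
  <Customer>
    <FirstName>Ada</FirstName>
    <LastName>Lovelace</LastName>
  </Customer>
  <Supplier>
    <FirstName>Grace</FirstName>
  </Supplier>
</Order>"""


def extract_field(xml_text, parent, child):
    """Return the text of the first <child> that sits directly under a <parent>.

    Assumes element names without namespaces. Returns None if no match exists.
    """
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree (including the root) for elements named `parent`.
    for parent_el in root.iter(parent):
        child_el = parent_el.find(child)  # direct children only
        if child_el is not None:
            return (child_el.text or "").strip()
    return None


print(extract_field(SAMPLE_XML, "Customer", "FirstName"))
print(extract_field(SAMPLE_XML, "Supplier", "FirstName"))
```

Supplying the parent as well as the child is what keeps “FirstName” under “Customer” apart from “FirstName” under “Supplier”.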

Do your XML files have a consistent structure, or are they completely different from one another? If they have a common structure, you can use the XML activities that come with the UiPath.WebAPI.Activities package.

Otherwise, if they don’t have a common structure, you can treat them as text and try to extract the data using RegEx or by searching for keywords in the string.

Hope it helps!

They do have a consistent structure (for the most part). However, what we’re trying to do is process different instruction documents from different clients. Some are XML documents (where we need to extract XML data), some are PDF files (which we use regex for)…

So XML Activities should be useful in your case

We prefer to process XML with XML tools / APIs.

If only 2 elements need to be retrieved and the XML element names are unique within the document, a regex approach could be considered.
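
As a rough illustration of that regex route, here is a small Python sketch. It only holds up under the stated condition that the element name occurs once in the document; the function name `extract_by_regex` and the sample text are made up for the example. It does not decode entities, handle CDATA or cope with self-closing or nested same-name elements, which is part of why XML tooling is usually the safer choice.

```python
import re

# Hypothetical sample; element names are unique within the document.
SAMPLE_XML = '<Order><Customer id="7"><FirstName> Ada </FirstName></Customer></Order>'


def extract_by_regex(xml_text, element_name):
    """Return the raw text between <element_name ...> and </element_name>, or None."""
    name = re.escape(element_name)
    # Opening tag with optional attributes, lazy capture up to the closing tag.
    pattern = rf"<{name}(?:\s[^>]*)?>(.*?)</{name}>"
    match = re.search(pattern, xml_text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


print(extract_by_regex(SAMPLE_XML, "FirstName"))
```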

Otherwise we would implement a custom XML Extractor step (we have the feeling that it is more than just a few lines within an Invoke Code) by:

  • define the document model as usual
  • define the XML Extractor config (e.g. which field maps to which XPath)
  • extract the values, driven by the Document Type and its fields, using the Extractor config
  • manipulate / modify the ExtractionResult
    ExtractorResult Class

Just to list some of the essential building blocks.
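
A minimal sketch of the config-driven part of those steps, in Python for illustration only. Everything named here (`EXTRACTOR_CONFIG`, `extract_fields`, the “CustomerInstruction” document type, the XPaths and the sample XML) is a hypothetical stand-in. It uses the limited XPath subset of Python's standard `xml.etree.ElementTree` and returns a plain dictionary; mapping that onto the Document Understanding ExtractionResult is the part that would still need to be built in the workflow and is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical extractor config: document type -> field name -> XPath.
EXTRACTOR_CONFIG = {
    "CustomerInstruction": {
        "FirstName": "./Customer/FirstName",
        "LastName": "./Customer/LastName",
        "Amount": ".//Payment/Amount",
    },
}

# Hypothetical sample document (no Payment element on purpose).
SAMPLE_XML = """<Instruction>
  <Customer>
    <FirstName>Ada</FirstName>
    <LastName>Lovelace</LastName>
  </Customer>
</Instruction>"""


def extract_fields(xml_text, document_type, config=EXTRACTOR_CONFIG):
    """Look up every configured field of `document_type` via its XPath.

    Returns {field: {"value": str or None, "found": bool}}.
    """
    field_paths = config[document_type]  # KeyError for an unknown document type
    root = ET.fromstring(xml_text)
    results = {}
    for field, xpath in field_paths.items():
        node = root.find(xpath)
        value = node.text.strip() if node is not None and node.text else None
        results[field] = {"value": value, "found": value is not None}
    return results


for field, result in extract_fields(SAMPLE_XML, "CustomerInstruction").items():
    print(field, result)
```

Keeping the XPaths in a config rather than in code means a new client format only needs a new config entry, which matches the idea of driving extraction from the Document Type and its fields.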