THE WORLD WIDE Web Consortium's XML Recommendation opens with a list of 10 design goals. The first goal states: "XML shall be straightforwardly usable over the Internet." Straightforward or not, Extensible Markup Language (XML) is used extensively over the Internet, and has become a de facto standard for data interchange.
Because of its association with Internet applications, however, automation engineers have sidestepped XML, assuming it has little applicability to their daily work. This is a mistake. Though this technology has been used extensively for Internet-based applications, XML is an extremely simple and flexible data format with untold uses waiting to be uncovered and implemented in industrial controls and automation.
Alone, XML data is simply raw text that has little to offer automation engineers. But XML isn't alone. Developers everywhere have jumped aboard the XML bandwagon to create a seemingly bottomless reservoir of tools, applications, services, and standards, all designed to create, consume, translate, store, and present XML data. This infrastructure of supporting applications is what makes XML such a compelling choice for application data. This article will introduce XML's fundamental concepts for those who have so far managed to avoid this important technology. Parts 3 and 4 of this four-part article will address XML supporting technologies. [The ABCs of XML, Part 1 ran in CONTROL, June 06.]
Not a Typical Language
XML isn't a language in the sense that there are defined keywords, functions, or statements. XML is often compared to hypertext markup language (HTML) because it works well with HTML applications, has similar markup, and has been joined with HTML to create the XHTML specification. However, the HTML specification defines a list of element tags like <body>, <h1>, <b>, and <i> with defined behavior for HTML browsers.
XML lacks a defined set of tags and allows anyone to create their own set of tags and attributes to suit their own application needs. Instead, the XML specification defines a set of markup rules that must be followed for the marked up text to be interpreted as XML data.
10 Well-Formed Rules
XML is organized in a logical or physical structure called a document. An XML document may be a file on disk, it may be streamed from a server, or it may be hard-coded text inside an HMI VBA application. Though the data may have many different sources, the document metaphor still applies as long as it's well formed. To be considered well formed, an XML document must adhere to the constraints defined in the W3C's XML specification. These constraints can be distilled into 10 easy rules.
- XML is just plain old text
XML is designed to be human-readable as text. This means any text editor can be used with an XML document. A simple text editor will treat an XML document just as it would an INI file, a CSV file, or any text file.
- XML is data
XML is designed as a flexible self-describing data structure. By itself, XML can't do anything, nor does it define how data should be processed or handled. By contrast, HTML includes both data and a description of how it should be displayed in a browser.
- XML documents must have one root element
There can be only one top-level root element in an XML document, and all other elements must be between the root element start and end tags.
- XML white space data is preserved
HTML reduces consecutive white space characters to a single space character. With XML, white space is interpreted as data, just as any other character.
- XML naming rules
Element names can't include white space, must start with a letter or an underscore, and can't include characters that are used for markup such as <, >, ;, and &, among others. It's generally a good idea to limit element and attribute names to letters, numbers, and underscores.
- XML elements must be closed
An element can be closed with an end tag, or optionally with the shorthand notation for empty elements. By contrast, HTML doesn't require that elements be closed. In fact, most browsers will attempt to render any HTML element whether or not it's closed properly. XML is not so forgiving. An empty element is one with no value and no child elements, though it may have attributes. An empty element may be closed with the shorthand notation /> at the end of the start tag. For example, <Value /> is equivalent to <Value></Value>.
- XML elements must be properly nested
In HTML, elements were allowed to overlap like this: <b><i>bold and italic text</b> italic only</i>. This type of element crossing is not allowed in XML. An element that starts inside a parent element must end inside the same parent before the parent element is closed.
- XML is case sensitive
An XML tag <recipe> isn't the same as <Recipe>. However, HTML isn't case sensitive, so <h1> is identical to <H1>.
- XML attribute names must be unique within an element
An element may have any number of attributes but each attribute name must be unique. The following example is incorrect:
<People Person="John Doe" Person="John Smith" />
This example could be structured properly as follows:
<Person Name="John Doe"/>
<Person Name="John Smith"/>
- XML attribute values must be quoted
XML attribute values must be enclosed either in single quotes or double quotes. If an attribute value contains a double quote character, then enclose the value in single quotes. Likewise, if the attribute value contains a single quote, enclose the value in double quotes. For attribute values that may include either type of quotation character, XML's predefined character entities may be used, including &quot; for the double-quote character and &apos; for the single-quote character.
In addition, the markup for a comment in XML is identical to HTML. The comment opens with <!-- and closes with --> and may span multiple lines.
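Several of these rules are easy to confirm with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the sample fragments and the helper name is_well_formed are invented for illustration. Each fragment breaks (or obeys) one rule, and the parser either accepts it or raises an error.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments, one per rule being exercised.
samples = {
    "single root, closed, nested": '<Recipe><Step Name="Mix"/></Recipe>',
    "two root elements": "<Step/><Step/>",
    "unclosed element": "<Recipe><Step></Recipe>",
    "overlapping elements": "<b><i>text</b></i>",
    "case mismatch": "<recipe></Recipe>",
    "duplicate attribute": '<People Person="John Doe" Person="John Smith"/>',
    "unquoted attribute": "<Step XPos=600/>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well formed' if ok else 'rejected'}")

# White space inside an element value is kept as data.
padded = ET.fromstring("<Value>  a   b  </Value>").text

# A predefined entity lets a quoted attribute value carry a quote character.
note = ET.fromstring('<Step Note="5&quot; pipe"/>').get("Note")
```

Only the first fragment is accepted. Unlike a browser rendering sloppy HTML, the parser stops at the first violation rather than guessing at the author's intent.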
An XML document may begin with an optional XML declaration. The XML declaration must precede all other content, and isn't considered part of the XML document. It's used to provide information to XML processors about the document's content. Because the declaration is not an element, it must not have a closing tag. The declaration looks like this:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
If the declaration is included, version is the only required attribute, and must have a value of either 1.0 or 1.1. Version 1.1 supports special Unicode character handling functionality that's rarely needed, and, therefore, version 1.0 is used almost exclusively.
The encoding attribute defines the character encoding used by the document, so that an XML processor may properly parse the document. The default encoding used by XML processors is UTF-8.
Any number of processing instructions may appear below the XML declaration and before the root element. Each must be enclosed in <? and ?> like the XML declaration, and provides application-specific handling information. A Microsoft Word 2003 XML document may include the following processing instruction, which tells the Windows operating system to identify the XML document as an MS Word file. When double-clicked, an XML file with this processing instruction will open in MS Word, as shown:
<?mso-application progid="Word.Document"?>
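To see how a parser treats the declaration and processing instructions, here is a small sketch using Python's standard xml.dom.minidom module. The processing instruction target my-app and the element names are invented for illustration; they don't belong to any real product.

```python
from xml.dom import minidom

# Hypothetical document: the processing instruction target "my-app" and
# the element names are invented for illustration.
text = (
    '<?xml version="1.0" encoding="utf-8" standalone="yes"?>'
    '<?my-app mode="import"?>'
    '<Recipe><Step/></Recipe>'
)

doc = minidom.parseString(text)

# Values from the XML declaration are exposed on the document object,
# not as an element in the tree.
print(doc.version, doc.encoding, doc.standalone)

# Processing instructions appear as nodes ahead of the root element.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)

# The root element is still Recipe.
print(doc.documentElement.tagName)
```

The parser hands the instruction's target and data to the application untouched. What to do with them is entirely up to the application that recognizes the target.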
An XML document has a specific structure of element names, attribute names, and hierarchical parent-child relationships. As long as a document meets the requirements for well-formedness, it can have any structure and contain any data. This flexibility is what makes XML extensible.
However, applications that interpret XML documents have expectations that the XML will adhere to a particular structure. Validation is the process of checking an XML document for conformance to a defined structure or schema. A schema can be defined in the XML document, or a reference can point to an external schema document. There are multiple standards for defining a schema including Document Type Definition (DTD), XML Schema Definition (XSD) language, and XML Data Reduced (XDR). An XML document that adheres to a defined schema is judged to be valid.
A schema isn't required when developing XML applications, and, in fact, can significantly complicate XML application development. When you control a document's content and related applications, you can work more efficiently without a schema.
Software vendors that support XML data normally publish a schema, so other applications can properly validate content before working with a document. A control system vendor that supports import of XML data into the control system will likely validate a document before the import process to prevent loading data that may lead to a control system fault.
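When no schema is in play, an application can still guard itself with a few hand-written structural checks before acting on a document. The sketch below is a stand-in for that idea only; it is not DTD or XSD validation, which Python's standard library doesn't provide. The rule set, element names, and the function name find_problems are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: every Step must carry these child elements.
REQUIRED_STEP_CHILDREN = ("Name", "UnitAlias")


def find_problems(text):
    """Return a list of structural problems; an empty list means none were found."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        return [f"not well formed: {err}"]

    problems = []
    if root.tag != "Recipe":
        problems.append(f"root element is {root.tag}, expected Recipe")
    for index, step in enumerate(root.iter("Step"), start=1):
        for child in REQUIRED_STEP_CHILDREN:
            if step.find(child) is None:
                problems.append(f"Step {index} is missing {child}")
    return problems


good = "<Recipe><Step><Name>Mix</Name><UnitAlias>Tank1</UnitAlias></Step></Recipe>"
bad = "<Recipe><Step><Name>Mix</Name></Step></Recipe>"

print(find_problems(good))
print(find_problems(bad))
```

A published schema does the same job declaratively and far more thoroughly, which is why vendors that accept imported XML tend to rely on one.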
XML namespaces solve a problem that can occur when an element name may have different meanings within a single document. For example, the element name template is an XSLT keyword, and its meaning is different from the template element used in an MS Word XML document. All elements and attributes in an XML document are included in a namespace, even if a namespace isn't explicitly declared. When no namespace is defined in a document, content is included in the default null namespace. A namespace may be defined as an attribute of the start tag of an element with the following format:
xmlns:prefix="namespace"
Where a namespace is declared for an element, all child elements with the same prefix are included in the same namespace. The element where the namespace is declared may also be included in the namespace if the same prefix is used in the element name.
<cc:Step cc:XPos="600" cc:YPos="600" AcquireUnit="yes">
In the sample above, the prefix cc refers to the namespace http://www.cascon.com/Recipe. Elements included in this namespace include Recipe, Step, and Name. The element UnitAlias and the attribute AcquireUnit are included in the default null namespace.
The namespace prefix cc serves as a shorthand or alias notation for the full namespace http://www.cascon.com/Recipe. The actual namespace may be any string value, but it is meant to be globally unique. XML parsers don't enforce uniqueness, nor do they expect any particular notation such as a web uniform resource identifier (URI). A web-style URI is frequently used because a real web URI like http://www.cascon.com/ is guaranteed to be globally unique across the Internet, which greatly minimizes the chance of colliding namespaces.
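The alias role of the prefix can be demonstrated with a short sketch using Python's xml.etree.ElementTree. The two fragments below are hypothetical; they reuse the namespace from the text but differ in the prefix they choose. The helper name tags is also invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = "http://www.cascon.com/Recipe"

# Two hypothetical fragments that differ only in the prefix they choose.
with_cc = f'<cc:Recipe xmlns:cc="{NS}"><cc:Step/><UnitAlias/></cc:Recipe>'
with_r = f'<r:Recipe xmlns:r="{NS}"><r:Step/><UnitAlias/></r:Recipe>'


def tags(text):
    """List every element name in the parser's {namespace}local form."""
    return [element.tag for element in ET.fromstring(text).iter()]


print(tags(with_cc))
print(tags(with_r))
```

Both calls report the same names, because the parser discards the prefix and keeps only the full namespace string. UnitAlias, which has no prefix and no default namespace in scope, is reported with no namespace at all.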
To simplify this example, the namespace can be declared as the default namespace with no prefix as shown in the following example:
<Step XPos="600" YPos="600" AcquireUnit="yes">
Notice that the namespace attribute xmlns no longer includes the prefix definition cc. Without a prefix, a namespace becomes the default namespace for the element where it's declared. This makes the Recipe element and all its child elements members of the namespace http://www.cascon.com/Recipe. Default namespaces apply to elements only, not to attributes. Therefore, the attributes of the Step element are included in a null namespace (equivalent to xmlns=""), and not the default namespace. This special behavior for attributes can be quite confusing, but it isn't difficult to work around as long as you understand how it works.
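The attribute quirk is easiest to see by asking a parser what it found. This is a minimal sketch with Python's xml.etree.ElementTree; the fragment is hypothetical, modeled on the Step element and namespace discussed in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the Step element discussed in the text.
text = (
    '<Recipe xmlns="http://www.cascon.com/Recipe">'
    '<Step XPos="600" YPos="600" AcquireUnit="yes"/>'
    '</Recipe>'
)

root = ET.fromstring(text)
step = root[0]

# Elements pick up the default namespace.
print(root.tag)
print(step.tag)

# Attributes do not: their names carry no namespace.
attribute_names = sorted(step.attrib)
print(attribute_names)

# A lookup that omits the namespace finds nothing.
plain_lookup = root.find("Step")
qualified_lookup = root.find("{http://www.cascon.com/Recipe}Step")
```

The practical consequence is in the last two lines: once a default namespace is declared, code that searches for elements must use the qualified name, while attribute lookups continue to use the bare name.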
You'll likely not need to bother with namespaces in documents created for internal purposes. However, you'll need to understand namespaces when working with vendor-generated XML files. You will see the importance of namespaces in Part 3, which will cover XSLT.
A Sample of XML
THE SAMPLE fragment of XML (below) was lifted from a recipe exported from Rockwell Automation's RSBatch product. Rockwell defined the element and attribute names for describing an RSBatch recipe. Other system vendors may define a separate set of elements and attributes to describe a batch recipe. If you are a system integrator who works with batch recipe software from different system vendors, you may choose to define your own system-agnostic batch recipe XML data structure (or schema) for internal development purposes that can easily be converted to or from a vendor-specific structure.
<!-- This is an XML comment -->
<Step XPos="600" YPos="600" AcquireUnit="true">
The first thing to notice about the code fragment is the angle brackets (< and >), which mark an XML tag. An XML element includes a start tag like <Step>, an end tag like </Step>, and everything between the two. Notice that the Step element contains four child elements: Name, StepRecipeID, UnitAlias, and FormulaValue. The FormulaValue element contains five child elements. This parent/child relationship between elements shows that XML can support hierarchical data structures.
XML is designed to be self-describing. A document's data is stored in element values and attribute values, and the element and attribute names describe the data they hold, much as table names and column names describe data in a relational database.
Data stored as an element value is the text between start and end tags. In this sample, the value for the element EngineeringUnits is RPM. Data stored as an attribute value is found on the quoted right side of a Name="Value" pair. In our example, the Step element contains three attributes named XPos, YPos, and AcquireUnit, which have attribute values of 600, 600, and true respectively.
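Reading these values in code takes only a few lines. The sketch below uses Python's xml.etree.ElementTree on a hypothetical fragment: the element names follow the description above, but the fragment is not the actual RSBatch export, FormulaValue is shown with fewer children, and every value other than 600, 600, true, and RPM is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names follow the description in the text,
# but the values (other than 600, 600, true, and RPM) are invented and this
# is not the actual RSBatch export.
text = """
<Step XPos="600" YPos="600" AcquireUnit="true">
  <Name>AGITATE:1</Name>
  <StepRecipeID>AGITATE</StepRecipeID>
  <UnitAlias>MIXER</UnitAlias>
  <FormulaValue>
    <Name>SPEED</Name>
    <EngineeringUnits>RPM</EngineeringUnits>
  </FormulaValue>
</Step>
"""

step = ET.fromstring(text)

# Attribute values come from the Name="Value" pairs on the start tag.
x_pos = step.get("XPos")
acquire_unit = step.get("AcquireUnit")

# Element values are the text between the start and end tags.
units = step.findtext("FormulaValue/EngineeringUnits")

# Child elements are reached by walking the hierarchy.
children = [child.tag for child in step]

print(x_pos, acquire_unit, units)
print(children)
```

Note that every value comes back as text. XML itself assigns no data types, so converting "600" to a number or "true" to a Boolean is the application's job.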
Resources on XML
There are many reference books and online resources for learning more about XML. Recommended online resources include: