01. Writing an XML document

Every XML document should begin with a line that identifies the document type as XML. This XML document identifier (formally, the XML declaration) is written like a processing instruction, in a special tag that starts with <? and ends with ?>. When present, it must appear at the very start of the document, with nothing before it. Processing instructions use the same <? ... ?> syntax and can also be used to specify a stylesheet that is to be used with that document.

The XML document identifier tag is named xml, to signify it is part of the XML specification. It should state the XML version number in its version attribute (almost always 1.0) and the character encoding used by the document in its encoding attribute. Typically, the character encoding for English language documents is UTF-8 or the Latin-1 encoding ISO-8859-1, which is closely related to the Windows-1252 encoding used by Windows. Languages with larger character sets, such as Japanese, often use UTF-16 encoding.

Optionally, the XML identifier tag can also include a pseudo-attribute called standalone that can have a value of yes or no. If omitted, its value is assumed to be no, which means that the document may rely on declarations in an external DTD. This is almost always desirable, so standalone can be safely omitted unless there is some special reason to include it.

<?xml version = "1.0" encoding = "UTF-8" ?>
<doc>
<msg>Hello World!</msg>
</doc>
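
When the standalone pseudo-attribute is included, it goes after encoding. A minimal sketch of the same document, this time declaring that it does not rely on any external DTD:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!-- standalone="yes" states that no external declarations are needed. -->
<doc>
<msg>Hello World!</msg>
</doc>
```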

Styling XML with CSS

A web browser will display the entire content of an XML document, tags included, unless the document includes a reference to a stylesheet that tells the browser how to display the element contents. For instance, the XML document listed below will display its entire content when opened in a web browser:

<?xml version = "1.0" encoding = "UTF-8" ?>
<!-- Uncomment the line below to reference the stylesheet. -->
<!-- <?xml-stylesheet type = "text/css" href = "style.css" ?> -->
<subs>
<title>Hot Subs</title>
<desc>Piled high with flavor!</desc>
<sub>#1 Hot Pastrami $7.25</sub>
<sub>#2 Roast Turkey $6.50</sub>
<sub>#3 Cheese Steak $7.00</sub>
<plus>Add fries for just $1.25</plus>
</subs>

Stylesheets for XML documents can be created using the familiar Cascading Style Sheets (CSS) language that's used to style HTML.

/* XML elements are inline by default, so block display is set explicitly. */
subs { display:block; width:250px; background:orange; }
title { display:block; color:maroon; font-weight:bold; font-size:36pt; }
desc { display:block; color:maroon; font-size:18pt; }
sub { display:block; padding:5px; border-bottom:1px solid maroon; font:12pt monospace; background:yellow; }
plus { background:maroon; color:white; padding:5px; font:bold 12pt monospace; }
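
With the comment markers removed from the stylesheet line, the start of the document would read as sketched below. This assumes the CSS rules are saved as style.css in the same folder as the XML file; only the first elements are repeated here.

```xml
<?xml version = "1.0" encoding = "UTF-8" ?>
<?xml-stylesheet type = "text/css" href = "style.css" ?>
<subs>
<title>Hot Subs</title>
<!-- Remaining elements are unchanged. -->
</subs>
```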

Understanding XML syntax

Although XML tags look very much like those in HTML, greater care must be taken with XML tags to adhere to the stricter syntax requirements of XML. Where an XML document has no syntax errors it is said to be "well-formed" and can be processed by an XML parser. Conversely, where the parser discovers a syntax error it will not process the document. This means that authors of XML documents must always ensure they are well-formed with regard to the XML document structure rules and naming conventions.

Document structure

XML documents must have exactly one root element, which contains all other nested elements.

Every XML start tag must have a matching end tag, and elements must be properly nested. XML is case-sensitive, so tag names with differing letter case are regarded as entirely unrelated tags. Using only lowercase for all tag names avoids confusion.

XML elements that contain no data are said to be "empty" and can combine a start tag and end tag by inserting a "/" at the end of a single tag. For instance, <none/> is an empty element. The purpose of an empty XML element is often to supply an attribute value. All attribute values in XML must be contained within quotes. For instance, <member id="21" /> is an empty XML element that contains a correctly quoted attribute value.

XML tags may not contain comments, nor can element content data contain the unescaped "<" or "&" characters.
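
The structure rules above can be seen together in one small document. This is only a sketch: the element and attribute names are hypothetical, and the forms that would break a rule are shown inside comments so that the listing itself stays well-formed.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Exactly one root element contains everything else. -->
<order>
  <!-- Tags are case-sensitive: <Item>...</item> would NOT match. -->
  <item>Pastrami</item>
  <!-- Elements must nest, not overlap: <b><i>text</b></i> is an error. -->
  <note><b><i>Extra hot</i></b></note>
  <!-- An empty element with a quoted attribute value. -->
  <paid status="yes" />
  <!-- The "<" and "&" characters are escaped in content data. -->
  <rule>price &lt; 10 &amp; qty &gt; 0</rule>
</order>
```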

Naming conventions

XML tag names may only begin with a letter or the underscore character. The rest of the name may consist of any mixture of letter, number, underscore, dot or hyphen characters. Spaces are not allowed in XML tag names, and names beginning with the string "xml", in any case combination, are reserved by the XML specification and should not be used. For instance, <mytag>, <_mytag>, and <my-tag1.extra> are all valid names, but <1mytag> and <my tag> are invalid, and <xmltag> uses the reserved prefix.

<?xml version="1.0" encoding="UTF-8" ?>
<club>
<members>
<name id = "21" />
<name id = "22">Mike McGrath</name>
</members>
</club>

This document has one root <club> element. It avoids case errors by using only lowercase names that adhere to the XML naming conventions. The nested elements do not overlap and both attribute values are correctly contained within quotes. The first <name> element is empty. This could have been written as <name id="21"></name> but is instead a combined start and end tag - so all start tags have a matching end tag.
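
The naming conventions can be checked in the same way. The sketch below uses the valid names mentioned earlier inside a hypothetical <names> root element, and keeps the invalid ones inside a comment so the listing remains well-formed:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<names>
  <mytag>begins with a letter</mytag>
  <_mytag>begins with an underscore</_mytag>
  <my-tag1.extra>mixes letters, a hyphen, a number and a dot</my-tag1.extra>
  <!-- Not allowed: <1mytag> begins with a number, <my tag> contains a space. -->
</names>
```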

Correcting XML errors

When the XML parser discovers a syntax error it immediately stops processing and provides a helpful report describing the nature of the error. This information may be used to help trace the cause of the error so it can be rectified.

The ^ pointer indicates where the parser stopped. The cause of the problem is not immediately obvious, as the error is not actually in the tag displayed.
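
As an illustration of how the reported position can differ from the real cause, consider this deliberately broken variant of the club document, in which the end tag of the second <name> element has been left out. A parser would typically stop at </members>, because that is where the mismatch becomes detectable, even though the actual mistake is on the line before it. The exact wording and layout of the report vary between parsers, so none is reproduced here.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Deliberately NOT well-formed: one end tag is missing below. -->
<club>
<members>
<name id = "21" />
<name id = "22">Mike McGrath
</members>
</club>
```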

Employing an XML editor

XML documents are simply plain text that can be created in any text editor, but specialized XML editors offer significant benefits. They provide syntax highlighting, to color XML keywords for clarity in text view, hierarchical representation of elements in grid view, and the ability to quickly see how the document will appear in browser preview. Most importantly, they also let you check that the document is well-formed, and valid against schema rules.

A popular commercial XML editor is the XMLSpy application, which is available for download as a fully-featured 30-day evaluation at http://www.xmlspy.com.

Adding comments and entities

It's always good practice to comment any source code to make its intention clear to other people, or when revisiting the code later. XML uses exactly the same syntax as HTML for comments so that any text between <!-- and --> is ignored by the XML parser.

Comments may not be inserted within an XML tag, nor may they include a double hyphen - as this confuses the parser. Usefully, you can "comment out" a section of XML code by enclosing it within comment tags to temporarily hide it from the parser.
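
A short sketch of these comment rules, reusing element names from the earlier club example. The second <name> element is "commented out", so the parser ignores it:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- A comment may appear before the root element. -->
<club>
  <!-- A comment may appear between elements. -->
  <name id="21" />
  <!-- Commented out for now:
  <name id="22">Mike McGrath</name>
  -->
  <!-- Not allowed: placing a comment inside a tag,
       or writing two hyphens in a row inside a comment. -->
</club>
```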

The XML specification defines five special pieces of code for characters that are normally recognized as part of the XML language. These are called "character entities" and are used to include those characters as part of an element's content data. Character entities have a special syntax to represent the character in a way that prevents the XML parser interpreting the character as code:

Entity     Character
&lt;       represents the < left angled bracket
&gt;       represents the > right angled bracket
&amp;      represents the & ampersand
&apos;     represents the ' apostrophe
&quot;     represents the " quotation mark
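
A minimal sketch showing all five entities in use. The element and attribute names are hypothetical; the first comment gives the text the parser hands on once the entities are resolved:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<quiz>
  <!-- Read by the parser as: Is 3 < 5 & 5 > 3? -->
  <question>Is 3 &lt; 5 &amp; 5 &gt; 3?</question>
  <!-- The quot entity puts a quotation mark inside a double-quoted attribute value. -->
  <answer hint="Say &quot;yes&quot;">It&apos;s true</answer>
</quiz>
```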

XML also supports character references to Unicode character values, using the syntax &#nnnn; where nnnn is the decimal character number. For instance, in Unicode the © copyright character has the number 0169, so it can be written in XML as &#0169;

In Windows, you can find the Unicode value of a character with the Character Map application under the System Tools menu. Click on the © character to see it can be written with Alt+0169 - its Unicode number.

<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type = "text/css" href = "latte.css" ?>
<coffee>
<!-- Italian original. -->
<it>Caffe e latte</it>
<!-- Replace "e" in caffe with &#0232; -->
<!-- Literal English translation.-->
<en>Coffee and milk</en>
<!-- Replace "and" with &amp; -->
<!-- Familiar English term. -->
<fam>Latte</fam>
</coffee>
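
Following the instructions in those comments gives the version sketched below, in which the character reference &#0232; stands for the accented è and the &amp; entity stands for the ampersand:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type = "text/css" href = "latte.css" ?>
<coffee>
<!-- Italian original. -->
<it>Caff&#0232; e latte</it>
<!-- Literal English translation. -->
<en>Coffee &amp; milk</en>
<!-- Familiar English term. -->
<fam>Latte</fam>
</coffee>
```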

Adding XML attributes

Each XML element can include any number of attributes, named in accordance with the same naming conventions as for tag names. Within quotes, data can then be assigned to each attribute in much the same way it can be included within the element tags. This creates a dilemma for the XML author - when to use attributes to contain data, and when to use nested elements?

The XML specification offers no guidance on this question but most agree it's an "implementation decision" that the XML author must make, based on the data's relevance to the element. Some prefer to use a nested element when the data can be considered to be a constituent part of the element and an attribute when it's not. For instance, arm and height are both properties of a human being but only the arm is a constituent part - you can cut off your arm, but not your height. In this case the height data would be assigned to an attribute and the arm data contained within a nested element. This is not a good solution, however, because attribute values can only be string data types, which means assigned numeric values cannot be treated numerically later.
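
For reference, the constituent-part approach described above would produce markup along the lines of this sketch. The element names, the attribute name and the values are all hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- height is treated as a property, so it becomes an attribute. -->
<person height="180cm">
  <!-- arm is treated as a constituent part, so it becomes a nested element. -->
  <arm>left</arm>
  <arm>right</arm>
</person>
```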

Assigning data to attributes has other problems too:

Each of these problems is resolved by containing data within an element, rather than assigning it to an attribute. Consequently, it is strongly recommended that you always use XML elements to contain content data, and avoid using attributes for this purpose. The only exception is to include attributes that describe the element for identification purposes to stylesheets or to scripts. For instance, it's useful to have an id attribute identify an element or a class attribute identify a group of elements. The distinction to remember is that values assigned to these attributes are metadata (data describing the element itself), not content data.

<?xml version = "1.0" encoding = "UTF-8" ?>
<doc>
<memo id = "7"
title = "Reminder"
from = "Mike"
msg = "Never use attributes for content data!" >
</memo>
<!-- Becomes...
<memo id = "7">
<title>Reminder</title>
<from>Mike</from>
<msg>Never use attributes for content data!</msg>
</memo>
-->
</doc>
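
The recommended split between metadata and content data might look like the sketch below. The <memos> root element, the class values and the second memo are hypothetical additions for illustration:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<memos>
  <!-- id and class describe the element itself (metadata). -->
  <memo id="7" class="urgent">
    <!-- Content data lives in nested elements. -->
    <title>Reminder</title>
    <from>Mike</from>
    <msg>Never use attributes for content data!</msg>
  </memo>
  <memo id="8" class="routine">
    <title>Thanks</title>
    <from>Mike</from>
    <msg>Elements can hold the content instead.</msg>
  </memo>
</memos>
```

A stylesheet could then pick out one group of memos with a CSS attribute selector such as memo[class="urgent"], without touching the content data.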