The XML specification rules that ensure each XML document is well-formed are primarily concerned with syntax, and place no restrictions on how the elements should be structured or on what content they may contain.
The XML author can, however, provide additional rules to constrain structure and content in a "schema" that defines the legal building blocks of the XML document.
Providing a schema for an XML document brings several benefits:
If the elements in an XML document are not used in strict accordance with its schema rules, validation will declare that document to be invalid.
The schema must define a rule for each element and attribute that appears in the XML document in order for it to be declared valid by a validating XML parser.
Schemas may be written as a Document Type Definition (DTD) or as an XML Schema Definition (XSD).
DTD schemas are created as plain text files with a .dtd file extension. These can then be linked to an XML document by inserting a <!DOCTYPE> declaration after the XML declaration and any processing instructions, before the root element.
Where the DTD schema is only intended for internal private use, the <!DOCTYPE> declaration should have this syntax:
<!DOCTYPE root-element SYSTEM "url">
For instance, to nominate a DTD schema called local.dtd in the same directory as an XML document with a root <doc> element:
<!DOCTYPE doc SYSTEM "local.dtd">
Where the DTD schema is intended for external public usage the <!DOCTYPE> declaration should use the PUBLIC keyword in place of SYSTEM. It should also include a Formal Public Identifier (FPI) like the one below:
"-//OwnerName//DTD FruitVarieties//EN//"
Here the leading "-" indicates that the owner is not registered with a standards body; the FPI can begin with a "+" instead if the DTD schema has been approved by a recognized standards body.
"OwnerName" is the schema author, "FruitVarieties" is a descriptive label that provides a suitable title, and "EN" is the language, defined using a standard two-letter ISO abbreviation.
The fictitious XML document listed below references an external DTD schema located on the auxy.com domain.
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE fruit PUBLIC "-//OwnerName//DTD FruitVarieties//EN//" "http://auxy.com/fruit.dtd" >
<fruit>
<apple>Golden Delicious</apple>
</fruit>
The schema rules for each element in an XML document are defined using <!ELEMENT> declarations in the DTD. These rules specify which elements can be nested within other elements, and the allowable content for each element, using this syntax:
<!ELEMENT element-name allowable-content>
There must be precisely one <!ELEMENT> declaration for each element, and each declaration must contain the element name exactly as it appears in the XML document. Each declaration can specify the allowable content to be any one of the following:
| Content | Description |
| EMPTY | An empty element, such as <pic id="P3" /> |
| ANY | Any content is allowed |
| (#PCDATA) | Text content is allowed |
| (tagname) | A nested child element is allowed |
The term "#PCDATA" stands for Parsed Character DATA: regular text, such as alphanumeric and punctuation characters, that the parser scans for markup and entity references.
Notice that #PCDATA and the tag name of allowed child elements must be surrounded by parentheses in the declaration.
A declaration that allows an element to contain ANY content defeats the purpose of creating a schema against which an XML document can be verified, so is best avoided.
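As a rough illustration of the four content types, the DTD fragment below declares one element of each kind. The element names (pic, notes, caption, figure) are hypothetical and chosen only for this sketch; they do not belong to any schema described here.

```xml
<!-- Hypothetical declarations: one per allowable-content type. -->
<!ELEMENT pic EMPTY>             <!-- must be empty, e.g. <pic /> -->
<!ELEMENT notes ANY>             <!-- any declared elements or text; best avoided -->
<!ELEMENT caption (#PCDATA)>     <!-- text only -->
<!ELEMENT figure (caption)>      <!-- exactly one <caption> child -->
```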
Filename: hello.xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE doc SYSTEM "hello.dtd" >
<doc>
<msg>Hello World!</msg>
<!-- Uncomment the next line to disobey schema rules. -->
<!-- <msg>Bad content</msg> -->
</doc>
Filename: hello.dtd
<!ELEMENT doc (msg) >
<!ELEMENT msg (#PCDATA) >
The first declaration specifies that the root <doc> element may only contain a single <msg> child element.
The second declaration specifies that the <msg> child element may only contain text.
An <!ELEMENT> declaration in a DTD schema can specify a rule that multiple different child elements must be nested within a parent element.
The tag name of each child element to be nested is specified in a comma-separated list, in the same sequential order that the elements should appear in the XML document.
The entire list must be contained within parentheses in the <!ELEMENT> declaration of the parent element. For instance, a rule to have an <apple>, <orange>, and <pear> element nested in sequence within a <fruit> parent element would look like this:
<!ELEMENT fruit (apple, orange, pear)>
This rule defines a sequence of child elements that must each appear exactly once, in the stated order, within the parent <fruit> element. (See memo.xml and memo.dtd for an example.)
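A complete set of declarations for this sequence might look like the sketch below. It assumes, purely for illustration, that each fruit element holds only text; the listing is hypothetical and is not the memo.dtd file mentioned above.

```xml
<!-- Hypothetical DTD: <fruit> must contain apple, orange, pear in that order. -->
<!ELEMENT fruit (apple, orange, pear)>
<!ELEMENT apple (#PCDATA)>
<!ELEMENT orange (#PCDATA)>
<!ELEMENT pear (#PCDATA)>
```

Under these rules a <fruit> element containing <apple>, <orange>, and <pear> in that order is valid, while one that swaps two of them or omits one is not.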
Child elements that are specified in an <!ELEMENT> declaration are normally allowed to occur exactly once within the parent element. The number of allowable occurrences can be changed, however, by adding a special "occurrence indicator" symbol after the child element name in the declaration.
The three possible occurrence indicators are listed below, along with a description of how they affect the rule for that element:
| Indicator | Description |
| + | Allows the element to appear one or more times within the parent element. It must be included, and it can be repeated indefinitely |
| * | Allows the element to appear zero or more times within the parent element. It is optional, but if included it can be repeated indefinitely |
| ? | Allows the element to appear zero or just once within the parent element. It is optional, but if included it cannot be repeated |
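The declaration below sketches all three indicators in one rule. The element names (order, item, note, coupon) are hypothetical and used only for illustration.

```xml
<!-- Hypothetical rule: one or more <item>, then any number of <note>,
     then at most one <coupon>. -->
<!ELEMENT order (item+, note*, coupon?)>
<!ELEMENT item (#PCDATA)>
<!ELEMENT note (#PCDATA)>
<!ELEMENT coupon (#PCDATA)>
```

An indicator can also follow a parenthesized group, in which case it applies to the whole group rather than to a single element.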
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE contacts SYSTEM "contacts.dtd" >
<contacts>
<title>Mr</title>
<forename>John</forename>
<!-- Uncomment the next line to break the schema rules. -->
<!-- <forename>William</forename> -->
<surname>Smith</surname>
<forename>Sally</forename>
<surname>James</surname>
</contacts>
Filename: contacts.dtd
<!ELEMENT contacts (title*, forename, surname )+ >
<!ELEMENT title (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT forename (#PCDATA)>
An element may sometimes be required to allow a choice of child element, so allowable alternatives may be specified within its <!ELEMENT> declaration in a DTD schema.
The alternative child element names are separated by a "|" pipe character, which is often used in programming languages to represent the boolean OR operator. The entire alternative statement must be surrounded by parentheses to indicate that a choice is allowable. Possible alternatives may be single elements, or a sequence of elements contained in parentheses, or even another pair of alternatives contained in their own parentheses.
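The nested forms described above can be sketched as follows. The element names (contact, forename, surname, company, phone, email) are hypothetical and serve only to illustrate the syntax.

```xml
<!-- Hypothetical rule: either a forename followed by a surname, or a single
     company name; then either a phone number or an email address. -->
<!ELEMENT contact (((forename, surname) | company), (phone | email))>
<!ELEMENT forename (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT email (#PCDATA)>
```

Here the first alternative is itself a parenthesized sequence, and the whole choice is followed by a second pair of alternatives in its own parentheses.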
In the XML document listed below the root <doc> element contains a single sequence of <desc> and <image> elements. The schema must allow one <desc> element and one or more <image> elements. Each <image> element must contain one of two child elements: one <src> element or one <alt> element. The <desc>, <src>, and <alt> elements must contain only text.
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE doc SYSTEM "image.dtd" >
<doc>
<desc>The new Dodge Challenger was designed around its muscle car ancestor, but with a twist of modern technology.</desc>
<image>
<src>front-quarter.jpg</src>
<!-- <alt>Exterior shot of the new Dodge Challenger</alt> -->
</image>
<image>
<!-- <src>front-interior.jpg</src> -->
<alt>Inside the leather high-back seats have a sunken ribbed look, just like the seats which came in the 1970 Challenger.</alt>
</image>
</doc>
Filename: image.dtd
<!ELEMENT desc (#PCDATA)>
<!ELEMENT src (#PCDATA)>
<!ELEMENT alt (#PCDATA)>
<!ELEMENT doc (desc, image+)>
<!ELEMENT image (src | alt)>
An element can be allowed to contain attributes to store metadata about that element. Attributes are defined in a DTD schema with an <!ATTLIST> declaration, stating the element tag name, the attribute name, the allowable data type, and the inclusion constraint.
Unlike element content text (#PCDATA), the text values assigned to attributes are described as CDATA (Character DATA).
Attributes whose inclusion is optional are described as #IMPLIED. Alternatively, the inclusion constraint may specify a default value for the attribute, which will be used when no other value is specified in the XML document. This is considered bad practice, however, because a schema's intended purpose is validation only, so stating default attribute values in a schema is best avoided.
An XML element can have multiple attributes if each one is defined in the DTD schema. A single <!ATTLIST> declaration can define multiple attributes, or each attribute can be defined separately in individual <!ATTLIST> declarations.
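The two ways of defining several attributes can be sketched as below. The element names (book, magazine) and attribute names (isbn, lang, edition) are hypothetical and used only to show the two layouts side by side.

```xml
<!-- Form 1: a single <!ATTLIST> declaration defining three attributes. -->
<!ELEMENT book (#PCDATA)>
<!ATTLIST book isbn    CDATA #IMPLIED
               lang    CDATA #IMPLIED
               edition CDATA #IMPLIED>

<!-- Form 2: each attribute defined in its own <!ATTLIST> declaration. -->
<!ELEMENT magazine (#PCDATA)>
<!ATTLIST magazine isbn    CDATA #IMPLIED>
<!ATTLIST magazine lang    CDATA #IMPLIED>
<!ATTLIST magazine edition CDATA #IMPLIED>
```

Both forms give the element the same three optional attributes, so the choice between them is a matter of layout preference.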
The XML document below includes an optional id attribute within most <album> elements - to identify the year of release:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE discography SYSTEM "album.dtd" >
<discography>
<artist>Pink</artist>
<album id="2000">Can't Take Me Home</album>
<album id="2001">Missundaztood</album>
<album id="2003">Try This</album>
<album id="2006">I'm Not Dead</album>
<album>(...in production)</album>
</discography>
Filename: album.dtd
<!-- Specify that the root element must contain a sequence with just one <artist> element and one or more <album> elements. -->
<!ELEMENT discography (artist, album+) >
<!ELEMENT artist (#PCDATA)>
<!ELEMENT album (#PCDATA)>
<!-- Allow each <album> element to optionally have an id attribute containing a text value. -->
<!ATTLIST album id CDATA #IMPLIED>
A DTD schema rule can ensure that an attribute must be included in an element by using the #REQUIRED keyword - in place of the #IMPLIED keyword that allows it to be optional.
Attribute values are not required to be unique unless the schema rule also includes the ID keyword. An ID value must also be a valid XML name, so it cannot begin with a digit.
Specifying that an attribute should be both ID and #REQUIRED is useful for attributes that must be included to contain unique values, such as product identification codes.
The XML document listed below requires an id attribute to be included with every <cactus> element to uniquely identify that variety of cactus:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE doc SYSTEM "cactus.dtd" >
<doc>
<cactus id="AZ1">
<name>Arizona Barrel</name>
</cactus>
<cactus id="AZ2">
<name>Arizona Beehive</name>
</cactus>
<cactus id="AZ3">
<name>Arizona Fishhook</name>
</cactus>
<cactus id="AZ4">
<name>Arizona Hedgehog</name>
</cactus>
<cactus id="AZ5">
<name>Arizona Pincushion</name>
</cactus>
</doc>
Filename: cactus.dtd
<!ELEMENT doc ( cactus+ )>
<!ELEMENT cactus ( name )>
<!ELEMENT name ( #PCDATA )>
<!-- Require each <cactus> element to include a unique identity attribute named "id". -->
<!ATTLIST cactus id ID #REQUIRED >
The <!-- and --> comment tags can be used to add comments within DTD documents - just like those in XML and HTML.
A DTD schema allows you to define your own entity references. These can subsequently appear in any XML document using that schema, both in element content and in attribute values. Although this is not compliant with the view that schemas should be for validation only, it does provide a handy way of creating shorthand abbreviations for lengthy strings of text that are used frequently.
An <!ENTITY> declaration specifies the shorthand reference name and its full text string value. Once declared, the entity reference can be included in the XML document by stating its name, preceded by an "&" ampersand character and followed by a ";" semicolon character. For instance, the entity reference &myname; might refer to a text string entity of "Auxy". When an XML document is parsed, for example by a browser, all entity references are automatically replaced by their text string entity value.
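The &myname; example above could be declared and used as in the sketch below. The <author> element and its surrounding text are hypothetical; only the entity name and its value come from the description above.

```xml
<!-- In the DTD: declare the entity name and its replacement text. -->
<!ENTITY myname "Auxy">
<!ELEMENT author (#PCDATA)>

<!-- In the XML document: reference the entity inside element content. -->
<author>Written by &myname;</author>
```

After parsing, the content of the <author> element reads "Written by Auxy".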
The XML document below has four different entity references to lengthy strings that are all markup language names.
<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/css" href="history.css" ?>
<!DOCTYPE doc SYSTEM "history.dtd" >
<doc>
<para>Both &html; and &xml; are derived from &sgml; which is, in turn, a descendant of the &gml; that was developed in the 1960s by IBM. </para>
</doc>
Filename: history.dtd
<!-- Define the root element. -->
<!-- May contain one child element called para. -->
<!ELEMENT doc (para)>
<!-- Define the child element. -->
<!-- May contain Parsed Character Data. -->
<!ELEMENT para (#PCDATA)>
<!-- Define entity values. -->
<!-- Common markup language acronyms. -->
<!ENTITY html "HyperText Markup Language (HTML)" >
<!ENTITY xml "eXtensible Markup Language (XML)" >
<!ENTITY sgml "Standard Generalized Markup Language (SGML)" >
<!ENTITY gml "Generalized Markup Language (GML)" >
Filename: history.css
para { background: yellow; font-family: verdana, sans-serif; font-size: 3mm; padding: 5px; display: block; width: 400px; }