2. HTML and XML

HTML allows to mark up (or describe) the structure of a human-readable web document or web user interface, while XML allows to mark up the structure of all kinds of documents, data files and messages, whether they are human-readable or not. XML has become the basis for HTML.

2.1. XML documents

XML provides a syntax for expressing structured information in the form of an XML document with nested elements and their attributes. The specific elements and attributes used in an XML document can come from any vocabulary, such as public standards or (private) user-defined XML formats. XML is used for specifying

  • document formats, such as XHTML5, the Scalable Vector Graphics (SVG) format or the DocBook format,

  • data interchange file formats, such as the Mathematical Markup Language (MathML) or the Universal Business Language (UBL),

  • message formats, such as the web service message format SOAP

2.2. Unicode and UTF-8

XML is based on Unicode, which is a platform-independent character set that includes almost all characters from most of the world's script languages including Hindi, Burmese and Gaelic. Each character is assigned a unique integer code in the range between 0 and 1,114,111. For example, the Greek letter π has the code 960, so it can be inserted in an XML document as π using the XML entity syntax.

Unicode includes legacy character sets like ASCII and ISO-8859-1 (Latin-1) as subsets.

The default encoding of an XML document is UTF-8, which uses only a single byte for ASCII characters, but three bytes for less common characters.

Almost all Unicode characters are legal in a well-formed XML document. Illegal characters are the control characters with code 0 through 31, except for the carriage return, line feed and tab. It is therefore dangerous to copy text from another (non-XML) text to an XML document (often, the form feed character creates a problem).

2.3. XML namespaces

Generally, namespaces help to avoid name conflicts. They allow to reuse the same (local) name in different namespace contexts. Many computational languages have some form of namespace concept, for instance, Java and PHP.

XML namespaces are identified with the help of a namespace URI, such as the SVG namespace URI "http://www.w3.org/2000/svg", which is associated with a namespace prefix, such as svg. Such a namespace represents a collection of names, both for elements and attributes, and allows namespace-qualified names of the form prefix:name, such as svg:circle as a namespace-qualified name for SVG circle elements.

A default namespace is declared in the start tag of an element in the following way:

<html xmlns="http://www.w3.org/1999/xhtml">

This example shows the start tag of the HTML root element, in which the XHTML namespace is declared as the default namespace.

The following example shows a namespace declaration for an svg element embedded in an HTML document:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
   ...
  </head>
  <body>
    <figure>
      <figcaption>Figure 1: A blue circle</figcaption>
      <svg:svg xmlns:svg="http://www.w3.org/2000/svg">
        <svg:circle cx="100" cy="100" r="50" fill="blue"/>
      </svg:svg>
    </figure>
  </body>
</html>

2.4. Correct XML documents

XML defines two syntactic correctness criteria. An XML document must be well-formed, and if it is based on a grammar (or schema), then it must also be valid with respect to that grammar, or, in other words, satisfy all rules of the grammar.

An XML document is called well-formed, if it satisfies the following syntactic conditions:

  1. There must be exactly one root element.

  2. Each element has a start tag and an end tag; however, empty elements can be closed as <phone/> instead of <phone></phone>.

  3. Tags don't overlap, e.g. we cannot have

    <author><name>Lee Hong</author></name>
  4. Attribute names are unique within the scope of an element, e.g. the following code is not correct:

    <attachment file="lecture2.html" file="lecture3.html"/>

An XML document is called valid against a particular grammar (such as a DTD or an XML Schema), if

  1. it is well-formed,

  2. and it respects the grammar.

2.5. The evolution of HTML

The World-Wide Web Committee (W3C) has developed the following important versions of HTML:

  • 1997: HTML 4 as an SGML-based language,

  • 2000: XHTML 1 as an XML-based clean-up of HTML 4,

  • 2014: (X)HTML5 in cooperation (and competition) with the WHAT working group supported by browser vendors.

HTML was originally designed as a structure description language, and not as a presentation description language. But HTML4 has a lot of purely presentational elements such as font. XHTML has been taking HTML back to its roots, dropping presentational elements and defining a simple and clear syntax, in support of the goals of

  • device independence,

  • accessibility, and

  • usability.

We adopt the symbolic equation

HTML = HTML5 = XHTML5

stating that when we say "HTML" or "HTML5", we actually mean XHTML5

because we prefer the clear syntax of XML documents over the liberal and confusing HTML4-style syntax that is also allowed by HTML5.

The following simple example shows the basic code template to be used for any HTML document:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <meta charset="UTF-8" />
  <title>XHTML5 Template Example</title>
 </head>
 <body>
  <h1>XHTML5 Template Example</h1>
  <section><h1>First Section Title</h1>
   ...
  </section>
 </body>
</html>

Notice that in line 1, the HTML5 document type is declared, such that browsers are instructed to use the HTML5 document object model (DOM). In the html start tag in line 2, using the default namespace declaration attribute xmlns, the XHTML namespace URI http://www.w3.org/1999/xhtml is declared as the default namespace for making sure that browsers, and other tools, understand that all non-qualified element names like html, head, body, etc. are from the XHTML namespace.

Also in the html start tag, we set the (default) language for the text content of all elements (here to "en" standing for English) using both the xml:lang attribute and the HTML lang attribute. This attribute duplication is a small price to pay for having a hybrid document that can be processed both by HTML and by XML tools.

Finally, in line 4, using an (empty) meta element with a charset attribute, we set the HTML document's character encoding to UTF-8, which is also the default for XML documents.

2.6. HTML forms

For user-interactive web applications, the web browser needs to render a user interface. The traditional metaphor for a software application's user interface is that of a form. The special elements for data input, data output and form actions are called form controls. In HTML, a form element is a section of a web page consisting of block elements that contain controls and labels on those controls.

Users complete a form by entering text into input fields and by selecting items from choice controls. A completed form is submitted with the help of a submit button. When a user submits a form, it is normally sent to a web server either with the HTTP GET method or with the HTTP POST method. The standard encoding for the submission is called URL-encoded. It is represented by the Internet media type application/x-www-form-urlencoded. In this encoding, spaces become plus signs, and any other reserved characters become encoded as a percent sign and hexadecimal digits, as defined in RFC 1738.

Each control has both an initial value and a current value, both of which are strings. The initial value is specified with the control element's value attribute, except for the initial value of a textarea element, which is given by its initial contents. The control's current value is first set to the initial value. Thereafter, the control's current value may be modified through user interaction or scripts. When a form is submitted for processing, some controls have their name paired with their current value and these pairs are submitted with the form.

Labels are associated with a control by including the control as a child element within a label element ("implicit labels"), or by giving the control an id value and referencing this ID in the for attribute of the label element ("explicit labels"). It seems that implicit labels are (in 2015) still not widely supported by CSS libraries and assistive technologies. Therefore, explicit labels may be preferable, despite the fact that they imply quite some overhead by requiring a reference/identifier pair for every labeled HTML form field.

In the simple user interfaces of our "Getting Started" applications, we only need four types of form controls:

  1. single line input fields created with an <input name="..." /> element,

  2. single line output fields created with an <output name="..." /> element,

  3. push buttons created with a <button type="button">...</button> element, and

  4. dropdown selection lists created with a select element of the following form:

    <select name="...">
      <option value="value1"> option1 </option>
      <option value="value2"> option2 </option>
      ...
    </select>

An example of an HTML form with implicit labels for creating such a user interface is

<form id="Book">
  <p><label>ISBN: <output name="isbn" /></label></p>
  <p><label>Title: <input name="title" /></label></p>
  <p><label>Year: <input name="year" /></label></p>
  <p><button type="button">Save</button></p>
</form>

In an HTML-form-based user interface, we have a correspondence between the different kinds of properties defined in the model classes of an app and the form controls used for the input and output of their values. We have to distinguish between various kinds of model class attributes, which are mapped to various kinds of form fields. This mapping is also called data binding.

In general, an attribute of a model class can always be represented in the user interface by a plain input control (with the default setting type="text"), no matter which datatype has been defined as the range of the attribute in the model class. However, in special cases, other types of input controls (for instance, type="date"), or other controls, may be used. For instance, if the attribute's range is an enumeration, a select control or, if the number of possible choices is small enough (say, less than 8), a radio button group can be used.