HTML allows marking up (or describing) the structure of a human-readable web document or web user interface, while XML allows marking up the structure of all kinds of documents, data files and messages, whether they are human-readable or not. XML has also become the basis for XHTML, the XML-based syntax of HTML.
XML provides a syntax for expressing structured information in the form of an XML document with nested elements and their attributes. The specific elements and attributes used in an XML document can come from any vocabulary, such as public standards or (private) user-defined XML formats. XML is used for specifying
document formats, such as XHTML5, the Scalable Vector Graphics (SVG) format or the DocBook format,
data interchange file formats, such as the Mathematical Markup Language (MathML) or the Universal Business Language (UBL),
message formats, such as the web service message format SOAP.
XML is based on Unicode, which is a platform-independent character set that includes almost all characters from most of the world's written languages, including Hindi, Burmese and Gaelic. Each character is assigned a unique integer code in the range between 0 and 1,114,111. For example, the Greek letter π has the code 960, so it can be inserted in an XML document as &#960; using the XML numeric character reference syntax.
Unicode includes legacy character sets like ASCII and ISO-8859-1 (Latin-1) as subsets.
The default encoding of an XML document is UTF-8, which uses a single byte for ASCII characters and two to four bytes for all other characters.
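The relationship between character codes, character references and UTF-8 byte lengths can be checked with a few lines of Python. This is only a sketch: Python is used here as a convenient scratchpad and is not part of the original text, and the sample characters other than π are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Code of the Greek letter pi, as stated in the text.
pi_code = ord("π")

# An XML parser resolves the numeric character reference &#960; to the character.
pi_from_xml = ET.fromstring("<p>&#960;</p>").text

# Number of bytes each sample character occupies in UTF-8.
samples = ["A", "π", "€", "😀"]
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

print(pi_code, pi_from_xml, utf8_lengths)
```

UTF-8 needs between one and four bytes per character, depending on the size of the character's code.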
Almost all Unicode characters are legal in a well-formed XML document. The illegal characters include the control characters with codes 0 through 31, except for the carriage return, line feed and tab. It is therefore risky to copy text from a non-XML source into an XML document (often, the form feed character creates a problem).
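The form feed problem can be reproduced with Python's standard XML parser. The following is a minimal sketch, assuming that text pasted from another source is cleaned before it is placed in an XML document; the helper names is_well_formed and strip_illegal_controls are hypothetical and only handle the control characters named above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def strip_illegal_controls(text):
    """Drop control characters 0-31, except tab, line feed and carriage return."""
    allowed = {"\t", "\n", "\r"}
    return "".join(ch for ch in text if ord(ch) >= 32 or ch in allowed)

# Text copied from a non-XML source, containing a form feed (code 12).
pasted = "Chapter 1\x0cChapter 2"

raw_ok = is_well_formed("<doc>" + pasted + "</doc>")
clean_ok = is_well_formed("<doc>" + strip_illegal_controls(pasted) + "</doc>")
print(raw_ok, clean_ok)
```

Removing the character is only one possible policy; replacing it with a line feed may preserve the intended layout better.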
Generally, namespaces help to avoid name conflicts. They allow reusing the same (local) name in different namespace contexts. Many programming languages have some form of namespace concept, for instance, Java and PHP.
XML namespaces are identified with the help of a namespace URI, such as the SVG namespace URI "http://www.w3.org/2000/svg", which is associated with a namespace prefix, such as svg. Such a namespace represents a collection of names, both for elements and attributes, and allows namespace-qualified names of the form prefix:name, such as svg:circle as a namespace-qualified name for SVG circle elements.
A default namespace is declared in the start tag of an element in the following way:
<html xmlns="http://www.w3.org/1999/xhtml">
This example shows the start tag of the HTML root element, in which the XHTML namespace is declared as the default namespace.
The following example shows a namespace declaration for an svg element embedded in an HTML document:
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  ...
 </head>
 <body>
  <figure>
   <figcaption>Figure 1: A blue circle</figcaption>
   <svg:svg xmlns:svg="http://www.w3.org/2000/svg">
    <svg:circle cx="100" cy="100" r="50" fill="blue"/>
   </svg:svg>
  </figure>
 </body>
</html>
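How a namespace-aware tool resolves the default namespace and the svg prefix can be sketched with Python's ElementTree module. This is an illustration only: the document string is an abridged copy of the example (the head element is omitted), and ElementTree's "{URI}name" notation is that library's convention, not part of XML itself.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

doc = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <figure>
   <figcaption>Figure 1: A blue circle</figcaption>
   <svg:svg xmlns:svg="http://www.w3.org/2000/svg">
    <svg:circle cx="100" cy="100" r="50" fill="blue"/>
   </svg:svg>
  </figure>
 </body>
</html>"""

root = ET.fromstring(doc)

# ElementTree expands each element name to "{namespace URI}local name".
circle = root.find(".//{" + SVG_NS + "}circle")
print(root.tag)     # {http://www.w3.org/1999/xhtml}html
print(circle.tag)   # {http://www.w3.org/2000/svg}circle

# Attributes without a prefix belong to no namespace, so their names are unchanged.
print(circle.attrib)
```

The prefix svg itself does not appear in the parsed names: only the namespace URI identifies the namespace, so any other prefix bound to the same URI would yield the same result.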
XML defines two syntactic correctness criteria. An XML document must be well-formed, and if it is based on a grammar (or schema), then it must also be valid with respect to that grammar, or, in other words, satisfy all rules of the grammar.
An XML document is called well-formed if it satisfies the following syntactic conditions:
There must be exactly one root element.
Each element has a start tag and an end tag; however, empty elements can be closed as <phone/> instead of <phone></phone>.
Tags don't overlap, e.g. we cannot have
<author><name>Lee Hong</author></name>
Attribute names are unique within the scope of an element, e.g. the following code is not correct:
<attachment file="lecture2.html" file="lecture3.html"/>
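The conditions above can be tried out with any XML parser, since a conforming parser must reject documents that are not well-formed. The following minimal sketch uses Python's ElementTree; the helper name parse_error and the case labels are hypothetical, and the exact error messages depend on the parser.

```python
import xml.etree.ElementTree as ET

def parse_error(xml_text):
    """Return None if xml_text is well-formed, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        return str(err)

cases = {
    "two root elements": "<a/><b/>",
    "empty element": "<phone/>",
    "overlapping tags": "<author><name>Lee Hong</author></name>",
    "duplicate attribute": '<attachment file="lecture2.html" file="lecture3.html"/>',
}

results = {label: parse_error(text) for label, text in cases.items()}
for label, error in results.items():
    print(label + ": " + ("well-formed" if error is None else error))
```

Only the empty element case is accepted; the other three violate one of the listed conditions each.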
An XML document is called valid with respect to a particular grammar (such as a DTD or an XML Schema) if
it is well-formed,
and it respects the grammar.
The World Wide Web Consortium (W3C) has developed the following important versions of HTML:
1997: HTML 4 as an SGML-based language,
2000: XHTML 1 as an XML-based clean-up of HTML 4,
2014: (X)HTML5 in cooperation (and competition) with the Web Hypertext Application Technology Working Group (WHATWG), which is supported by browser vendors.
HTML was originally designed as a structure description language, and not as a presentation description language. But HTML4 has a lot of purely presentational elements such as font. XHTML has taken HTML back to its roots, dropping presentational elements and defining a simple and clear syntax, in support of the goals of
device independence,
accessibility, and
usability.
We adopt the symbolic equation
HTML = HTML5 = XHTML5
stating that when we say "HTML" or "HTML5", we actually mean XHTML5, because we prefer the clear syntax of XML documents over the liberal and confusing HTML4-style syntax that is also allowed by HTML5.
The following simple example shows the basic code template to be used for any HTML document:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <meta charset="UTF-8" />
  <title>XHTML5 Template Example</title>
 </head>
 <body>
  <h1>XHTML5 Template Example</h1>
  <section><h1>First Section Title</h1>
   ...
  </section>
 </body>
</html>
Notice that in line 1, the HTML5 document type is declared, so that browsers are instructed to use the HTML5 document object model (DOM). In the html start tag in line 2, the default namespace declaration attribute xmlns declares the XHTML namespace URI http://www.w3.org/1999/xhtml as the default namespace. This makes sure that browsers, and other tools, understand that all non-qualified element names like html, head, body, etc. are from the XHTML namespace.
Also in the html start tag, we set the (default) language for the text content of all elements (here to "en" standing for English) using both the xml:lang attribute and the HTML lang attribute. This attribute duplication is a small price to pay for having a hybrid document that can be processed both by HTML and by XML tools.
Finally, in line 4, using an (empty) meta element with a charset attribute, we set the HTML document's character encoding to UTF-8, which is also the default for XML documents.
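The hybrid nature of the template can be illustrated by feeding it to both an XML parser and an HTML parser. The sketch below uses two modules from Python's standard library as stand-ins for "XML tools" and "HTML tools"; the class name AttrCollector is hypothetical, and the template is abridged (the section element is left out).

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

template = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <meta charset="UTF-8" />
  <title>XHTML5 Template Example</title>
 </head>
 <body>
  <h1>XHTML5 Template Example</h1>
 </body>
</html>"""

# XML view: the document is well-formed, and xml:lang is an attribute
# from the predefined XML namespace.
XML_NS = "http://www.w3.org/XML/1998/namespace"
root = ET.fromstring(template)
xml_lang = root.get("{" + XML_NS + "}lang")

# HTML view: an HTML parser reads the plain lang and charset attributes.
class AttrCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.seen = {}

    def handle_starttag(self, tag, attrs):
        if tag in ("html", "meta"):
            self.seen[tag] = dict(attrs)

collector = AttrCollector()
collector.feed(template)
collector.close()
html_lang = collector.seen["html"].get("lang")
charset = collector.seen["meta"].get("charset")

print(xml_lang, html_lang, charset)
```

Both views report the same language, which is the purpose of duplicating the attribute.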
For user-interactive web applications, the web browser needs to render a user interface.
The traditional metaphor for a software application's user interface is that of a form. The special elements for data input, data output and form actions are called form controls. In HTML, a form element is a section of a web page consisting of block elements that contain controls and labels on those controls.
Users complete a form by entering text into input fields and by selecting items from choice controls. A completed form is submitted with the help of a submit button. When a user submits a form, it is normally sent to a web server either with the HTTP GET method or with the HTTP POST method. The standard encoding for the submission is called URL-encoded. It is represented by the Internet media type application/x-www-form-urlencoded. In this encoding, spaces become plus signs, and any other reserved characters become encoded as a percent sign and hexadecimal digits, as defined in RFC 1738.
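The URL-encoded format can be reproduced with Python's urllib.parse module, which implements the same encoding rules. This is a sketch for illustration: the form data is a hypothetical sample, and a browser performs this step itself when submitting a form.

```python
from urllib.parse import urlencode, parse_qs

# Name/value pairs of a submitted form (hypothetical sample data).
form_data = {
    "title": "Weaving the Web",
    "author": "Berners-Lee & Fischetti",
    "year": "2000",
}

# Spaces become plus signs; the reserved character & becomes %26.
encoded = urlencode(form_data)
print(encoded)
# title=Weaving+the+Web&author=Berners-Lee+%26+Fischetti&year=2000

# A server decodes the string back into names and (lists of) values.
decoded = parse_qs(encoded)
print(decoded["author"])
```

With the GET method, this string is appended to the URL after a question mark; with the POST method, it is sent in the body of the HTTP request.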
Each control has both an initial value and a current value, both of which are strings. The initial value is specified with the control element's value attribute, except for the initial value of a textarea element, which is given by its initial contents. The control's current value is first set to the initial value. Thereafter, the control's current value may be modified through user interaction or scripts. When a form is submitted for processing, some controls have their name paired with their current value and these pairs are submitted with the form.
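The distinction between initial and current values can be made concrete with a small model. The sketch below, under the assumption that the form is available as well-formed XML, reads the initial values from the markup and then simulates a user edit; the form content and the function name initial_values are hypothetical.

```python
import xml.etree.ElementTree as ET

form_xml = """<form>
 <input name="title" value="Untitled" />
 <input name="year" />
 <textarea name="summary">No summary yet.</textarea>
</form>"""

def initial_values(form):
    """Map each control name to its initial value (always a string)."""
    values = {}
    for control in form.iter():
        name = control.get("name")
        if name is None:
            continue
        if control.tag == "textarea":
            values[name] = control.text or ""        # initial contents
        else:
            values[name] = control.get("value", "")  # value attribute
    return values

form = ET.fromstring(form_xml)
initial = initial_values(form)

# The current values start as a copy of the initial values ...
current = dict(initial)
# ... and then change through user interaction (simulated here).
current["year"] = "2015"

print(initial)
print(current)
```

On submission, it is the name/value pairs in current, not those in initial, that would be sent.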
Labels are associated with a control by including the control as a child element within a label element ("implicit labels"), or by giving the control an id value and referencing this ID in the for attribute of the label element ("explicit labels"). It seems that implicit labels are (in 2015) still not widely supported by CSS libraries and assistive technologies. Therefore, explicit labels may be preferable, despite the fact that they imply quite some overhead by requiring a reference/identifier pair for every labeled HTML form field.
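The overhead of explicit labels lies in keeping every for reference in sync with a control id. The following sketch shows a form with explicit labels and a check that reports labels whose reference does not resolve; the form content, the set CONTROL_TAGS and the function name unresolved_labels are hypothetical, and the missing id on the output element is deliberate.

```python
import xml.etree.ElementTree as ET

form_xml = """<form id="Book">
 <p><label for="title">Title:</label> <input id="title" name="title" /></p>
 <p><label for="year">Year:</label> <input id="year" name="year" /></p>
 <p><label for="isbn">ISBN:</label> <output name="isbn" /></p>
</form>"""

CONTROL_TAGS = {"input", "output", "select", "textarea", "button"}

def unresolved_labels(form):
    """Return the 'for' values of labels that match no control id."""
    ids = {el.get("id") for el in form.iter()
           if el.tag in CONTROL_TAGS and el.get("id")}
    return [label.get("for") for label in form.iter("label")
            if label.get("for") not in ids]

form = ET.fromstring(form_xml)
broken = unresolved_labels(form)
print(broken)   # ['isbn'], because the output element has no id
```

With implicit labels, this kind of mismatch cannot occur, since the association follows from the nesting of the elements.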
In the simple user interfaces of our "Getting Started" applications, we only need four types of form controls:
single line input fields created with an <input name="..." /> element,
single line output fields created with an <output name="..." /> element,
push buttons created with a <button type="button">...</button> element, and
dropdown selection lists created with a select element of the following form:
<select name="...">
 <option value="value1"> option1 </option>
 <option value="value2"> option2 </option>
 ...
</select>
An example of an HTML form with implicit labels for creating such a user interface is
<form id="Book">
 <p><label>ISBN: <output name="isbn" /></label></p>
 <p><label>Title: <input name="title" /></label></p>
 <p><label>Year: <input name="year" /></label></p>
 <p><button type="button">Save</button></p>
</form>
In an HTML-form-based user interface, we have a correspondence between the different kinds of properties defined in the model classes of an app and the form controls used for the input and output of their values. We have to distinguish between various kinds of model class attributes, which are mapped to various kinds of form fields. This mapping is also called data binding.
In general, an attribute of a model class can always be represented in the user interface by a plain input control (with the default setting type="text"), no matter which datatype has been defined as the range of the attribute in the model class. However, in special cases, other types of input controls (for instance, type="date"), or other controls, may be used. For instance, if the attribute's range is an enumeration, a select control or, if the number of possible choices is small enough (say, fewer than 8), a radio button group can be used.
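The mapping from attribute ranges to form controls can be sketched as a simple decision function. This is an illustration under assumptions, not a prescribed implementation: the function name control_for, the range names "string" and "date", and the generated markup are hypothetical, while the threshold of 8 follows the rule of thumb above (which also allows a select control for small enumerations).

```python
from html import escape

def control_for(name, range_type, choices=None):
    """Return XHTML markup for a form control bound to a model attribute."""
    n = escape(name)
    if choices is not None:                      # the range is an enumeration
        if len(choices) < 8:                     # few choices: radio button group
            return "".join(
                f'<label><input type="radio" name="{n}" value="{escape(c)}" /> '
                f'{escape(c)}</label>'
                for c in choices)
        options = "".join(
            f'<option value="{escape(c)}">{escape(c)}</option>' for c in choices)
        return f'<select name="{n}">{options}</select>'
    if range_type == "date":                     # special case: date input
        return f'<input type="date" name="{n}" />'
    return f'<input type="text" name="{n}" />'   # plain input always works

print(control_for("title", "string"))
print(control_for("published", "date"))
print(control_for("language", "enum", ["en", "de", "fr"]))
```

Falling back to a plain text input for every unknown range mirrors the statement that such a control can represent any attribute.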