HTML lets authors mark up (or describe) the structure of a human-readable web document or web user interface, while XML lets them mark up the structure of all kinds of documents, data files and messages, whether they are human-readable or not. XML can also be used as the basis for defining a version of HTML that is called XHTML.
XML provides a syntax for expressing structured information in the form of an XML document with nested elements and their attributes. The specific elements and attributes used in an XML document can come from any vocabulary, such as public standards or (private) user-defined XML formats. XML is used for specifying
document formats, such as XHTML5, the Scalable Vector Graphics (SVG) format or the DocBook format,
data interchange file formats, such as the Mathematical Markup Language (MathML) or the Universal Business Language (UBL),
message formats, such as the web service message format SOAP.
XML is based on Unicode, which is a platform-independent character set that includes almost all characters from most of the world's writing systems, including those used for Hindi, Burmese and Gaelic. Each character is assigned a unique integer code in the range from 0 to 1,114,111. For example, the Greek letter π has the code 960, so it can be inserted in an XML document as &#960; using the XML character reference syntax.
Unicode includes legacy character sets like ASCII and ISO-8859-1 (Latin-1) as subsets.
The default encoding of an XML document is UTF-8, which uses only a single byte for ASCII characters, but two to four bytes for all other characters.
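The relationship between characters, code points and UTF-8 byte lengths can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name utf8_length and the sample characters other than π are illustrative choices, not taken from the text.

```python
import sys
import xml.etree.ElementTree as ET


def utf8_length(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


# The code point of the Greek letter pi, as stated in the text.
pi_code = ord("\u03c0")

# A numeric character reference is resolved by the XML parser.
pi_from_reference = ET.fromstring("<p>&#960;</p>").text

# Illustrative sample characters: Latin A, Greek pi, euro sign, an emoji.
samples = ["A", "\u03c0", "\u20ac", "\U0001F600"]
byte_lengths = [utf8_length(ch) for ch in samples]

print(pi_code, pi_from_reference == "\u03c0", byte_lengths, sys.maxunicode)
```

The last printed value is the largest code point that Python accepts, which corresponds to the upper bound of the Unicode code range.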
Almost all Unicode characters are legal in a well-formed XML document. Among the illegal characters are the control characters with codes 0 through 31, except for the tab (9), the line feed (10) and the carriage return (13). It is therefore risky to copy text from a non-XML source into an XML document; the form feed character (code 12) is a frequent culprit.
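Before pasting foreign text into an XML document, it can be scanned for such control characters. The following sketch assumes the rule as stated above (codes below 32 are illegal, except tab, line feed and carriage return); the function name and the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tab, line feed and carriage return are the only legal codes below 32.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def illegal_control_chars(text: str) -> list:
    """Return (position, code) pairs for control characters not allowed in XML."""
    return [
        (pos, ord(ch))
        for pos, ch in enumerate(text)
        if ord(ch) < 32 and ord(ch) not in ALLOWED_CONTROLS
    ]


# Hypothetical text copied from a word processor, containing a form feed.
pasted = "Chapter 1\x0cChapter 2"
problems = illegal_control_chars(pasted)

# The XML parser rejects a document that contains the form feed.
try:
    ET.fromstring("<note>" + pasted + "</note>")
    parser_accepts = True
except ET.ParseError:
    parser_accepts = False

print(problems, parser_accepts)
```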
Generally, namespaces help to avoid name conflicts. They allow reusing the same (local) name in different namespace contexts. Many computational languages, for instance Java and PHP, have some form of namespace concept.
XML namespaces are identified with the help of a namespace URI, such as the SVG namespace URI "http://www.w3.org/2000/svg", which is associated with a namespace prefix, such as svg. Such a namespace represents a collection of names, both for elements and attributes, and allows namespace-qualified names of the form prefix:name, such as svg:circle as a namespace-qualified name for SVG circle elements.
A default namespace is declared in the start tag of an element in the following way:
<html xmlns="http://www.w3.org/1999/xhtml">
This example shows the start tag of the HTML root element, in which the XHTML namespace is declared as the default namespace.
The following example shows an SVG namespace declaration for an svg element embedded in an HTML document:
<html xmlns="http://www.w3.org/1999/xhtml">
  <head> ... </head>
  <body>
    <figure>
      <figcaption>Figure 1: A blue circle</figcaption>
      <svg:svg xmlns:svg="http://www.w3.org/2000/svg">
        <svg:circle cx="100" cy="100" r="50" fill="blue"/>
      </svg:svg>
    </figure>
  </body>
</html>
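How a namespace-aware parser resolves prefixed and unprefixed names can be observed with Python's standard ElementTree module, which expands every element name to the form {namespace URI}local-name. The following is a sketch on a shortened copy of the document above, not a statement about how browsers process it.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Shortened copy of the example document (head and figcaption left out).
DOCUMENT = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <figure>
      <svg:svg xmlns:svg="http://www.w3.org/2000/svg">
        <svg:circle cx="100" cy="100" r="50" fill="blue"/>
      </svg:svg>
    </figure>
  </body>
</html>"""

root = ET.fromstring(DOCUMENT)

# Unprefixed names belong to the default (XHTML) namespace,
# names with the svg prefix belong to the SVG namespace.
figure = root.find(".//{" + XHTML_NS + "}figure")
circle = root.find(".//{" + SVG_NS + "}circle")

print(root.tag)
print(circle.tag)
print(circle.get("r"))  # unprefixed attributes are not in any namespace
```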
XML defines two syntactic correctness criteria. An XML document must be well-formed, and if it is based on a grammar (or schema), then it must also be valid with respect to that grammar, or, in other words, satisfy all rules of the grammar.
An XML document is called well-formed if it satisfies the following syntactic conditions:
There must be exactly one root element.
Each element has a start tag and an end tag; however, empty elements can be closed as <phone/> instead of <phone></phone>.
Tags don't overlap. For instance, we cannot have
<author><name>Lee Hong</author></name>
Attribute names are unique within the scope of an element. For instance, the following code is not correct:
<attachment file="lecture2.html" file="lecture3.html"/>
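These conditions can be tested mechanically, because any XML parser has to reject a document that is not well-formed. The sketch below uses Python's standard ElementTree parser; the helper name is_well_formed and the two-root sample are illustrative additions, while the other samples are taken from the list above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = {
    "empty element": "<phone/>",
    "two root elements": "<a/><b/>",
    "overlapping tags": "<author><name>Lee Hong</author></name>",
    "duplicate attribute": '<attachment file="lecture2.html" file="lecture3.html"/>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, "->", "well-formed" if ok else "not well-formed")
```

Note that this only checks well-formedness. Checking validity against a DTD or an XML Schema requires a validating parser, which the Python standard library does not provide.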
An XML document is called valid against a particular grammar (such as a DTD or an XML Schema) if it is well-formed and it respects the grammar.
The World Wide Web Consortium (W3C) has developed the following important versions of HTML:
2000: XHTML 1 as an XML-based clean-up of HTML 4,
2014: (X)HTML 5 in cooperation (and competition) with the Web Hypertext Application Technology Working Group (WHATWG), which is supported by browser vendors.
HTML was originally designed as a structure description language, and not as a presentation description language. But HTML4 has a lot of purely presentational elements such as font. XHTML has been taking HTML back to its roots, dropping presentational elements and defining a simple and clear syntax, in support of the goals of device independence, accessibility, and usability.
We adopt the symbolic equation
HTML = HTML5 = XHTML5
stating that when we say "HTML" or "HTML5", we actually mean XHTML5, because we prefer the clear syntax of XML documents over the liberal and confusing HTML4-style syntax that is also allowed by HTML5.
The following simple example shows the basic code template to be used for any HTML document:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>XHTML5 Template Example</title>
  </head>
  <body>
    <h1>XHTML5 Template Example</h1>
    <section><h1>First Section Title</h1>
      ...
    </section>
  </body>
</html>
Notice that in line 1, the HTML5 document type is declared, so that browsers are instructed to use the HTML5 document object model (DOM). In the html start tag in line 2, using the default namespace declaration attribute xmlns, the XHTML namespace URI http://www.w3.org/1999/xhtml is declared as the default namespace for making sure that browsers, and other tools, understand that all non-qualified element names like html, head, body, etc. are from the XHTML namespace.
Also in the html start tag, we set the (default) language for the text content of all elements (here to "en" standing for English) using both the xml:lang attribute and the HTML lang attribute. This attribute duplication is a small price to pay for having a hybrid document that can be processed both by HTML and by XML tools.
Finally, in line 4, using an (empty) meta element with a charset attribute, we set the HTML document's character encoding to UTF-8, which is also the default for XML documents.
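The hybrid nature of such a document can be illustrated by feeding the same text to an XML parser and to an HTML parser. The sketch below uses the ElementTree and html.parser modules from the Python standard library on a shortened copy of the template; the class name TagCollector is hypothetical, and the behavior of these two parsers is only an approximation of what browsers and other tools do.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Shortened copy of the template (the section element is left out).
TEMPLATE = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>XHTML5 Template Example</title>
  </head>
  <body>
    <h1>XHTML5 Template Example</h1>
  </body>
</html>"""

# The xml prefix is bound to this reserved namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# 1) XML view: the document is well-formed, and both language attributes are visible.
root = ET.fromstring(TEMPLATE)
xml_lang = root.get("{" + XML_NS + "}lang")
html_lang = root.get("lang")


# 2) HTML view: the same text is accepted by an HTML parser.
class TagCollector(HTMLParser):
    """Collect the names of all start tags in document order."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


collector = TagCollector()
collector.feed(TEMPLATE)
collector.close()

print(root.tag, xml_lang, html_lang)
print(collector.start_tags)
```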
For user-interactive web applications, the web browser needs to render a user interface (UI). The traditional metaphor for a software application's UI is that of a form. The special elements for data input, data output and user actions are called form controls or UI widgets. In HTML, a form element is a section of a web page consisting of block elements that contain form controls and labels on those controls.
Users complete a form by entering text into input fields and by selecting items from choice controls, including dropdown selection lists, radio button groups and checkbox groups. A completed form is submitted with the help of a submit button. When a user submits a form, it is normally sent to a web server either with the HTTP GET method or with the HTTP POST method.
The standard encoding for the submission is called URL-encoded. It is represented by the Internet media type application/x-www-form-urlencoded. In this encoding, spaces become plus signs, and any other reserved characters become encoded as a percent sign and hexadecimal digits, as defined in RFC 1738.
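The effect of this encoding can be reproduced with the urlencode function from Python's urllib.parse module, which produces application/x-www-form-urlencoded strings. The field names and values in this sketch are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical form data: field names paired with their current values.
form_data = {
    "title": "Weaving the Web",
    "keywords": "html & xml",
    "discount": "50%",
}

# Spaces become plus signs; reserved characters such as & and %
# become a percent sign followed by two hexadecimal digits.
encoded = urlencode(form_data)
print(encoded)

# A server-side decoder recovers the original name/value pairs.
decoded = parse_qs(encoded)
print(decoded["keywords"][0])
```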
Each form control has both an initial value and a current value, both of which are strings. The initial value is specified with the control element's value attribute, except for the initial value of a textarea element, which is given by its initial contents. The control's current value is first set to the initial value. Thereafter, the control's current value may be modified through user interaction or scripts. When a form is submitted for processing, some controls have their name paired with their current value and these pairs are submitted with the form.
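The distinction between initial and current values can be simulated outside a browser. The following sketch reads the initial values from a hypothetical form with Python's html.parser module, then changes one current value in the way a user or script might, and finally pairs names with current values for submission. The form markup, the class name InitialValueCollector and the sample values are assumptions made for illustration; a real browser applies further rules when deciding which controls are submitted.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical form markup with two input fields and one textarea.
FORM = """<form>
  <input name="title" value="Weaving the Web" />
  <textarea name="summary">A memoir.</textarea>
  <input name="year" />
</form>"""


class InitialValueCollector(HTMLParser):
    """Collect the initial value of each named input and textarea control."""

    def __init__(self):
        super().__init__()
        self.initial_values = {}
        self._open_textarea = None  # name of the textarea being read, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is None:
            return
        if tag == "input":
            # The initial value comes from the value attribute.
            self.initial_values[name] = attributes.get("value") or ""
        elif tag == "textarea":
            # The initial value comes from the element's contents.
            self._open_textarea = name
            self.initial_values[name] = ""

    def handle_data(self, data):
        if self._open_textarea is not None:
            self.initial_values[self._open_textarea] += data

    def handle_endtag(self, tag):
        if tag == "textarea":
            self._open_textarea = None


collector = InitialValueCollector()
collector.feed(FORM)
collector.close()

initial = collector.initial_values
current = dict(initial)      # current values start as the initial values
current["year"] = "1999"     # simulated user input
submitted = urlencode(current)
print(initial)
print(submitted)
```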
Labels are associated with a control by including the control as a child element within a label element (implicit labels), or by giving the control an id value and referencing this ID in the for attribute of the label element (explicit labels).
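Both kinds of association can be resolved programmatically. The sketch below parses a hypothetical form fragment in XML syntax with Python's ElementTree module and maps each label text to the name of its control; the id value yearField, the function name labelled_controls and the set of control element names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Element names treated as form controls in this sketch.
CONTROL_TAGS = {"input", "output", "select", "textarea", "button"}

# Hypothetical form: the first label is implicit, the second one explicit.
FORM = """<form id="Book">
  <p><label>Title: <input name="title" /></label></p>
  <p><label for="yearField">Year:</label> <input id="yearField" name="year" /></p>
</form>"""


def labelled_controls(form_xml: str) -> dict:
    """Map each label text to the name attribute of the associated control."""
    form = ET.fromstring(form_xml)
    controls_by_id = {
        el.get("id"): el
        for el in form.iter()
        if el.tag in CONTROL_TAGS and el.get("id")
    }
    result = {}
    for label in form.iter("label"):
        target_id = label.get("for")
        if target_id is not None:
            # Explicit label: look up the control by its id.
            control = controls_by_id.get(target_id)
        else:
            # Implicit label: take the first control nested in the label.
            control = next(
                (el for el in label.iter() if el.tag in CONTROL_TAGS), None
            )
        if control is not None:
            text = (label.text or "").strip().rstrip(":")
            result[text] = control.get("name")
    return result


associations = labelled_controls(FORM)
print(associations)
```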
In the simple user interfaces of our "Getting Started" applications, we only need four types of form controls:
single line input fields created with an <input name="..." /> element,
single line output fields created with an <output name="..." /> element,
push buttons created with a <button type="button">...</button> element, and
dropdown selection lists created with a select element of the following form:
<select name="...">
  <option value="value1"> option1 </option>
  <option value="value2"> option2 </option>
  ...
</select>
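When the options come from data, a select element of this form can be generated rather than written by hand. The following sketch builds the markup with Python's ElementTree module; the function name render_select and the sample options are hypothetical.

```python
import xml.etree.ElementTree as ET


def render_select(name: str, options: dict) -> str:
    """Build a select element with one option per (value, text) pair."""
    select = ET.Element("select", {"name": name})
    for value, text in options.items():
        option = ET.SubElement(select, "option", {"value": value})
        option.text = text
    return ET.tostring(select, encoding="unicode")


# Illustrative option values and display texts.
markup = render_select("language", {"en": "English", "de": "German"})
print(markup)
```

Building the element tree instead of concatenating strings has the advantage that special characters in option texts are escaped by the serializer.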
An example of an HTML form with implicit labels for creating such a user interface is
<form id="Book">
  <p><label>ISBN: <output name="isbn" /></label></p>
  <p><label>Title: <input name="title" /></label></p>
  <p><label>Year: <input name="year" /></label></p>
  <p><button type="button">Save</button></p>
</form>
In an HTML-form-based data management user interface, we have a correspondence between the different kinds of properties defined in the model classes of an app and the form controls used for the input and output of their values. We have to distinguish between various kinds of model class attributes, which are mapped to various kinds of form fields. This mapping is also called data binding.
In general, an attribute of a model class can always be represented in the user interface by a plain input control (with the default setting type="text"), no matter which datatype has been defined as the range of the attribute in the model class. However, in special cases, other types of input controls (for instance, type="date"), or other widgets, may be used. For instance, if the attribute's range is an enumeration, a select control or, if the number of possible choices is small enough (say, less than 8), a radio button group can be used.
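This choice of a control based on an attribute's range can be written down as a simple decision function. The sketch below models ranges as Python types and enumerations; the function name choose_control, the Genre and Country enumerations and the returned descriptions are illustrative assumptions, and the threshold of 8 is the one suggested above.

```python
import datetime
from enum import Enum

# Below this number of choices, a radio button group is preferred.
RADIO_GROUP_LIMIT = 8


class Genre(Enum):
    """Hypothetical enumeration with only a few choices."""
    NOVEL = 1
    BIOGRAPHY = 2
    TEXTBOOK = 3


# Hypothetical enumeration with ten choices, created with the functional API.
Country = Enum("Country", ["C" + str(i) for i in range(10)])


def choose_control(attribute_range: type) -> str:
    """Suggest a form control for an attribute with the given range."""
    if issubclass(attribute_range, Enum):
        if len(attribute_range) < RADIO_GROUP_LIMIT:
            return "radio button group"
        return "select"
    if attribute_range is datetime.date:
        return 'input type="date"'
    # Any other range can be handled by a plain text input field.
    return 'input type="text"'


for attribute_range in (str, int, datetime.date, Genre, Country):
    print(attribute_range.__name__, "->", choose_control(attribute_range))
```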