HTML5 is a format for a text file, which makes pretty stuff happen to the text when viewed in an HTML5 browser. The text file may be encoded in many ways, but the most versatile and broadly accepted format in my experience is UTF-8.

I wrote this article to explain the differences between an element, a tag, and a character reference, and to be sure I understood the syntax down to the minutiae. Don't panic. Learning this stuff mostly just takes practice and frequent little reminders.

HTML5 special characters

HTML5 can be learned a single byte or character at a time. Quite literally. Take a look at the special characters of HTML5 (happily, the same special characters are found in many related markup languages too).

Chevrons < >
Left and right angle brackets, also known as left and right chevrons, also known as less than sign and greater than sign. These two characters play an important role in HTML5 and many predecessor file formats. Also, I hear that chevrons actually came before parentheses in written English.
Spaces
Runs of space characters (carriage return, line feed, tab, and space) are interpreted as a single space in the browser, with few exceptions. The main benefit is that space can primarily be used for readability of the source, with minimal consideration for browser presentation. In cute summary, I just say that space no longer has a plural: space or not to space, that is the question. How much space is only limited by my real estate (bandwidth/storage) and my spatial discretion.
Ampersand &
The ampersand character came from a ligature of et in Latin which means and. Like chevrons, the ampersand takes on special meaning in HTML5 documents.
Quote (single/double) characters ' "
A matching pair of quotes (single/double) may wrap an attribute value.
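
The Spaces rule above can be approximated in a few lines. This is only a minimal sketch: it assumes the space characters are space, tab, line feed, form feed, and carriage return, and it ignores the exceptions (pre elements, for instance). The name collapseSpace is hypothetical, not a browser API.

```javascript
// Hypothetical helper: collapse each run of space characters into one space,
// roughly the way a browser presents ordinary text.
function collapseSpace(text) {
	return String(text).replace(/[ \t\n\f\r]+/g, " ");
}

console.log(collapseSpace("a \n\t b\r\n\r\nc")); // "a b c"
```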

HTML5 escapes

HTML5 gives special meaning to some common characters to enhance expressiveness, but what about when I want to express the regular meaning of those special characters? How do I show < > in HTML5 documents? To allow full freedom of expression, I need some way to escape special characters.

Character reference (HTML5 entity)
Starts with a & character and ends with a ; character. Between start and end is a pre-defined name or numeric (decimal/hexadecimal) value. Valid character references are interpreted as the character they represent, except they will be ignored in raw text elements (script, style), and other methods of escaping (comment tag, CDATA tag).
  • Use character references to avoid chevrons and ampersand taking on their special meaning. To represent < > &, use &lt; &gt; &amp; character references, respectively.
  • Use character references to avoid quotes (single/double) taking on special meaning to end an attribute value. To represent " ', use &quot; &#39; character references, respectively.
  • Besides these most essential character references, I can use character references when the file encoding does not support a character directly.
  • I can use character references for characters not supported by my editor such as umlauted characters ö Ö via &ouml; &Ouml;
  • All Unicode characters such as Σ summation symbol can be escaped as a character reference via decimal Unicode &#931;, or hexadecimal Unicode &#x3a3;. These numeric character references are Unicode only, and ignore the code page (encoding).
  • I can use character references for convenience or fun! To represent ♥ ♦ ♣ ♠ playing card suits, I can use &hearts; &diams; &clubs; &spades;
    
    ♜♞♝♛♚♝♞♜
    ♟♟♟♟♟♟♟♟
    
    
    
    ♙♙♙♙♙♙♙♙
    ♖♘♗♕♔♗♘♖
    
    &#x265A;—&#x265F; and &#x2654;—&#x2659;
Comment tag
Starts with <!-- and ends with the first --> after the start. A comment escapes the special meaning of all special characters falling between the start and end of the comment tag. The commented text is also hidden from normal parsing and display. Comment tags cannot nest within comment tags or any other tags. Use comment tags to:
  • Quickly hide large sections of HTML5 which may or may not include special characters, namely < > & " ' left chevron, right chevron, ampersand, double quote, and single quote.
  • Make public notes which have no display value except when viewing the HTML5 code directly.
CDATA tag
Starts with <![CDATA[ and ends with the first ]]> to follow the start. I can use CDATA tags to escape < > & " ' special characters within foreign elements (math or svg elements). CDATA tags capture pure text in the document's encoding, ignoring all apparent character references and chevrons within the tag.
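
The decimal and hexadecimal character references described earlier can be derived from a character's Unicode code point. Here is a minimal sketch, assuming a non-empty input string; toNumericReferences is a hypothetical name I made up for illustration.

```javascript
// Hypothetical helper: build both numeric character references
// for the first character of a (non-empty) string.
function toNumericReferences(character) {
	var codePoint = String(character).codePointAt(0);
	return {
		decimal: "&#" + codePoint + ";",
		hexadecimal: "&#x" + codePoint.toString(16) + ";"
	};
}

var sigma = toNumericReferences("\u03a3"); // the summation symbol
console.log(sigma.decimal);     // "&#931;"
console.log(sigma.hexadecimal); // "&#x3a3;"
```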

I wrote some sample escaping and unescaping functions in JavaScript. These functions are space-conservative and not universal but do capture the most typical means of escaping and unescaping HTML special characters. To work properly, it is important to escape & first and unescape &amp; last. Also, note I had to use escaping so the browser would render properly!


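// Escape the five special characters. The ampersand must be escaped first,
// otherwise the ampersands of the other references would be escaped again.
var escapeHtml = function(html)
{
	return String(html)
		.replace(/&/g, "&amp;")
		.replace(/"/g, "&quot;")
		.replace(/'/g, "&#39;")
		.replace(/</g, "&lt;")
		.replace(/>/g, "&gt;");
};
// Unescape the same five characters. The &amp; reference must be unescaped
// last, otherwise text such as &amp;lt; would be unescaped twice.
var unescapeHtml = function(html)
{
	return String(html)
		.replace(/&gt;/g, ">")
		.replace(/&lt;/g, "<")
		.replace(/&#39;|&#x27;|&apos;/g, "'")
		.replace(/&quot;/g, '"')
		.replace(/&amp;/g, "&");
};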
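
To see why &amp; has to be unescaped last, here is a small sketch comparing the two orders on one input. Both helper names are hypothetical and handle only two references, which is enough to show the problem.

```javascript
// Hypothetical helpers for illustration only.
function unescapeAmpFirst(text) {
	return String(text)
		.replace(/&amp;/g, "&")
		.replace(/&lt;/g, "<");
}
function unescapeAmpLast(text) {
	return String(text)
		.replace(/&lt;/g, "<")
		.replace(/&amp;/g, "&");
}

// The escaped form of the literal text "&lt;" is "&amp;lt;".
var escaped = "&amp;lt;";
console.log(unescapeAmpFirst(escaped)); // "<"    (wrong: unescaped twice)
console.log(unescapeAmpLast(escaped));  // "&lt;" (right: original text restored)
```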

HTML5 tags and other text

HTML5 includes tags and other text (non-tag).

Tags
With chevrons, I can form tags that enhance text content with special semantics (meaning). Each tag starts with a < left chevron, and ends with the > right chevron. Most tags end with the first following > right chevron outside attribute quotes (except for comment and CDATA tags as described). Tags do not overlap other tags, and tags cannot be nested within tags. Tags do not have space following < left chevron, but may have space preceding the > right chevron (except for comment and CDATA tags as described). As a consequence, a left chevron with space following, and any unmatched right chevron, may be used cautiously without escaping.
Other text (non-tag)
Everything outside tags. Other text (non-tag) may include space anywhere.
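
The tag rules above can be sketched as a regular expression. This is a simplified approximation, not a real HTML parser: it assumes no comment or CDATA tags and no raw text elements, and the names TAG_PATTERN and findTags are hypothetical.

```javascript
// Simplified pattern: "<", optional "/", a letter, the rest of the tag name,
// then quoted strings or other characters, up to the closing ">".
var TAG_PATTERN = /<\/?[a-z][^\s>\/]*(?:"[^"]*"|'[^']*'|[^'">])*>/gi;

// Hypothetical helper: list every start tag and end tag found in the text.
function findTags(html) {
	return String(html).match(TAG_PATTERN) || [];
}

// Finds two tags: <a title="x > y"> and </a>.
// The > inside the quotes does not end the start tag, and the
// trailing "1 < 2" is other text because space follows the left chevron.
console.log(findTags('<a title="x > y">link</a> 1 < 2'));
```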

HTML5 Elements

Now that I've got past some ground rules, I can describe the core construct of HTML5 — the element. Elements define semantics. Element attributes and element content can control presentation (display/layout/arrangement/flow), and behavior (crawlers/caching/animation/re-flow) when displayed from a browser. To me, elements are simply the ♥ of HTML5.

Element

Starts with a start tag, then content (innerHTML), and ends with an end tag. Occasionally, the start tag is optional, sometimes the content (innerHTML) is optional or not allowed, and sometimes the end tag is optional or not allowed.

Nesting / parents

I can nest elements within the content (innerHTML) of other elements. Elements cannot partially contain other elements, so all elements starting before the end of an element are considered child elements, and all child elements end at or before the end of the parent element. Each element has one parent, except the root element which has none.
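
The nesting rule can be checked with a stack. The sketch below assumes a pre-split list of explicit start and end tags without attributes, void elements, or omitted tags; isProperlyNested is a hypothetical name.

```javascript
// Hypothetical helper: true when every end tag closes the most recently
// opened element, so no element partially contains another.
function isProperlyNested(tags) {
	var open = [];
	for (var i = 0; i < tags.length; i++) {
		var match = /^<(\/?)([a-z][a-z0-9]*)>$/i.exec(tags[i]);
		if (!match) return false;
		var name = match[2].toLowerCase(); // tag names ignore case
		if (match[1] === "") open.push(name);
		else if (open.pop() !== name) return false;
	}
	return open.length === 0;
}

console.log(isProperlyNested(["<div>", "<p>", "</p>", "</div>"])); // true
console.log(isProperlyNested(["<b>", "<i>", "</b>", "</i>"]));     // false
```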

Tag name

Alphanumeric string to label and match start and end tags. The tag name is insensitive to case so <HTML> and <html> are both acceptable as the root start tag.

Attribute name

String. Must not contain any Unicode control characters, Unicode noncharacters, space, nor any of " ' > / = (double quote, single quote, right chevron, slash, and equals sign).

Attribute

Starts with an attribute name, followed by optional space, followed by optional equals sign, followed by optional space, followed by optional attribute value (quoted or unquoted, but the attribute value must be present whenever equals sign is present). If the equals sign and value are omitted from the attribute, the value is implicitly the empty string.

Many elements have special attributes, but all elements may have Global Attributes or Event Attributes.

Boolean attribute
Some attributes need no attribute value by definition, so the attribute value is neither required nor meaningful.
Attribute value

I can specify an attribute value as unquoted value, single-quoted 'value', or double-quoted "value". The value itself may include character references, but must not contain an ambiguous ampersand. Within attribute values, I must use character references &lt; &gt; &amp; &quot; &#39; for < > & " ' characters, respectively.

Unquoted attribute value
I must follow an unquoted value with a space (to continue the start tag), > right chevron (to finish the start tag), or, /> slash and right chevron for self-close syntax. In the value of an unquoted attribute, I cannot have literal space, quotes (' " ` single, double, nor back tick), equals sign (= parses in browsers anyway), nor chevrons (< > left nor right) in the value. I ♥ unquoted attributes.
Quoted (single/double) attribute value

I may include space in the value, may skip space after the end quote, and must escape some characters as already mentioned. I am glad that attribute values wrapped in single quotes may use double quote literal characters, and vice versa. I cannot quote the same quotes within quotes due to ambiguous parsing problems that would arise if allowed. In other words, values wrapped in single quotes cannot contain a literal single quote, and values wrapped in double quotes cannot contain a literal double quote.
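
The quoting choices above can be sketched as one function that picks the lightest syntax a value allows. It is a sketch under my own assumptions, not a complete serializer, and formatAttributeValue is a hypothetical name.

```javascript
// Hypothetical helper: write an attribute value unquoted when possible,
// otherwise wrapped in whichever quote the value does not contain.
function formatAttributeValue(value) {
	// Escape ampersand first, then chevrons.
	var text = String(value)
		.replace(/&/g, "&amp;")
		.replace(/</g, "&lt;")
		.replace(/>/g, "&gt;");
	// Unquoted: non-empty, with no space, quote, back tick, or equals sign.
	if (/^[^\s"'`=]+$/.test(text)) return text;
	// Quoted: prefer the quote character that the value does not contain.
	if (text.indexOf('"') === -1) return '"' + text + '"';
	if (text.indexOf("'") === -1) return "'" + text + "'";
	// Both kinds of quote present: wrap in double quotes and escape them.
	return '"' + text.replace(/"/g, "&quot;") + '"';
}

console.log("<a href=" + formatAttributeValue("shadow.html") +
	" title=" + formatAttributeValue("my favorite document") + ">");
// <a href=shadow.html title="my favorite document">
```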

Start tag

All start tags begin with < left chevron, then a tag name, optional space, optional attributes, optional space and a > right chevron at the end. Some start tags have attributes <a href=shadow.html title="my favorite document">. Some start tags are optional as described below.

html
If I omit the start tag, the root element starts implicitly with the first non-comment in the document (basically any element or other text (non-tag non-space)).
head
If I omit the start tag, the element starts implicitly with the first element, or is considered empty if the body element starts first.
body
If I omit the start tag, the element starts implicitly with the first element other than (meta, link, script, style, template, base), starts with the first other text (non-tag non-space) outside those elements, or is considered empty.
colgroup
If I omit the start tag, the element starts implicitly with col start tag not contained in a colgroup element, and the element must not be empty. The colgroup element cannot implicitly start before a sibling element ends.
tbody
If I omit the start tag, the element starts implicitly with tr start tag not contained in a sibling (thead, tfoot, tbody), and the element must not be empty. The tbody element cannot implicitly start before a sibling element ends.
Content (innerHTML)

Everything between a start tag and matching end tag. Content (innerHTML) starts just after the start tag (or its implicit start) and may contain tags, and other text (non-tag). Content (innerHTML) ends just before the end tag (or its implicit end) is reached. Void elements cannot have content (innerHTML).

End tag

All end tags begin with </ left chevron slash, then a tag name, optional space, and a > right chevron at the end. End tags cannot have attributes. Most start tags such as <div> must have a matching end tag </div> always. However, the end tag may be omitted for the following elements:

body
If I omit the end tag, the element ends implicitly at end of the parent html element or at end of the document. All spaces and comments extend the implicit end of the element by inclusion.
colgroup
If I omit the end tag, the element ends implicitly at end of the parent table element. All spaces and comments extend the implicit end of the element by inclusion.
dt
dd
If I omit the end tag, the element ends implicitly with the next dt or dd start tag or at end of the parent dl element.
head
If I omit the end tag, the element ends implicitly at start of the body element, or at end of the document. All spaces and comments extend the implicit end of the element by inclusion.
html
If I omit the end tag, the element ends implicitly at end of the document. All spaces and comments extend the implicit end of the element by inclusion.
li
If I omit the end tag, the element ends implicitly with the next li start tag or at end of the parent ol or ul element.
optgroup
If I omit the end tag, the element ends implicitly with the next optgroup start tag or at end of the parent select or datalist element.
option
If I omit the end tag, the element ends implicitly with the next optgroup or option start tag or at end of the parent optgroup, select or datalist element.
thead
tfoot
tbody
If I omit the end tag, the element ends implicitly with the next thead, tfoot or tbody start tag or at end of the parent table element.
tr
If I omit the end tag, the element ends implicitly with the next tr start tag or at end of the parent thead, tfoot or tbody element.
td
th
If I omit the end tag, the element ends implicitly with the next td or th start tag or at end of the parent tr element.
p
If I omit the end tag, the element ends implicitly at the start of the next
  • paragraph (p),
  • table (table),
  • list (dl, ol and ul),
  • section element (article, aside, blockquote, nav and section)
  • section enhancement (address, footer, h1, h2, h3, h4, h5, h6 and header),
  • grouping element (div, main, p and pre),
  • form container (fieldset and form),
  • thematic break (hr),
OR at the end of the parent (except if the parent is an a anchor element). In other words, a p paragraph element cannot contain ANY of the elements above that implicitly end it (none can be children).
rb
rt
rtc
rp
If I omit the end tag, the element ends implicitly at the end of the parent or at the next rb, rt, rtc or rp start tag, except that an rt element can be a child of an rtc element so an rt start tag cannot implicitly end an rtc element.
area
base
br
col
embed
hr
img
input
keygen
link
meta
param
source
track
wbr
These are void elements so I must not ever add an end tag to any of them.

Regular expressions for end tag omissions

The following describes my aggressive removal of end tags as an extension of my coding style.

  • This regular expression finds all end tags that I must remove.
    /<\/(area|base|br|col|embed|hr|img|input|keygen|link|meta|param|source|track|wbr)\s*>/i
  • This regular expression finds all end tags that I like to remove, but I may keep.
    /<\/(p|colgroup|thead|tfoot|tbody|tr|td|th|option|li|dt|dd|rb|rt|rp)\s*>/i
  • This regular expression finds other end tags that I like to keep, but I might remove.
    /<\/(html|head|body|optgroup|rtc)\s*>/i
  • This regular expression finds all start tags that must have a matching end tag.
    /<(canvas|noscript|script|template|address|article|aside|nav|section|footer|h1|h2|h3|h4|h5|h6|header|title|blockquote|div|dl|figcaption|figure|main|ol|pre|ul|a|abbr|b|cite|code|data|del|dfn|em|i|ins|kbd|mark|q|s|samp|small|span|strong|sub|sup|time|u|var|bdi|bdo|ruby|audio|iframe|map|object|video|caption|table|button|datalist|fieldset|form|label|legend|meter|output|progress|select|textarea)/i
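
As a usage sketch, the first regular expression above can drive a clean-up pass that strips end tags from void elements. I assume a global flag is added so every occurrence is replaced, and that the input has no comments or raw text elements containing look-alike tags; removeVoidEndTags is a hypothetical name.

```javascript
// Same pattern as the first expression above, with the g flag added.
var VOID_END_TAG = /<\/(area|base|br|col|embed|hr|img|input|keygen|link|meta|param|source|track|wbr)\s*>/gi;

// Hypothetical helper: delete every end tag that belongs to a void element.
function removeVoidEndTags(html) {
	return String(html).replace(VOID_END_TAG, "");
}

console.log(removeVoidEndTags("<p>one<br></br>two</p><hr></HR>"));
// "<p>one<br>two</p><hr>"
```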

Element "flavors"

The bulk of HTML5 semantics are in the elements, so now I can discuss different kinds of elements. The semantics and syntax of each element depend firstly on the tag name.

Void element
A void element is identical to a start tag syntactically. However, void elements must not have an end tag nor content (innerHTML). So the end of the start tag of a void element always implies the end of the element.
<input name="pi" type="number">.
For XML compatibility, void elements may have a slash in what's called self-close syntax.
<input name="pi" type="number" />.

Raw text element
Raw text elements must not contain this sequence: < left chevron, followed by / slash, followed by the tag name of the raw text element, followed by space, / slash, or > right chevron. In other words, a valid end tag of the raw text element will end the element no matter how or where it occurs in the raw text. For instance, quoting or commenting in a script does not prevent the end of a script element. Also, any other apparent tag (start, end, comment, etc.) contained in a raw text element is interpreted as regular text instead of as a tag.
script
style
Inescapable raw text elements cause the browser to ignore/escape all tags except for their own end tag. These elements also cause the browser to ignore/escape all character references, meaning they are interpreted as plain text instead.
textarea
title
Escapable raw text elements cause the browser to ignore/escape all tags except for their own end tag. These elements do not cause the browser to ignore/escape character references.
Foreign elements
Elements from the MathML namespace or the SVG namespace. Unlike XML, namespace prefixes on tag names are not allowed, even for foreign elements.
Normal elements
All other allowed HTML5 elements are called normal elements.
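
Because quoting does not protect a script element from its own end tag, one common workaround is to break up the sequence before embedding the script text. The sketch below assumes the sequence only occurs inside JavaScript string literals, where \/ means the same as /; escapeForInlineScript is a hypothetical name.

```javascript
// Hypothetical helper: rewrite "</script" as "<\/script" so the text can sit
// inside a script element without ending it early.
function escapeForInlineScript(source) {
	return String(source).replace(/<\/(script)/gi, "<\\/$1");
}

console.log(escapeForInlineScript('var s = "</script>";'));
// var s = "<\/script>";
```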

Global Attributes

The following attributes are considered generally applicable. I consciously reserve global event attributes for a separate article on how to watch HTML5 events, so I omit them here.

accesskey
May apply to any element. Adds keyboard key(s) to focus or activate the element.
aria-state
aria-property
May apply to any element. Any ARIA state and property attributes applicable to the allowed roles as specified in the HTML5 recommendation.
class
May apply to any element. An unordered space-separated list of classes assigned to the element.
contenteditable
May apply to any element. Default is inherited from the parent if missing. Specifies whether content is editable. Possible values:
true
The element may be edited.
false
The element may not be edited. Default value of the root element.
draggable
May apply to any element. Specifies whether the element is draggable.
true
The element may be dragged.
false
The element may not be dragged.
auto
Any img and a elements may be dragged if they have a URL.
data-custom
May apply to any element. Any attribute beginning with data- is called a custom attribute which can be attached to any element as non-visible data. Element data from a custom attribute named data-abc-xyz can be accessed in JavaScript via element.dataset.abcXyz.
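
The mapping from a custom attribute name to its dataset property follows a simple pattern: drop the data- prefix, then turn each hyphen followed by a lowercase letter into the uppercase letter. Here is a minimal sketch of that conversion; datasetKey is a hypothetical helper, since in a browser element.dataset does this for me.

```javascript
// Hypothetical helper: convert a custom attribute name into the property
// name used on element.dataset, or null when it is not a data- attribute.
function datasetKey(attributeName) {
	var match = /^data-(.+)$/.exec(String(attributeName).toLowerCase());
	if (!match) return null;
	return match[1].replace(/-([a-z])/g, function (whole, letter) {
		return letter.toUpperCase();
	});
}

console.log(datasetKey("data-abc-xyz")); // "abcXyz"
```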
dir
May apply to any element. Isolates the directionality of element contents with possible values as follows:
ltr
Text is displayed from left-to-right.
rtl
Text is displayed from right-to-left.
auto
Direction to be determined by algorithm.
hidden
May apply to any element. Hides the element and all child content when present. Boolean attribute (has no attribute value).
ID
May apply to any element. Specifies a unique identifier string for the element within context of the document. Space is not allowed in an identifier. For JavaScript access to an element with ID=myId, the following are equivalent.
  • var element = document.getElementById("myId");
  • var element = document.querySelector("#myId");
lang
May apply to any element. Specifies the primary language of content within the element following the BCP 47 standard. I want to specify the lang attribute on the html element so that my title element, meta elements' attribute values, and body element are all labeled with my default language.
role
May apply to any element. Supports ARIA role attribute values allowed by the HTML5 recommendation.
spellcheck
May apply to any element. Default is inherited from the parent if missing. Specifies whether to check spelling of editable content and mutable inputs. Possible values:
true
The element may be checked for spelling.
false
The element may not be checked for spelling.
style
May apply to any element. Container for direct CSS styling of the element.
tabindex
May apply to any element. A valid integer to specify focus ordering.
Negative values
Focusable, but not focusable via sequential tabbing unless that's the only way to focus it.
Zero value
Focusable by default platform sequencing.
Positive values
Focused in sequence from the lowest value to the highest, with equal values taken in document order (first to last). All elements with positive values are focused before any element with a zero value.
Recommended on the following elements:
iframe
object
A browsing context container.
a
link
If the href attribute is set.
button
input
keygen
object
select
textarea
Fields (if not hidden).
element contenteditable=true
If it has content.
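
The tabindex ordering rules can be sketched as a sort. The sketch assumes a plain array of objects given in document order, each with a hypothetical name and a numeric tabindex; sequentialFocusOrder is a made-up name, not a browser API, and platform-specific behavior is ignored.

```javascript
// Hypothetical helper: return names in sequential (tabbing) focus order.
function sequentialFocusOrder(elements) {
	// Remember document order so equal values keep their first-to-last order.
	var indexed = elements.map(function (element, position) {
		return { element: element, position: position };
	});
	var positive = indexed.filter(function (item) {
		return item.element.tabindex > 0;
	});
	var zero = indexed.filter(function (item) {
		return item.element.tabindex === 0;
	});
	// Positive values: low to high, then document order.
	positive.sort(function (a, b) {
		return (a.element.tabindex - b.element.tabindex) || (a.position - b.position);
	});
	// Negative values are left out of sequential tabbing entirely.
	return positive.concat(zero).map(function (item) {
		return item.element.name;
	});
}

console.log(sequentialFocusOrder([
	{ name: "a", tabindex: 0 },
	{ name: "b", tabindex: 2 },
	{ name: "c", tabindex: -1 },
	{ name: "d", tabindex: 1 },
	{ name: "e", tabindex: 2 },
	{ name: "f", tabindex: 0 }
]).join(" ")); // "d b e a f"
```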
title
May apply to any element. Describes advisory tooltip information.

translate
May apply to any element. Default is inherited from the parent if missing. Defines whether an element's content and translatable attributes should be localized. Possible values:
yes
The element may be translated. Default value of the root element.
no
The element may not be translated. Examples:
  • computer I/O
  • Content intended to be expressed in one particular language.
xml:base
xml:lang
May apply to any element, but only applicable to XML documents.

YES!! I have survived this exposé on how to plumb HTML5 documents, and I hope you have too.