Authoring Content in XML
![[XML Man]](/images/xmlman.png)
If you are familiar with the HTML specification, you may wish to hand code your content using a standard text editor. This is the method I personally prefer.
The text editor should support UTF-8, ideally with UNIX line breaks. On Linux/UNIX, any current text editor should do the right thing; I personally use Bluefish, though I much prefer 1.0.7 to the newer 2.x series (due to feature bloat). On OS X, any current text editor should also do the right thing; I personally recommend BBEdit. On Microsoft Windows, I strongly recommend against using Notepad and suggest PSPad or Notepad++ instead. When using those Windows editors, make sure to specify UTF-8 encoding, as that is not the default.
UTF-8 support in your text editor is critical. The DOMDocument class used by the DOMBlogger CMS currently only supports XML in the UTF-8 character encoding.
The type of line break your text editor uses is probably less critical.
Basic Document Structure
The very first line needs to declare your document as UTF-8 XML:
<?xml version="1.0" encoding="UTF-8"?>
The root node of your document needs to be a div node. When the DOMBlogger CMS imports your XML to generate a web page or EPUB3 archive, it grabs that node as the content node.
For Search Engine Optimization you should add a single metadesc node whose text content is what you want to appear in the description meta tag. This tag is not valid HTML5 and will not be included in the document when served; it is a DOMBlogger custom tag that tells the CMS what you want it to put in the valid description meta tag.
Optionally, you can add a keywords attribute to specify the contents of the keywords meta tag. For an article on model trains, your XML might now look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<div>
  <metadesc keywords="model trains,n scale,ho scale">
    A discussion of model trains and various size scales that are commonly used by hobbyists.
  </metadesc>
</div>
You can also specify the page description and keywords through the CMS interface but I prefer to keep them with my source document.
From this point forward, you can add almost any element that is defined in HTML5, MathML, or SVG and is appropriate within a body node.
XML Primer
XML is very strict. Element (also known as tag or node) names are case sensitive. All elements must be closed. Only named entities that are known to the XML parser or defined in a referenced DTD are allowed.
The DOMBlogger CMS helps to ease some of this. When your XML file is uploaded, the CMS will translate most named entities for you (specifically entities defined in HTML and MathML) into their UTF-8 character or numbered entity equivalents. It also passes your XML through tidy in XML mode to clean up any code issues that might prevent your XML from being parsed by an XML parser.
It is better however for you to properly code from the start. Sometimes the way libtidy corrects your XML may not be what you expect.
Case Sensitivity
XML is case sensitive. XHTML uses lower case element names and attributes. DOMBlogger makes no attempt to compensate for the use of upper case element names or attributes, so make sure you use correct lower case element names and attributes.
Element (Tag) Sanity
HTML is a rather sloppy standard. Some elements (e.g. br and img) are never closed. Some elements (e.g. p and li) are optionally closed but may be left open.
XML is not sloppy. All elements must be closed. Take the following valid HTML snippet:
<p>This is a paragraph.
<p>This is another paragraph.
<p>This is a third paragraph.
For that to be proper XML, we have to close the p elements:
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
<p>This is a third paragraph.</p>
Some elements are not allowed to have a separate closing tag. This is because technically they are not allowed to have children. To close those elements, we add the closing / inside the initial element declaration itself. The HTML
<img src="/images/foo.gif" alt="[some desc]">
<br>
<hr>
would be represented in XML as
<img src="/images/foo.gif" alt="[some desc]" />
<br />
<hr />
Named Entities
A named entity is a way of referencing a character by a name that is usually easier for humans to remember than the character's numeric equivalent. Named entities are usually defined for characters that have special meaning when used directly or that may not be directly available on a standard keyboard.
For example, < and > have special meaning in XML and most keyboards do not have a way to directly type the © symbol.
In general, with a few noted exceptions, it is best not to use named entities. Instead, try to use the actual UTF-8 character or use a numbered entity. For example, to indicate the copyright symbol, enter an actual © instead of &copy;.
You can find a list of named entities and their UTF-8 character equivalents that you can copy and paste from at the Wikipedia Character Entity Reference.
Note that there are some named entities that are defined in raw XML and are always safe to use in XML:
Named Entity | Result | Note
---|---|---
&amp; | & | Always use named entity to represent this character.
&apos; | ' | In most cases you can just type this character directly.
&quot; | " | In most cases you can just type this character directly.
&lt; | < | Always use named entity to represent this character.
&gt; | > | Always use named entity to represent this character.
In addition to those five named entities, there are a few cases where you may not be able to use the actual character. For example, in HTML the named entity &nbsp; is used to create a non-breaking space. Since the &nbsp; named entity is not defined in raw XML without a DTD and the character is not easy to copy and paste, you should use the numbered entity &#160; to indicate a non-breaking space.
Another case, if you use MathML, is the named entity &InvisibleTimes;. It is used when you want a multiplication operator that is not actually displayed. Since the actual character has no width, you should use the numbered entity &#8290;.
DOMBlogger does try to compensate for the use of named entities in an XML content file, but since no DTD is given to define them, it really is best practice to avoid any except the five that are part of the XML spec.
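As a rough illustration of the advice in this section, the fragment below uses numbered entities in place of the HTML and MathML named entities. The paragraph text and the small MathML expression are made up for this example, and the xmlns declaration on the math element is an assumption about what your content needs. Only the entity numbers matter: 160 for the non-breaking space, 169 for the copyright symbol, and 8290 for the invisible multiplication operator.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<div>
  <!-- Numbered entity 160 keeps "N" and "scale" together on one line -->
  <p>Most of my layout is N&#160;scale. &#169; 2011</p>
  <!-- Numbered entity 8290 is the invisible multiplication between 2 and x -->
  <math xmlns="http://www.w3.org/1998/Math/MathML">
    <mrow><mn>2</mn><mo>&#8290;</mo><mi>x</mi></mrow>
  </math>
  <!-- The five entities built into XML are always safe to use -->
  <p>Write &lt;br /&gt; for a line break &amp; close every element.</p>
</div>
```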
The HTML5 Article
When producing an HTML5 document, content that would be considered to be an article should use the HTML5 article semantic markup.
The specification indicates that an article element is a “self-contained composition in a document, page, application, or site and that is, in principle, independently distributable or reusable, e.g. in syndication” (from the W3C HTML5 Overview).
They then go on to describe rather intricate cases where article elements can contain child article elements, etc.
I like to keep things KISS. When I implement an article element, it indicates the content is a self-contained composition in a document that would be considered to be an article by a normal person who does not care what HTML is. I never contain articles within articles. Even though it is legal to do so, in most cases I consider it to be a poor coding practice. If I have more than one article on a page, they are independent children of a parent div element (usually the main content div).
Keeping the concept simple makes writing code that does things with the content simpler, and simpler code is less likely to have bugs.
I suggest sticking to the simple definition of the article element that I embrace, but of course you are free to do things your own way if you so choose.
You should avoid using the HTML5 article structure for content that is not an article. For example, the index page to DOMBlogger Documentation is not a self contained composition. Rather it points to pages that have self contained compositions. As such it is not an article and I do not place it within a semantic article structure.
HTML5 validators do not have artificial intelligence to determine what is an article and what is not. It is up to the content producer to properly semantically tag their content.
The DOMBlogger CMS can generate the HTML5 article structure for you. You do not need to worry about implementing the HTML5 article structure yourself unless you do not like the manner in which DOMBlogger generates the structure (in which case feedback would be appreciated).
However, it probably is a good idea for you to be familiar with the key semantic components of an HTML5 article.
The article Element
This is the root element of an HTML5 article. Even though the specification specifically allows it, it is in my opinion better if it is not the child node of a parent article element. With most modern layout schemes, it usually should be a direct child of your document's main content div node.
This structurally semantic element is usually associated with an h1 level heading.
The section Element
Within the context of an article, a section indicates a grouping of related content that makes up a part of the article. Unless your article is very brief, it probably will need to have one or more sections defined within it.
Within the context of an article, a section element should be the direct child of a parent article or section element. You can have as many sections within the structural parent as your content needs.
This structurally semantic element is usually associated with h2 through h6 level headings.
The header Element
This is an optional strictly semantic element and when used in the context of an HTML5 article, it should be the very first child of a parent article or section element.
The header element usually contains the heading element associated with the content and may also contain meta information about the content. Examples of such meta information include authorship, modification date, and a navigation list to sub sections.
The footer Element
This is an optional strictly semantic element and when used in the context of an HTML5 article, it should be the very last child of a parent article or section element.
It is very much like the header element in the kind of content it contains, although it probably should never contain a heading. Frequently it is used for copyright notices, footnotes, etc.
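Putting the four elements together, a hand-written article might be laid out roughly as follows. This is only a sketch of the structure described above, not necessarily the exact markup DOMBlogger generates; the headings, author name, and text are placeholders.

```xml
<div>
  <article>
    <!-- Optional header: first child of the article, holds the h1 -->
    <header>
      <h1>Model Train Scales</h1>
      <p>By A. Hobbyist</p>
    </header>
    <!-- Sections are direct children of the article -->
    <section>
      <header>
        <h2>N Scale</h2>
      </header>
      <p>Content of the first section.</p>
    </section>
    <section>
      <!-- The header wrapper is optional, so the h2 can stand alone -->
      <h2>HO Scale</h2>
      <p>Content of the second section.</p>
    </section>
    <!-- Optional footer: last child of the article, no heading -->
    <footer>
      <p>Copyright &#169; 2011 A. Hobbyist.</p>
    </footer>
  </article>
</div>
```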
Other Elements
For a full listing of HTML5 elements and discussion of their use, please see the HTML5 Doctor Element Index. There are some elements however that I would particularly like to discuss here. These elements can be used in the context of an HTML5 article but do not have to be.
The nav Element
The nav element is a semantic container element that indicates its contents are hypertext navigation links.
In common practice, a nav element has an unordered list (ul element) of hyperlinks, but there is no rule on how the navigational links are structured. Unordered lists are popular because they can have their layout easily customized via CSS.
Since there is a one to one correspondence between the semantic meaning of a nav element and the semantic meaning of the ARIA role="navigation" attribute, the DOMBlogger CMS will automatically add that attribute to any nav elements in your content.
When authoring CSS class styles for the nav element, you should not assume any nav elements you author will be served as nav elements to requesting clients. When the DOMBlogger CMS serves content to web browsers that do not support the application/xhtml+xml mime type, any nav elements are changed into div elements. This is done to solve some rendering problems that only happened in those browsers.
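A small sketch of a nav element authored along these lines follows. The class name sectionnav and the link targets are hypothetical. Because the element may be served as a div to some browsers, CSS in your external style sheet is probably safer written against the class (.sectionnav) than against the nav element name. The role attribute is left out because the CMS adds it.

```xml
<nav class="sectionnav">
  <ul>
    <li><a href="#nscale">N Scale</a></li>
    <li><a href="#hoscale">HO Scale</a></li>
  </ul>
</nav>
```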
The details Element
The details element is used to make content available that the reader may choose to show or hide. It has a child element named summary that is always presented to the user. Example:
<details data-addbutton="true">
  <summary>Detailed information on Jabberwocky</summary>
  <p>From <a href="http://en.wikipedia.org/wiki/Jabberwocky">Wikipedia</a>:<br/>
  “Jabberwocky” is a nonsense verse poem written by Lewis Carroll in his 1872 novel Through the Looking-Glass, and What Alice Found There, a sequel to Alice’s Adventures in Wonderland.</p>
</details>
If your browser supports the details element it should only show the summary until the user toggles the summary to see the additional content.
For browsers that do not have native support for the details element, DOMBlogger emulates the functionality of the details element if JavaScript is enabled. The above code produces the following result:
Detailed information on Jabberwocky
From Wikipedia:
“Jabberwocky” is a nonsense verse poem written by Lewis Carroll in his 1872 novel Through the Looking-Glass, and What Alice Found There, a sequel to Alice’s Adventures in Wonderland.
Emulating details
At the time of writing, only Google Chrome actually supports the details element, and unfortunately it does so in a way that does not allow opening and closing the node without the use of a mouse. For users who have trouble operating a mouse, this is a major problem.
For other browsers, we attempt to emulate the proper functionality of the details node with JavaScript.
The goal of emulation is to allow browsers that do not have native support for details to function as if they did without the content producer needing to do anything special in their markup to trigger the emulation.
More information on our details emulation can be found at the JSdetails Project Page.
Our emulation is currently incomplete:
Internet Explorer
Emulation fails in Internet Explorer (tested version 8 in Windows XP Pro). The details always show. This is certainly fixable and is likely yet another case of Internet Explorer taking a different interpretation on how to render CSS.
details Accessibility
The manner in which Google Chrome implements details is not accessible to users who have trouble operating a mouse. If you navigate a web page with the tab key, it will navigate right past it. Setting the tabindex attribute allows selection of the summary node but you still can not toggle it.
Our emulation for other browsers in theory should be accessible but seems to only be accessible in the Opera browser. Other browsers fail to implement a way to trigger the click event needed to toggle the display of the content.
To remedy this situation, a standard HTML button element can be added to toggle the state of the details node. Unfortunately we can not add the button by default since it might interfere with the intended layout of the web page.
If you use the details node, we suggest you add the attribute data-addbutton="true" so that the DOMBlogger CMS will know it is OK to add the button element needed to make your content properly accessible.
The figure Element
The figure element is used to indicate the content within is a figure related to the main article content. You can optionally use the figcaption element to give a caption to the figure. I highly recommend you do so. Adding a figcaption element allows you to provide a proper description of the figure that may be of benefit to users who may not be able to see graphical representations you have in the figure.
Image Container
One recommended use of the figure element is to use it as a wrapper container around an image. This allows you to then use the figcaption element to provide a detailed caption to accompany the image.
Traditionally image captions have been provided using the title attribute but this is no longer recommended. There are three main problems with using the title attribute:
- Most browsers do not provide a way to see the contents of the title attribute without the use of a mouse.
- Currently it appears that no mobile browser (e.g. for Android or iOS) ever displays the contents of a title attribute.
- The contents of a title attribute can not contain additional markup.
If your image needs a caption, it is thus better to place the image inside a figure node and use the figcaption element to provide that caption.
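As a minimal sketch of that arrangement, the fragment below wraps the earlier example image in a figure. The caption text and the link target are placeholders; the point is that a figcaption can carry markup that a title attribute cannot.

```xml
<figure>
  <img src="/images/foo.gif" alt="[some desc]" />
  <figcaption>
    A caption can contain markup, such as <em>emphasis</em> or a
    <a href="/docs/sources.xml">link to the source</a>.
  </figcaption>
</figure>
```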
An excellent article on the use of the figure element as a wrapper container for images can be found at www.paciellogroup.com.
Floating Figures
The figure element can also be used when layout placement of the figure is not critical, such as when it is OK for the figure to be floated down within the document. I do not know of any browsers that do float figures and I kind of doubt any ever will, but typesetting software (such as LaTeX) frequently does float figures.
If you want figures floated to the end, it is my suggestion to manually create an h2 heading towards the end of your content called Figures and put the figures you want floated to the end there, preferably using an h3 heading for each figure.
At some point in the future, the DOMBlogger CMS may provide facilities by which figures can be optionally floated to the end of an HTML5 article when served yet remain placed near the content that references them in your source XML to allow for easier document maintenance. This has not yet been implemented.
The strong Element
The strong element is used to semantically indicate strong importance of the content. Functionally it is very similar to the b element in that it usually results in a bold rendering of the content, but it is better from a conceptual point of view since it separates design from content. I strongly recommend that you use the strong element and avoid using the older b element.
The older b element is still valid, but in most cases if the effect you want is embolding of text without semantically indicating strong importance, you probably should do so with CSS.
The em Element
The em element is used to semantically indicate an emphasis of the content. Functionally it is very similar to the i element in that it usually results in an italic rendering of the content, but it is better from a conceptual point of view since it separates design from content. I urge you to use the em element and avoid using the older i element.
The older i element is still valid, but in most cases if the effect you want is slanting of text without semantically indicating emphasis, you probably should do so with CSS.
The time Element
This element has no impact on visual rendering and I am not aware of any browsers that actually do anything with it. Where I tend to use it is when referencing material I expect may change, especially when referencing material of a controversial nature on publicly modifiable web sites like Wikipedia. The time element gives me a convenient way to put a time stamp into my document that I can refer to at a later date if I need to understand the time context of the content.
When used by itself with no attributes, it simply indicates that the contents of the node is a date, time of day, or both. For example:
<p>The radio show starts at <time>9:00 P.M.</time> every week.</p>
The datetime attribute
This is where the real benefit is at, and why I use it. With the datetime attribute the contents of the node can be almost anything because the attribute provides the standardized date and/or time.
With the datetime attribute, you can specify a specific date and/or time using a well understood standard format. For example:
<p>I am out of the office and will be returning on <time datetime="2011-11-16T09:30-08:00">Wednesday</time>.</p>
The attribute must have either the date and/or the time and it uses the following convention:
date
The accepted date format is YYYY-MM-DD.
time
The accepted time format is hh:mm:ssTZD.
For time, hour and minute (hh:mm) are required and use a 24-hour clock. Seconds (ss) are optional. TZD is optional if a date is not specified but it is required if a date is specified.
The TZD indicates the offset from Greenwich Mean Time (a.k.a. UTC). For example: -08:00 for Pacific Standard Time (and -07:00 for Pacific Daylight Time). To specify UTC itself you can just use a Z (as in Zulu) for the TZD.
When both date and time are specified, date comes first and the character T is used as a separator between the date and time.
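The rules above can be sketched with a few sample values. The visible text inside each time element is arbitrary and only the datetime attribute follows the convention; the list wrapper is just there to keep the fragment well formed.

```xml
<ul>
  <!-- Date only: YYYY-MM-DD -->
  <li><time datetime="2011-11-16">November 16</time></li>
  <!-- Time only: 24-hour hh:mm, with seconds and TZD left out -->
  <li><time datetime="21:00">9:00 P.M.</time></li>
  <!-- Date and time joined by T; TZD is required because a date is given -->
  <li><time datetime="2011-11-16T09:30-08:00">Wednesday morning</time></li>
  <!-- The same instant expressed in UTC, using Z as the TZD -->
  <li><time datetime="2011-11-16T17:30Z">Wednesday morning, Pacific time</time></li>
</ul>
```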
The pubdate attribute
This is a Boolean attribute that indicates the specified date is the date of publication. With Boolean attributes, the value of the attribute does not matter; the presence of the attribute itself indicates the condition is true. In HTML you do not need to assign a value to Boolean attributes, but in XML every attribute must have a value. It is customary to set the value of a Boolean attribute to the name of the attribute. For example:
<time datetime="2011-11-19" pubdate="pubdate">November 19, 2011</time>
There are some restrictions on where this attribute should be used. In the context of semantic article markup only one time element should have the pubdate attribute. Outside of the semantic article context it references the document as a whole and only one time node may have the attribute.
While not strictly required, a time node that has a pubdate attribute should also have a datetime attribute.
NOTE: This is a classic example of why I do not agree with article nodes having child article nodes.
If scraping an article for the publication date and the article has multiple articles within, you have more complex code because you need to determine the context of any time nodes that have the attribute to determine if they apply to a sub article (and which sub article) or the larger parent article.
Keep It Simple, Silly!
The data-addtitle attribute
This is not a standard HTML5 attribute and has no official meaning outside of the DOMBlogger CMS.
If you have specified a datetime attribute and you also specify the DOMBlogger custom attribute data-addtitle, the CMS will create a title attribute for your time element with a human readable interpretation of the date and/or time specified in the datetime attribute.
The DOMBlogger standard style sheet will mark the contents of any time element that has a title attribute with a soft dotted underline so that users can know additional context is available.
Please be aware that this is of very limited value. No one using a mobile browser will be able to see the contents of the generated title attribute.
Desktop users who have difficulty operating a mouse will not be able to see the contents of the generated title attribute.
Screen readers may ignore the contents of the generated title attribute.
Individuals who have trouble with low contrast may not notice the dotted underline and may not even be aware that they can get additional context by putting their mouse over the rendered element.
Individuals with perfect vision may not understand that the dotted underline indicates additional context is available.
If the time or date is critical to understanding the context of your content, you can not rely upon this feature to provide it to your users. Instead you should insert the time or date as normal text to give the necessary context.
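A hedged sketch of the attribute in use follows. The value "true" is an assumption modeled on data-addbutton="true"; this page does not state which value the CMS expects. The date is also written out in the visible text, so readers who never see the generated title attribute still get the context.

```xml
<!-- data-addtitle="true" is an assumed value, not confirmed by this page -->
<p>The meeting has moved to
  <time datetime="2011-11-16T09:30-08:00" data-addtitle="true">Wednesday, November 16 at 9:30 A.M. Pacific</time>.</p>
```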
The abbr element
If you use acronyms or abbreviations in your content, they should be wrapped inside of an abbr element. It may help screen readers apply the correct logic in figuring out how to pronounce the content and it also may prevent accidental translation of acronyms when a reader uses language translation software.
The title attribute
The title attribute is optional and can be used to indicate the full meaning of the acronym or abbreviation. You should not however rely solely upon the title attribute to make the meaning known to your readers.
Whether or not you use a title attribute any abbreviation or acronym should be defined either in plain text the first time it is used or alternatively in a list of abbreviations.
Within the context of an HTML5 article any list of abbreviations should probably be within the article structure itself so that it remains with the article if the article is scraped by a content scraper for presentation elsewhere.
If the DOMBlogger CMS generates the article structure for you, a list of abbreviations and their corresponding expansion will be added to the footer container at the end of the article. The CMS is only able to do this for abbr elements that have the title attribute.
Best Practices
It is our opinion that the title attribute should almost always be used with an abbr element that is a child of a heading element (h1 through h6).
The reason for this is strictly SEO. In the rare cases where an abbreviation used in a heading is also spelled out in the heading, then the title attribute need not be used there.
Outside the context of a heading, it is our opinion that the title attribute should only be used once per abbreviation, and only if it has not already been used in the context of a heading.
The reason for this is that it apparently can be quite annoying for users of screen readers that support the title attribute in an abbr element to have it present each and every time a particular abbreviation is used.
You do not really even need to use the title attribute at all if the first use is accompanied by an explanation of the abbreviation, but if you want it to be listed in the generated list of abbreviations that DOMBlogger creates when it constructs HTML5 article structure, then it needs to be used at least once within the article.
The contents of the title attribute should be in the same language as the web page it appears in. If your web page is in English then title="exempli gratia" would not be appropriate for the abbreviation e.g. as it does not help the reader understand what it means.
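One way these practices might look in markup is sketched below. The heading and sentences are invented for illustration; the pattern is a title attribute in the heading, no repeat of it for the same abbreviation in the body, and a single title on the first use of an abbreviation that does not appear in a heading.

```xml
<section>
  <!-- Heading use: title attribute included -->
  <h2>Validating <abbr title="Extensible Markup Language">XML</abbr></h2>
  <!-- Already expanded in the heading, so no title here -->
  <p>Every <abbr>XML</abbr> file is checked when it is uploaded.</p>
  <!-- First use outside a heading: defined in plain text, title used once -->
  <p>The output can be an Electronic Publication
    (<abbr title="Electronic Publication">EPUB</abbr>) archive.
    Later mentions of <abbr>EPUB</abbr> omit the title.</p>
</section>
```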
DOMBlogger Rendering of the abbr Element
Most browsers by default will render abbr with a soft dotted line under it. Unfortunately this can become aesthetically very distracting and annoying. This aesthetic annoyance has unfortunately resulted in some web developers neglecting to use the tag at all.
I believe the logic behind this odd rendering is the assumption that there will be a title attribute, so the browsers use the visual cue to inform users they can put their mouse over the abbreviation to see what it means.
The DOMBlogger CMS over-rides this default rendering. In our common.css file we remove any styling from the abbr element.
When the abbr element has a title attribute, then we specifically apply the soft dotted line underneath the abbreviation unless it is a child of a heading node or a hyperlink.
This design decision allows our users to use the abbr element freely without worrying about their content becoming too visually cluttered as it avoids the dotted underline in headings and when a title attribute does not exist.
The acronym element
Please note that the acronym element from HTML4 is not valid in HTML5. The DOMBlogger CMS will change any acronym nodes into abbr nodes for you, but you still should avoid using the acronym element.
I find it interesting that with all the new semantic tags and a push for semantic markup, they actually chose to remove a tag that conveyed semantic meaning.
Accessibility Issues
There are some accessibility issues related to the abbr element that I am still researching. Specifically, how does one specify the proper way to pronounce things?
There are several different types of abbreviations. One type is what is known as an initialism. With an initialism, the letters of the abbreviation should be read as letters. HTML is an example. Typically initialisms are all upper case or have periods between the letters, and usually that is sufficient to identify to the user that the word is an initialism.
Some abbreviations however use all upper case letters but are intended to be read as a word. Such an example is SCUBA. This type of abbreviation is typically referred to as an acronym.
There are some cases where the correct pronunciation is mixed. MathML for example should be read as math followed by the pronunciation of the letters M and L. Generally these mixed cases are easy to identify because the part that is read as a word uses lower case letters after the initial letter.
There are some cases where the abbreviation when read should always be fully expanded. For example, W. Va. should be read as West Virginia. I like to call these abbreviations shorthand, but there may be a more precise term for them.
There are undoubtedly some cases where the correct pronunciation must be specified but is different than what would be appropriate for the title attribute. Where the appropriate pronunciation would be specified, I do not really know.
What we need is a proper method to cue appropriate pronunciations to screen readers. I do not know if WAI-ARIA provides facilities for this, or if it is the realm of aural style sheets, or how the problem should be approached.
If the proper way to solve it is through an aural style sheet, my guess is that the proper thing to do would be to apply a default aural class to the abbr element that reads upper case letters as spelled out letters unless they are followed by lower case letters, in which case read it as a word. That would probably do the right thing most of the time.
Potentially class="acronym" or something similar could be used for cases like SCUBA to indicate they should be read as a word instead of spelled out, even though all the letters are upper case.
For cases like the W. Va. example, a class or perhaps an aria attribute would be needed to instruct the screen reader that it should read the contents of the title attribute.
I am going to look into this issue some more and hopefully find a solution that properly works.
Forbidden Tags and Attributes
The DOMBlogger CMS does not allow in-line style or scripts for security reasons. If you use the script or style nodes within your content XML, they will be removed from your content before the page is served. You should instead use external scripts and style sheets that are referenced in the document head (which you can access through the CMS admin interface).
Similarly, the various on* attributes to trigger a script and the style attribute are forbidden and will be ripped from the document before it is served. You should use proper event handlers defined in an external script file to attach events to nodes and you should use the class attribute to reference a CSS class defined in an external CSS file instead of a style attribute.
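As a rough before-and-after sketch, the commented-out paragraph below uses the forbidden attributes, and the paragraph after it shows the kind of markup that survives. The class name warning and the id encodingnote are hypothetical; the matching CSS rule and event handler would live in your external style sheet and script file.

```xml
<!-- Would be stripped: <p style="color: red;" onclick="warn();">Check your encoding.</p> -->
<p class="warning" id="encodingnote">Check your encoding.</p>
```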
Event Handlers
While generic JavaScript has had support for adding event handlers for some time, there are two competing methods for doing it. You will need a wrapper function to determine which browser is running your script and attach the event handler in a way that is compatible with that browser.
However if you use jQuery, it takes care of all that for you. DOMBlogger loads jQuery automatically with every page already, so if you need to utilize event handlers, you might as well take advantage of jQuery. See the jQuery Events API page for details.