XML Sucks
Saturday, November 16, 2013
Monday, December 11, 2006
File Formats
Scribe
TeX / LaTeX
As Steve Hirsch noted in a comment on this blog (before I moved it and lost the comments), I forgot TeX.
PDF (Portable Document Format)
A whole book could be written about PDF. In fact, one has. Several.
It's powerful, I guess, but it sure is complicated. PDF would have been successful 10 years earlier if reading and writing the format were easier. Even the commercial libraries that purport to import/read PDF files don't work very well, for the most part.
Part of this is the richness of the imaging model supported by PDF. But not all of it. There are too many options, too many compression schemes, a binary form, a non-binary form...
Enough said.
EPSF (Encapsulated PostScript)
EPSF is a file format that I designed myself, back in about 1987, when I ran Adobe's Developer Program, yet I will take potshots at it, for the sake of argument.
PostScript is (was?) a programming language, and as such, didn't make for a great file format. But there was a strong need to include PostScript "clip art" into larger pages, composed by PageMaker and all the page layout apps that followed.
Since PageMaker and the rest could not be expected to interpret the PostScript, there was a separate set of metadata that accompanied the PostScript file that allowed it to be "placed". The metadata included a bitmap preview of the graphic (so it could be placed in a relatively WYSIWYG way), plus bounding box information, font information, etc.
This extra metadata was embedded in the header of the file with special comment syntax, like this:
%%BoundingBox: 0 0 612 792
A line-oriented file format, easy to parse, easy to use, but somewhat error-prone. It's been in continuous use for 18 years, so it can't be completely broken, I suppose.
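To give a feel for how little code such a header comment needs, here is a minimal sketch in C. The function name parse_bounding_box is made up for illustration, and it handles only the four-integer form shown above, not every variation a real EPSF reader would have to accept.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not from any EPSF library): parse one
   "%%BoundingBox:" header comment. On success returns 1 and fills
   box[0..3] with the four integers in the order they appear;
   returns 0 for any other line. */
int parse_bounding_box(const char *line, int box[4])
{
    assert(line != NULL && box != NULL);
    /* "%%%%" in the format string matches the two literal percent signs. */
    return sscanf(line, "%%%%BoundingBox: %d %d %d %d",
                  &box[0], &box[1], &box[2], &box[3]) == 4;
}
```

Feed it each header line in turn; lines that are not a bounding box comment simply return 0 and can be ignored.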
RTF (Rich Text)
The Rich Text file format has a structure to it, with opening and closing braces, { and }, to delineate sections. It is suitable for whole files but streams poorly, and syntax errors have wide-ranging side effects.
SGML / HTML
Embedded tags in a flow of text. The tags imply mode changes that are sticky until the tags are closed.
How Dare I?
Many of the file formats I've designed have been in use for over a decade. Most have been through multiple revision levels and are backward- and forward-compatible (you can read an iMovie 4 project into iMovie 1, though it obviously won't understand and preserve all of what's in there).
The whole reason I've established this site is to call attention to the Emperor's New File Format and spark a conversation about information design. XML is not a good file format, yet it is widely used. Let's come up with something better.
XML is not extensible
Extensible means (to me) that it can be extended beyond its original design scope by adding new mechanisms.
I claim that this is not the case. XML has pre-defined syntax (begin/end tags with attributes that can be set within a tag). You can define any tags you want, and add any attributes you want, but that's not extensibility; it's in the original design.
There's no way I can see to extend the format without rewriting all the existing XML parsers.
XML is not a markup language
I am one of the world's leading experts on markup languages. I'll start there. I'm a 20-year veteran of desktop publishing, am personally related to the author of one of the very first markup languages in the world (Scribe), and have actually used SGML, MML, HTML, and most of the other markup languages that came along decades before XML.
So I know what I'm talking about. XML is not a markup language.
A markup language is predicated on the idea that the markup is an exception in a river of text. That is, the markup is a departure from the state that existed at the time the markup was encountered.
One of the first instances of this was the TROFF mechanism in UNIX, used for formatting "man pages". A simple example was that a line that started with .I was set in italics. So you might format a sentence with an italic phrase in it like this:
Here is an
.I emphasized phrase
and back to normal text
The same basic approach is used in HTML, except that it's not line-oriented, so you need a "close delimiter" other than carriage return (which is actually a pretty handy closing delimiter, but I digress). So the same thing in HTML is:
Here is an <i>emphasized phrase</i> and back to normal text.
The idea of markup is that you literally mark up a text, "circling" things, if you will, giving instructions to the typesetter (or parser, or other) that this snippet of text is to be treated somehow differently.
Another tenet of a markup language is that only the syntax is specified. What the markup means is implicit (HTML) or described earlier (Scribe) or some combination of the two (CSS).
But here's the real kicker: a pure ASCII text file is a valid example of any markup language. That underscores the notion that the markup is a departure from the river of text. So a plain text file is technically a valid HTML file (though they ruined that purity with XHTML and CSS by requiring tags in it, but that's because they too didn't really know what a markup language was).
Heavyweight Parser
Any data contained in a file needs to be "parsed" back out. You open the file, you read it in, recognizing the file format attributes along the way, and look for what you need.
XML parsers are "fully general", in that they know how to recognize tags in general, and pull out the data in between, but they don't know what the data are all about. They're fairly big beasts, consume memory, take time to initialize, and you can't just whip one up yourself in an hour or two.
Furthermore, you either have to teach it how to extract the one piece of data you want, or have it read the whole thing in (as in the MacOS X parser, which gives you an NSDictionary), pick out your data, and throw the rest away. That is a very expensive and time-consuming operation, and it fails silently (and often) if there's anything amiss in the data itself.
By contrast, a line-oriented file format can be parsed with five lines of code, using "fgets" and "sscanf" to look for the data you need, and you can skip anything that's not interesting. Very, very fast, zero memory use, and no overhead.
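Here is roughly what I mean, as a sketch rather than a finished parser. The "key: value" format and the function name find_int_field are assumptions made up for this example; a few more than five lines once you add the declarations, but not many.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical format (an assumption for illustration): one
   "key: value" pair per line. Scans the stream for `key`, stores its
   integer value in *out, and returns 1 if found, 0 otherwise.
   Lines that don't match are skipped. */
int find_int_field(FILE *fp, const char *key, int *out)
{
    char line[256];
    char name[64];
    int value;

    assert(fp != NULL && key != NULL && out != NULL);
    while (fgets(line, sizeof line, fp) != NULL) {
        /* %63[^:] reads the key up to the colon, bounded to fit `name`. */
        if (sscanf(line, "%63[^:]: %d", name, &value) == 2 &&
            strcmp(name, key) == 0) {
            *out = value;
            return 1;
        }
    }
    return 0;
}
```

The only memory in play is one fixed line buffer, and the loop stops at the first match, so a reader that wants one field never touches the rest of the file. Lines longer than the buffer would be read in pieces, which a real reader would need to handle.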
So think carefully about who will be reading the data, and why, and design a file format that suits their needs. My bet is that 8 times out of 10, XML is not the right format.
XML as a "container"
There is one big problem with XML as a container. Its syntax, which is borrowed from HTML and SGML, involves angle brackets and a begin/end paradigm. The problem with this is that you can't embed similar data inside the XML file without escaping all the angle brackets. That gets messy very fast. It is also impossible to nest to arbitrary depth: you can't have an XML file that contains an XML file that contains an HTML file without knowing beforehand how many times to un-escape the data when parsing it.
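To see how the escaping piles up, here is a small sketch in C. The helper xml_escape is hypothetical, written for this example, and it only handles the three characters &, < and >; a real escaper would also have to deal with quotes in attribute values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `in` to `out`, replacing &, < and > with
   XML entities. `cap` is the size of `out` including the terminator.
   Returns 1 on success, 0 if `out` is too small. */
int xml_escape(const char *in, char *out, size_t cap)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (cap == 0)
        return 0;
    for (; *in != '\0'; in++) {
        char one[2] = { *in, '\0' };
        const char *rep;
        size_t len;

        switch (*in) {
        case '&': rep = "&amp;"; break;
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        default:  rep = one;     break;
        }
        len = strlen(rep);
        if (n + len + 1 > cap)
            return 0;               /* would overflow the output buffer */
        memcpy(out + n, rep, len);
        n += len;
    }
    out[n] = '\0';
    return 1;
}
```

Run <i>x</i> through it once and you get &lt;i&gt;x&lt;/i&gt;. Run that result through again, as you would when wrapping it in a second layer, and you get &amp;lt;i&amp;gt;x&amp;lt;/i&amp;gt;. Nothing in the escaped text itself records how many layers were applied.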
It also makes it essentially impossible to embed binary data in an XML file because you can't know whether or not to escape the XML sequences within the binary data (you should NOT, if the binary data is to be respected).
This is a classic problem with file formats which require parsing of the data and in which the delimiters themselves might be embedded. You have to recognize nested delimiters and/or escape them.
There are many other approaches to file formats which might have been better choices. For example, instead of a begin/end paradigm, specifying type and length data allows unambiguous parsing. It is not, however, easy to compose by hand, which is probably why it's not used.
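A type-and-length reader can be sketched in a few lines of C. The record layout here (one type byte, a four-byte big-endian length, then the payload) and the names struct tlv and tlv_next are assumptions chosen for illustration, not a description of any particular format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout (an assumption for illustration):
   1 byte type, 4 byte big-endian length, then `length` payload bytes. */
struct tlv {
    uint8_t type;
    uint32_t length;
    const uint8_t *value;   /* points into the caller's buffer */
};

/* Reads the record at buf[*pos] and advances *pos past it.
   Returns 1 on success, 0 if the buffer ends before the record does. */
int tlv_next(const uint8_t *buf, size_t size, size_t *pos, struct tlv *out)
{
    size_t p;
    uint32_t len;

    assert(buf != NULL && pos != NULL && out != NULL);
    p = *pos;
    if (p > size || size - p < 5)
        return 0;                   /* no room for the 5-byte header */
    len = ((uint32_t)buf[p + 1] << 24) | ((uint32_t)buf[p + 2] << 16) |
          ((uint32_t)buf[p + 3] << 8)  |  (uint32_t)buf[p + 4];
    if (size - p - 5 < len)
        return 0;                   /* payload is truncated */
    out->type = buf[p];
    out->length = len;
    out->value = buf + p + 5;
    *pos = p + 5 + len;
    return 1;
}
```

The payload is never inspected, so it can hold angle brackets, binary data, or another whole file in the same format without any escaping, and a reader can skip a record it doesn't care about by jumping ahead by its length.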
Another approach is to simply have characters that are considered illegal in a data stream, and use those as delimiters. This is how C strings are represented (the illegal character is a byte with value 0): they're called null-terminated strings. This approach has been used widely for decades and has its advantages.
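As a sketch of the same idea applied to a file rather than a single string, here is a C function that picks a field out of a buffer of fields packed back to back, each ending in a zero byte. The name nth_field and the packed layout are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the index-th field in a
   buffer of fields that each end with a 0 byte, or NULL if there are
   fewer fields (or the trailing data is unterminated). */
const char *nth_field(const char *buf, size_t size, size_t index)
{
    size_t pos = 0;

    assert(buf != NULL);
    while (pos < size) {
        const char *end = memchr(buf + pos, '\0', size - pos);

        if (end == NULL)
            return NULL;            /* unterminated trailing data */
        if (index == 0)
            return buf + pos;
        index--;
        pos = (size_t)(end - buf) + 1;
    }
    return NULL;
}
```

A field can contain angle brackets or anything else except the delimiter byte itself, which is the trade-off of this approach.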
The bottom line is that syntactically XML is not a particularly good choice as a container format, and yet that is how it is most often used.
What started it all...
Thus was born the idea to create a blog devoted to what's wrong with XML. I'm not sure how much growth there will be in it, but (not surprisingly) the URL "xmlsucks.com" was available, so I jumped on it, shall we say.
Welcome to my rant.