Monday, December 11, 2006

File Formats

Scribe


According to Wikipedia's entry on markup languages, Scribe was the first markup language to make the distinction between structure and presentation.

Scribe was my brother Brian's thesis project at Carnegie Mellon, way back before anybody had ever heard of any of this stuff (the project was begun in 1976). It was later productized by Unilogic (around 1985). Scribe used syntax like @b(phrase) to mark bold text, for example, or, more importantly, @head(Heading), which decoupled the semantic concept of a "heading" from specific font/size details. There was also a concept of style sheets, so you could define what "italic" meant in a separate place.
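
To give a flavor of it, here is a small sample reconstructed from the description above (my sketch, not verbatim Scribe source):

@head(A Short History)
Here is a paragraph with a @b(bold phrase) and an @i(italic one).
The heading above says nothing about fonts or point sizes; the
style sheet decides that elsewhere.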

Scribe had output drivers for troff, some plotters, and laser printers (one of the first PostScript drivers in the world was coupled with Scribe, and in fact the first two Adobe books, the "red book" and "blue book", were typeset with Scribe). Few remnants of Scribe remain, though I found an Internet RFC document type for Scribe documents (from 1991) and an old PostScript driver optimization case study from 1992 (oddly, written by yours truly).

Some more history on Scribe is here and here.


TeX / LaTeX


As Steve Hirsch noted in a comment on this blog (before I moved it and lost the comments), I forgot TeX.

TeX, I think, came after Scribe (I should check my history here, but I'm too lazy). TeX was invented by Don Knuth at Stanford to help solve the problem of typesetting mathematics, which was (and still is) very hard to do. Coupled with Leslie Lamport's LaTeX macros (which were modeled on Scribe), it is a very powerful markup language, specific to typesetting, as many of the early markup languages were.

More on TeX at the User's Group link.


PDF (Portable Document Format)


A whole book could be written about PDF. In fact, one has. Several.


It's powerful, I guess, but it sure is complicated. PDF would have been successful 10 years earlier if reading and writing the format had been easier. Even the commercial libraries that purport to import/read PDF files don't work very well, for the most part.


Part of this is the richness of the imaging model supported by PDF. But not all of it. There are too many options, too many compression schemes, a binary form, a non-binary form...


Enough said.

EPSF (Encapsulated PostScript)


EPSF is a file format that I designed myself, back in about 1987, when I ran Adobe's Developer Program, yet I will take potshots at it, for the sake of argument.


PostScript is (was?) a programming language, and as such, didn't make for a great file format. But there was a strong need to include PostScript "clip art" into larger pages, composed by PageMaker and all the page layout apps that followed.


Since PageMaker and the rest could not be expected to interpret the PostScript, a separate set of metadata accompanied the PostScript file to allow it to be "placed". The metadata included a bitmap preview of the graphic (so it could be placed in a relatively WYSIWYG way), plus bounding box information, font information, etc.


This extra metadata was embedded in the header of the file with special comment syntax, like this:


%%BoundingBox: 0 0 612 792


A line-oriented file format: easy to parse, easy to use, but somewhat error-prone. It's been in continuous use for 18 years, so it can't be completely broken, I suppose.
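
For context, a typical EPS header carried a handful of these comments. The details varied by creator application, but the shape was roughly this (a sketch, not any particular file):

%!PS-Adobe-3.0 EPSF-3.0
%%BoundingBox: 0 0 612 792
%%Title: clip-art.eps
%%Creator: SomeDrawingApp
%%DocumentFonts: Times-Roman
%%EndComments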

RTF (Rich Text)


The Rich Text Format structures a file with open/close { } braces that delineate sections. It's suitable for whole files, but it streams poorly, and syntax errors have wide side-effects.
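
A minimal, hand-written sketch of the shape (illustrative, not a complete RTF document):

{\rtf1\ansi
Here is plain text, {\b a bold phrase}, and plain text again.
}

Drop one closing brace and every mode change after that point is wrong, which is what I mean by wide side-effects.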

SGML / HTML


Embedded tags in a flow of text. The tags imply mode changes that are sticky until the tags are closed.

How Dare I?

I have designed, and parsed, quite a large number of file formats over my 30-year career: everything from line-oriented formats like troff, to Scribe, SGML, MML, etc. I designed Adobe's PPD and EPS file formats 18 years ago or so. iMovie's project format. iPhoto's albumlist (which, in fact, is in XML format). Dozens of little things in between.

Many of the file formats I've designed have been in use for over a decade. Most have been through multiple revision levels and are backward- and forward-compatible (you can read an iMovie 4 project into iMovie 1, though it obviously won't understand and preserve all of what's in there).

The whole reason I've established this site is to call attention to the Emperor's New File Format and spark a conversation about information design. XML is not a good file format, yet it is widely used. Let's come up with something better.

XML is not extensible

XML does not deserve the "X" in its name.


Extensible means (to me) that it can be extended beyond its original design scope by adding new mechanisms.


I claim that this is not the case. XML has a pre-defined syntax (begin/end tags, with attributes that can be set within a tag). As such, you can define any tags you want, and add any attributes you want, but that's not extensibility; it's in the original design.


There's no way I can see to extend the format without rewriting all the existing XML parsers.

XML is not a markup language

XML does not deserve its "ML", or even its "X". But first, the "ML" part.


I am one of the world's leading experts on markup languages. I'll start there. I'm a 20-year veteran of desktop publishing, am personally related to the author of one of the very first markup languages in the world (Scribe), and have actually used SGML, MML, HTML, and most of the other markup languages that came along decades before XML.


So I know what I'm talking about. XML is not a markup language.


A markup language is predicated on the idea that the markup is an exception in a river of text. That is, the markup is a departure from the state that existed at the time the markup was encountered.


One of the first instances of this was the troff mechanism in UNIX, used for formatting "man pages". A simple example was that a line that started with .i was italic. So you might format a sentence with an italic word in it like this:


Here is an
.i emphasized phrase
and back to normal text

The same basic approach is used in HTML, except that it's not line-oriented, so you need a "close delimiter" other than carriage return (which is actually a pretty handy closing delimiter, but I digress). So the same thing in HTML is:


Here is an <i>emphasized phrase</i> and back to normal text.

The idea of markup is that you literally mark up a text, "circling" things, if you will, giving instructions to the typesetter (or parser, or other) that this snippet of text is to be treated somehow differently.


Another tenet of a markup language is that only the syntax is specified. The semantics of what the markup means are implicit (HTML), described earlier (Scribe), or some combination of the two (CSS).


But here's the real kicker: a pure ASCII text file is a valid example of any markup language. That underscores the notion that the markup is a departure from the river of text. So a plain text file is technically a valid HTML file (though they ruined that purity with XHTML and CSS by requiring tags in it, but that's because they too didn't really know what a markup language was).

Heavyweight Parser

The contents of XML files vary a lot, of course. And the need to parse them varies accordingly. But a fairly common scenario is to "need just one piece of data" that's contained in an XML file somewhere. How do you get it?


Any data contained in a file needs to be "parsed" back out. You open the file, you read it in, recognizing the file format attributes along the way, and look for what you need.


XML parsers are "fully general", in that they know how to recognize tags in general, and pull out the data in between, but they don't know what the data are all about. They're fairly big beasts, consume memory, take time to initialize, and you can't just whip one up yourself in an hour or two.


Furthermore, you have to teach it how to extract the one piece of data you want, or else read the whole thing in (as in the Mac OS X parser, which gives you an NSDictionary), pick out your data, and throw the whole thing away. It's a very expensive, time-consuming operation, and it fails silently (and often) if there's anything amiss in the data itself.


By contrast, a line-oriented file format can be parsed with five lines of code, using "fgets" and "sscanf" to look for the data you need, and you can skip anything that's not interesting. Very, very fast, zero memory use, and no overhead.
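
To be concrete, here's roughly what it takes to pull the %%BoundingBox out of an EPS header like the one shown earlier (a sketch of the technique, not production code):

#include <stdio.h>

/* Scan a file's header comments for a %%BoundingBox line.
   Returns 1 on success, 0 if the comment isn't found. */
int read_bbox(const char *path, int *llx, int *lly, int *urx, int *ury)
{
    char line[256];
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        if (sscanf(line, "%%%%BoundingBox: %d %d %d %d",
                   llx, lly, urx, ury) == 4) {
            fclose(fp);
            return 1;
        }
    }
    fclose(fp);
    return 0;
}

No parser object, no tree in memory, and any line the code doesn't recognize is simply skipped.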


So think carefully about who will be reading the data, and why, and design a file format that suits their needs. My bet is that 8 times out of 10, XML is not the right format.

XML as a "container"

XML is most often used as a kind of container to hold structured data of some kind. The semantic nature of the data is not defined by XML itself, but typically is carried separately as a data definition or simply by being programmed into the model itself, which is the more common approach (e.g. "this XML file contains preference data" or "this XML file contains a Technorati Ping").


There is one big problem with XML as a container. Its syntax, borrowed from HTML and SGML, involves angle brackets and a begin/end paradigm. The problem is that you can't embed similar data inside the XML file without escaping all the angle brackets, and that gets messy very fast. It is also impossible to nest to arbitrary depth: you can't have an XML file that contains an XML file that contains an HTML file without knowing beforehand how many times to un-escape the data when parsing it.
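
To see the problem, watch a single tag as it gets embedded one level deep, then two:

<b>hello</b>                                the original
&lt;b&gt;hello&lt;/b&gt;                    embedded once
&amp;lt;b&amp;gt;hello&amp;lt;/b&amp;gt;    embedded twice

Nothing in the data itself says how many rounds of un-escaping to apply; that knowledge has to live outside the file.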


It also makes it essentially impossible to embed binary data in an XML file because you can't know whether or not to escape the XML sequences within the binary data (you should NOT, if the binary data is to be respected).


This is a classic problem with file formats which require parsing of the data and in which the delimiters themselves might be embedded. You have to recognize nested delimiters and/or escape them.


There are many other approaches to file formats that might have been better choices. For example, instead of a begin/end paradigm, specifying type and length data allows unambiguous parsing. It is not, however, easy to compose by hand, which is probably why it isn't used more often.
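
By "type and length" I mean something like the classic tag-length-value record. Here is a sketch of the reading side (my illustration, not any particular format, ignoring byte-order and alignment concerns):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* A tag-length-value record header. The length field tells the
   parser exactly how many payload bytes follow, so the payload
   never needs escaping -- it can hold raw binary, or more records. */
struct tlv_header {
    uint32_t tag;     /* what kind of record this is */
    uint32_t length;  /* how many payload bytes follow */
};

/* Skip records until the wanted tag appears; return its payload
   (malloc'd, caller frees) or NULL if it isn't found. */
void *find_record(FILE *fp, uint32_t wanted, uint32_t *len_out)
{
    struct tlv_header h;
    while (fread(&h, sizeof h, 1, fp) == 1) {
        if (h.tag == wanted) {
            void *payload = malloc(h.length);
            if (payload != NULL &&
                fread(payload, 1, h.length, fp) == h.length) {
                *len_out = h.length;
                return payload;
            }
            free(payload);
            return NULL;
        }
        fseek(fp, (long)h.length, SEEK_CUR);  /* skip this payload */
    }
    return NULL;  /* tag not found */
}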


Another approach is to simply have characters that are considered illegal in a data stream, and use those as delimiters. This is how C strings are represented (the illegal character is a byte with value 0): they're called null-terminated strings. This approach has been used widely for decades and has its advantages.
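
A record-separator variant of the same idea: reserve one byte, say the ASCII Record Separator (0x1E), declare it illegal inside field data, and splitting becomes trivial (sketch):

#include <stdio.h>
#include <string.h>

/* Split a NUL-terminated buffer on the 0x1E record separator.
   Because 0x1E can never appear inside a field, no escaping
   or nested-delimiter logic is needed. */
void split_fields(char *buffer)
{
    char *p = buffer;
    char *sep;
    while ((sep = strchr(p, 0x1E)) != NULL) {
        *sep = '\0';               /* terminate this field in place */
        printf("field: %s\n", p);
        p = sep + 1;
    }
    printf("field: %s\n", p);      /* the last field */
}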


The bottom line is that syntactically XML is not a particularly good choice as a container format, and yet that is how it is most often used.

What started it all...

I first blogged about XML a while ago and started to catch grief about this unpopular point of view. I've been defending it more and more over the past few months. Yesterday I was on a panel with Steve Gillmor, who has an initiative entitled "Attention.xml", so naturally he wanted to give me grief about it as well.


Thus was born the idea to create a blog devoted to what's wrong with XML. I'm not sure how much growth there will be in it, but (not surprisingly) the URL "xmlsucks.com" was available, so I jumped on it, shall we say.


Welcome to my rant.