Monday, December 11, 2006

File Formats


According to Wikipedia's entry on Markup languages, Scribe is the first markup language to make the distinction between structure and presentation.

Scribe was my brother Brian's thesis project at Carnegie Mellon way back before anybody had ever heard of any of this stuff (the project was begun in 1976). It was later productized by Unilogic (around 1985). Scribe used syntax like @b(phrase) to mark bold text, for example, or more importantly, @head(Heading) which decoupled the semantic concept of a "heading" from specific font/size details. There was also a concept of style sheets so you could define what "italic" meant in a separate place.
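To make the structure-versus-presentation idea concrete, here is a minimal sketch in Python. It is not Scribe's actual implementation: it only handles the simple @name(text) form mentioned above, with no nesting, and the style sheet entries (HTML-ish tags) are purely my own illustration.

```python
import re

# Hypothetical style sheet: maps a semantic command name to the
# presentation wrapped around its text. Change it here and every
# use of the command changes with it.
STYLE_SHEET = {
    "b": ("<b>", "</b>"),
    "i": ("<i>", "</i>"),
    "head": ("<h1>", "</h1>"),
}

# Matches @name(text) where text contains no parentheses (no nesting).
COMMAND = re.compile(r"@(\w+)\(([^()]*)\)")


def render(source, styles=STYLE_SHEET):
    """Replace each @name(text) with the presentation the style sheet defines."""
    def replace(match):
        name, text = match.group(1), match.group(2)
        if name not in styles:
            return match.group(0)  # leave unknown commands untouched
        before, after = styles[name]
        return before + text + after

    return COMMAND.sub(replace, source)
```

Calling render("@head(Intro)") with a different style sheet, say {"head": ("== ", " ==")}, changes how the heading looks without touching the document itself, which is the whole point.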

Scribe had output drivers for troff, some plotters, and laser printers (one of the first PostScript drivers in the world was coupled with Scribe, and in fact the first two Adobe books, the "red book" and "blue book", were typeset with Scribe). There are few remnants of Scribe remaining, though I found an Internet RFC document type for Scribe documents (from 1991) and an old PostScript driver optimization case study from 1992 (oddly, written by yours truly).

Some more history on Scribe is here and here.

TeX / LaTeX

As Steve Hirsch noted in a comment on this blog (before I moved it and lost the comments), I forgot TeX.

TeX, I think, came after Scribe (I should check my history here but I'm too lazy). TeX was invented by Don Knuth at Stanford to help solve the problem of typesetting mathematics, which was (and still is) very hard to do. Coupled with Leslie Lamport's LaTeX macros (which were modeled on Scribe) it is a very powerful markup language, specific to typesetting, as many of the early markup languages were.

More on TeX at the User's Group link.

PDF (Portable Document Format)

A whole book could be written about PDF. In fact, one has. Several.

It's powerful, I guess, but it sure is complicated. PDF would have been successful 10 years earlier if reading and writing the format were easier. Even the commercial libraries that purport to import/read PDF files don't work very well, for the most part.

Part of this is the richness of the imaging model supported by PDF. But not all of it. There are too many options, too many compression schemes, a binary form, a non-binary form...

Enough said.

EPSF (Encapsulated PostScript)

EPSF is a file format that I designed myself, back in about 1987, when I ran Adobe's Developer Program, yet I will take potshots at it, for the sake of argument.

PostScript is (was?) a programming language, and as such, didn't make for a great file format. But there was a strong need to include PostScript "clip art" into larger pages, composed by PageMaker and all the page layout apps that followed.

Since PageMaker and the rest could not be expected to interpret the PostScript, there was a separate set of metadata that accompanied the PostScript file that allowed it to be "placed". The metadata included a bitmap preview of the graphic (so it could be placed in a relatively WYSIWYG way), plus bounding box information, font information, etc.

This extra metadata was embedded in the header of the file with special comment syntax, like this:

%%BoundingBox: 0 0 612 792

It's a line-oriented file format: easy to parse, easy to use, but somewhat error-prone. It's been in continuous use for 18 years, so it can't be completely broken, I suppose.
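Here is roughly how little code it takes to pull that metadata out, as a Python sketch. The function names are mine, and it simplifies the real rules: it stops at %%EndComments or at the first line that doesn't start with a percent sign, and it only accepts four integers for the bounding box.

```python
def parse_header_comments(lines):
    """Collect '%%Key: value' comments from the top of a PostScript file."""
    comments = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("%%EndComments"):
            break
        if not line.startswith("%"):
            break  # simplification: header ends at the first non-comment line
        if line.startswith("%%") and ":" in line:
            key, _, value = line[2:].partition(":")
            comments[key.strip()] = value.strip()
    return comments


def bounding_box(comments):
    """Return (llx, lly, urx, ury) as ints, or None if missing or malformed."""
    parts = comments.get("BoundingBox", "").split()
    if len(parts) != 4:
        return None
    try:
        return tuple(int(p) for p in parts)
    except ValueError:
        return None
```

Feeding it the single line "%%BoundingBox: 0 0 612 792" gives back the four numbers as a tuple. The error-prone part shows up in what the sketch has to guard against: missing values, non-numeric values, and comments that appear where they shouldn't.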

RTF (Rich Text)

The Rich Text file format is structured with opening and closing braces, { and }, that delineate sections. It is suitable for whole files but streams poorly, and a syntax error can have wide side effects.
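A small Python sketch shows why one stray brace matters so much. It only counts brace depth, treating a backslash plus the next character as an escape, and it ignores everything else about RTF (including embedded binary data), so take it as an illustration rather than a validator. The function names are mine.

```python
def group_depths(rtf):
    """Return (final brace depth, lowest depth seen) for a piece of RTF text."""
    depth = 0
    lowest = 0
    i = 0
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\" and i + 1 < len(rtf):
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            lowest = min(lowest, depth)
        i += 1
    return depth, lowest


def is_balanced(rtf):
    """True only if every opening brace is closed and none closes early."""
    depth, lowest = group_depths(rtf)
    return depth == 0 and lowest == 0
```

Note that you can't answer the question until you've seen the last character, which is the streaming problem in a nutshell: drop one closing brace and everything after it lands in the wrong group.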


Embedded tags in a flow of text. The tags imply mode changes that are sticky until the tags are closed.


  1. Hey, I appreciate you correcting the record! Classy!

  2. Oh, here's an article I wrote about XML. I originally wrote it in 1999, it's amazing how little has changed: