Monday, December 11, 2006

Heavyweight Parser

The contents of XML files vary a lot, of course. And the need to parse them varies accordingly. But a fairly common scenario is to "need just one piece of data" that's contained in an XML file somewhere. How do you get it?


Any data contained in a file needs to be "parsed" back out. You open the file, you read it in, recognizing the file format attributes along the way, and look for what you need.


XML parsers are "fully general", in that they know how to recognize tags in general, and pull out the data in between, but they don't know what the data are all about. They're fairly big beasts, consume memory, take time to initialize, and you can't just whip one up yourself in an hour or two.


Furthermore, you either have to teach the parser how to extract the one piece of data you want, or read the whole thing in (as with the Mac OS X parser, which gives you an NSDictionary), pick out your data, and throw the rest away. That's a very expensive and time-consuming operation, and it fails silently (and often) if there's anything amiss in the data itself.


By contrast, a line-oriented file format can be parsed with five lines of code, using "fgets" and "sscanf" to look for the data you need, and you can skip anything that's not interesting. Very, very fast, with no memory use beyond a single line buffer, and essentially no overhead.
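To make that concrete, here is a minimal sketch of the fgets/sscanf approach. It assumes a made-up format with one "key value" pair per line (for example "width 640"); the function name find_int and the buffer sizes are purely illustrative, not taken from any real file format or library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical line-oriented format: one "key value" pair per line,
 * e.g. "width 640". Scans fp line by line and stores the integer that
 * follows the first line whose key matches. Returns 1 if found, 0 if not.
 * Assumption: lines fit in 255 characters; a longer line would be read
 * in pieces by fgets, and each piece treated as a separate line.
 */
int find_int(FILE *fp, const char *key, int *out)
{
    char line[256];
    char name[64];
    int value;

    assert(fp != NULL && key != NULL && out != NULL);

    while (fgets(line, sizeof line, fp) != NULL) {
        /* %63s keeps the key within name[]; lines that don't parse
           as "word integer" are simply skipped. */
        if (sscanf(line, "%63s %d", name, &value) == 2 &&
            strcmp(name, key) == 0) {
            *out = value;
            return 1;
        }
    }
    return 0;
}
```

Calling find_int(fp, "height", &h) on an open file returns as soon as the matching line turns up, and comment lines or anything else that doesn't look like "word integer" are ignored. The only memory involved is the two small stack buffers.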


So think carefully about who will be reading the data, and why, and design a file format that suits their needs. My bet is that 8 times out of 10, XML is not the right format.

2 comments:

  1. But don't fgets and sscanf parse the file within their implementation, just as an XML parser does? Would you write your own fgets or sscanf implementation? Your argument only seems to hold water at your level of abstraction. If you want to go lower down the layers of abstraction, I can put forward the exact same arguments against your solutions as you have against XML.

  2. Sorry for posting manifestations of complete disagreement to 60% of the posts I read on your site...

    What you said is absolutely true for a DOM parser. But there is more to XML parsing, like the old (IMHO awful) SAX parser, and the newer and much easier StAX / pull parser. Both read your document with a constant memory footprint and without any initialization worth mentioning.

    If you want to extract a small piece of information from a large XML document, how about using XPath or XQuery (or maybe XSLT to format your output)? Yeah, that means learning some more languages and using the appropriate libraries that can handle them, but it's not too hard.

    If you really want to implement a simple XML parser yourself, this can clearly be done in an hour. I did this in PHP about 5 years ago, before I realized that there already is an XML parser for practically every language, and it's easier to use one than to rewrite it. Of course, my parser didn't honor namespaces correctly and was clueless about CDATA sections, but hey...

    I've written parsers for an uncountable number of non-XML-formats and text based Internet protocols, and this turned out to be much harder than parsing XML because of nasty little details and exceptions which are also nasty in XML, but much nastier in every other language.
