Monday, December 11, 2006

XML as a "container"

XML is most often used as a container for structured data. XML itself does not define what the data means. That meaning is either carried separately in a data definition or, more commonly, simply programmed into the application that reads the file (e.g. "this XML file contains preference data" or "this XML file contains a Technorati Ping").


There is one big problem with XML as a container. Its syntax, which is borrowed from HTML and SGML, uses angle brackets and a begin/end tag paradigm. As a result, you can't embed similar data inside an XML file without escaping all the angle brackets, and that gets messy very fast. It is also impossible to nest to arbitrary depth: you can't have an XML file that contains an XML file that contains an HTML file without knowing beforehand how many times to un-escape the data when parsing it.
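To make the escaping problem concrete, here is a minimal Python sketch. The `wrap` helper and the tag names `inner` and `outer` are made up for illustration; only `escape` from the standard library's `xml.sax.saxutils` and the `xml.etree.ElementTree` parser are real. It wraps an HTML fragment in XML twice and shows that the reader has to parse exactly as many times as the writer wrapped.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap(tag: str, payload: str) -> str:
    """Embed payload as escaped text inside a single element (hypothetical helper)."""
    return "<" + tag + ">" + escape(payload) + "</" + tag + ">"


html = "<p>Hello & welcome</p>"
inner = wrap("inner", html)    # HTML escaped once
outer = wrap("outer", inner)   # the same HTML is now escaped twice

print(outer)

# Each parse undoes exactly one level of escaping, so the reader must know
# in advance that there are two levels here.
level_one = ET.fromstring(outer).text
level_two = ET.fromstring(level_one).text
```

Nothing in `outer` itself says how many levels it holds. That knowledge has to live in the program doing the parsing.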


It also makes it essentially impossible to embed raw binary data in an XML file, because you can't know whether or not to escape the XML sequences within the binary data (you should NOT, if the binary data is to be preserved byte for byte).
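A related symptom is easy to reproduce. The sketch below assumes Python's standard `xml.etree.ElementTree` parser; the `parses_as_xml` helper and the `blob` element name are invented for the example. It drops a few raw bytes, including control bytes that are not legal XML characters, straight between two tags and checks whether the result still parses.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(document: bytes) -> bool:
    """Return True if the parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Plain text between the tags is fine.
text_doc = b"<blob>GIF</blob>"

# Raw bytes with control characters (0x01, 0x02) are not legal XML content,
# so the parser rejects the document before escaping even comes into play.
raw = b"GIF\x01\x02"
binary_doc = b"<blob>" + raw + b"</blob>"

print(parses_as_xml(text_doc), parses_as_xml(binary_doc))
```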


This is a classic problem with file formats which require parsing of the data and in which the delimiters themselves might be embedded. You have to recognize nested delimiters and/or escape them.


There are many other approaches to file formats which might have been better choices. For example, instead of a begin/end paradigm, prefixing each item with its type and length allows unambiguous parsing. Such a format is not, however, easy to compose by hand, which is probably why that approach was not taken.
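As a rough sketch of the type-and-length idea, the Python below invents a tiny record layout: a 4-byte type tag, a 4-byte big-endian length, then the raw payload. The layout and the names `encode_record` and `decode_records` are assumptions made for this example, not any existing standard.

```python
import struct


def encode_record(type_tag: bytes, payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte type tag and a 4-byte big-endian length."""
    if len(type_tag) != 4:
        raise ValueError("type tag must be exactly 4 bytes")
    return type_tag + struct.pack(">I", len(payload)) + payload


def decode_records(buf: bytes) -> list:
    """Split a buffer into (type_tag, payload) pairs without inspecting payloads."""
    records = []
    offset = 0
    while offset < len(buf):
        if len(buf) - offset < 8:
            raise ValueError("truncated header")
        type_tag = buf[offset:offset + 4]
        (length,) = struct.unpack_from(">I", buf, offset + 4)
        start = offset + 8
        payload = buf[start:start + length]
        if len(payload) != length:
            raise ValueError("truncated payload")
        records.append((type_tag, payload))
        offset = start + length
    return records


# Nesting needs no escaping: a record is just bytes inside another record.
html_record = encode_record(b"HTML", b"<p>Hello & welcome</p>")
xml_record = encode_record(b"XML ", html_record)
binary_record = encode_record(b"BIN ", bytes(range(256)))

decoded = decode_records(xml_record + binary_record)
```

The parser never looks inside a payload, so angle brackets, ampersands, and arbitrary bytes pass through untouched, at any nesting depth.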


Another approach is to simply have characters that are considered illegal in a data stream, and use those as delimiters. This is how C strings are represented (the illegal character is a byte with value 0): they're called null-terminated strings. This approach has been used widely for decades and has its advantages.
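A minimal sketch of the reserved-delimiter idea in Python, using the zero byte as the terminator just as C strings do. The function names `join_fields` and `split_fields` are made up for the example. It also shows the cost of this scheme: a field that contains the delimiter cannot be stored at all.

```python
DELIMITER = b"\x00"  # the one byte value that may never appear inside a field


def join_fields(fields: list) -> bytes:
    """Terminate each field with a zero byte, rejecting fields that contain one."""
    for field in fields:
        if DELIMITER in field:
            raise ValueError("field contains the reserved delimiter byte")
    return b"".join(field + DELIMITER for field in fields)


def split_fields(buf: bytes) -> list:
    """Recover the fields; every field, including the last, must be terminated."""
    if buf and not buf.endswith(DELIMITER):
        raise ValueError("last field is not terminated")
    return buf.split(DELIMITER)[:-1]


# Angle brackets and ampersands need no escaping in this scheme.
stream = join_fields([b"<a>", b"x & y"])
fields = split_fields(stream)
```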


The bottom line is that syntactically XML is not a particularly good choice as a container format, and yet that is how it is most often used.

6 comments:

  1. Hi - the technique to encode files posted here ->http://www.stylusstudio.com/binary_xml.html# seems to illustrate a way to store binary files in XML.

  2. XML can store binary data perfectly fine utilizing either Hex, Base64, or Huffman encoding. I have written a small Java library that demonstrates this using Base64. Next time do some research before slamming something!

  3. Yup, base64 is a very effective hack for storing binary data inside hostile formats. It was, after all, designed for transporting binaries through EBCDIC<->ASCII email translation filters.

  4. Just to prove how effective it is: copy and paste this through Linux's uudecode command ...

    begin-base64 644 smile1.gif
    R0lGODlhMAAwAPEDAAAAAN0JB/vzBf///yH5BA0KAAMALAAAAAAwADAAAAL+
    3Lhx48aNGzdu3Lhx48aNGzdu3Lhx48aNGzdu3Lhx48aNGzdu3Lhx48aNGzdu
    3Lhx48aNGzdu3Lhx48aNGzdu3LhRokSJGjdu3Lhx48aNGzdu3Lhx48aNGzdu
    lChRokSJEiVq3Lhx48aNGzdu3Lhx48aNGzdKlChRokSJEiVK1Lhx48aNGzdu
    3Lhx48aNGyVKlChRokSJEiVKlKhx48aNGzdu3Lhx48aNEiVKlChRokSJEiVK
    lChR48aNGzdu3Lhx48aJEiVKlChRokSJEiVKlChR4saNGzdu3Lhx40aJEiVK
    lChRokSJEiVKlChRosaNGzdu3Lhx40SJEiUChCj+UaJEiRIlAoQoUaLEjRs3
    bty4caNEiRIFAgQoUaJEiRIFAgQoUaJEjRs3bty4caNEiRIBAgQIUaJEiRIB
    AgQIUaJEjRs3bty4caJEiRIBAgQIUaJEiRIBAgQIUaJEiRs3bty4caJEiRIB
    AgQIUaJEiRIBAgQIUaJEiRs3bty4UaJEiRIBAgQIUaJEiRIBAgQIUaJEiRo3
    bty4UaJEiRIBAgQIUaJEiRIBAgQIUaJEiRo3btw4UaJEiRIBAgQIUaJEiRIB
    AgQIUaJEiRI3btw4UaJEiRIFAgQoUaJEiRIFAgQoUaJEiRI3btw4UaJEiRIl
    AoQoUaJEiRIlAoQoUaL+RIkSN27cOFGiRIkSJUqUKFGiRIkSJUqUKFGiRIkS
    N27cKFGiRIkSJUqUKFGiRIkSJUqUKFGiRIkSNW7cKFGiRIkSBUqUKFGiRIkS
    JUoUKFGiRIkSNW7cKFGiRIkSBUqUKFGiRIkSJUoUKFGiRIkSNW7cKFGiRIkS
    AUqUKFGiRIkSJUoUKFGiRIkSNW7cKFGiRIkCIUqUKFGiRIkSJUoUCFGiRIkS
    NW7cKFGiRIkAJUqUKFGiRIkSJUoUCFCiRIkSNW7cKFGiQIAAIUqUKFGiRIkS
    JUqUCBAgQIkSNW7cOFGiRIkSAUqUKFGiRIkSJUoUCFGiRIkSN27cOFGiRIn+
    EgVClChRokSJEiVKBChRokSJEjdu3DhRokSJEiUClChRokSJEiUKhChRokSJ
    Ejdu3DhRokSJEiUChChRokSJEiUChChRokSJEjdu3LhRokSJEiUKBAhRokSJ
    EgEClChRokSJGjdu3LhRokSJEiVKBBgQIECAAANClChRokSJGjdu3LhxokSJ
    EiVKFJgQIECACANKlChRokSJGzdu3LhxokSJEiVKlIgwYcKECSFKlChRokSJ
    Gzdu3Lhxo0SJEiVKlCgQYcKEACVKlChRokSNGzdu3Lhxo0SJEiVKlChRIECA
    EiVKlChRokSNGzdu3Lhx40SJEiVKlChRokSJEiX+SpQoUaLEjRs3bty4ceNG
    iRIlSpQoUaJEiRIlSpQoUaLGjRs3bty4cePGiRIlSpQoUaJEiRIlSpQoUeLG
    jRs3bty4cePGjRIlSpQoUaJEiRIlSpQoUePGjRs3bty4cePGjRslSpQoUaJE
    iRIlSpSocePGjRs3bty4cePGjRs3SpQoUaJEiRIlStS4cePGjRs3bty4cePG
    jRs3bpQoUaJEiRIlaty4cePGjRs3bty4cePGjRs3bty4UaJEiRo3bty4cePG
    jRs3bty4cePGjRs3bty4cePGjRs3bty4cePGjRs3bty4cePGjRs3bty4cePG
    jRs3bty4cePGjRs3btwbuHHjxo0bN27cuHHjxo0bN27cuHHjxo0bN24FADs=
    ====

  5. At any point where you reach for base64, you should realize that something is going horribly wrong. Sometimes it's just legacy EBCDIC / 7bit crap and you've got no choice but.... *cough*

    The problem with XML is that programmers like being able to toss a file into a text editor and debug things, without needing to write a generalized binary viewer. So being fully-human-readable and human-tweakable was valued more than being a perfect container.

    I don't tend to think that using an illegal character is any better of an option than type and length information for preserving the ability to hack up an XML file without binary tools. You need to be able to work with it in an unmodified text editor -- you might as well just write a very good XML-like format with binary serialization.

    I think the problem with an XML container is that you CAN make it work without unescaping or other theatrics, but you MUST make sure that the data you are encapsulating is also well-formed. Which tends to be too much to ask from folks.

    I think the other problem is that nobody cooked up a notion of a very-standardized tar/jar/zip file so you could have text-XML documents or text-XML + wrapped binary documents and support them in a useful fashion.

    Kinda like being able to save a file that's just a MIME multipart.

  6. It is an absurd argument to say that "binary data can easily be included by converting it to ASCII data". It is no longer binary data at that point.

    Right? Right?
