A co-worker of mine, John Nielsen, forwarded me this post from xmldatabases.org, and I felt compelled to respond.  John is an extremely talented developer and author in the Python space.  As you might expect, he is also a little progressive in his thoughts about Computer Science.  I don’t think he sees Microsoft and Bill Gates as the Anti-Christ, but he does have a more open source bent. 

 

John and I have had many discussions about where things are going in technology, especially with regard to SOA and databases, two of my favorite technology topics.  Anyway, here are some thoughts about this post.  I’ve included my thoughts in italics and bold.

 

1.      Do not think in terms of programming languages, XML is a data format. It is NOT a serialization of programming language structures so don't treat it as such. Yes, we learned this in Relational Databases long ago (well I guess I know some developers that haven’t, but on the whole…) In REST based systems you're moving and manipulating data, you're not making method calls against a remote object. The fact programming languages are involved at all is a necessary evil. The data is what matters, programming languages are just tools to help work with that data.  Yes!  Finally someone gets it!

2.      XML Schemas are to act as documentation and spot validation, do not use them as a straitjacket around the data. W3C XML Schema driven systems are horribly brittle in operation because they don't bend. If the data doesn't exactly match what is expected they blow up, even on something as simple and non-critical as an extra attribute from a different schema being added to an element. Yes, validate the data when you need to, but do not overly constrain it.  This is a recipe for disaster and is a major flaw for XML databases.  In the relational database management world, we know that if you don’t constrain the data to the business rules, data quality will suffer.  XML is brittle and does not provide good constraining capabilities.

3.      For the data you receive just take what you need and pass through everything else. If there's extra data in the stream it is NOT an error as long as all the other data you need is there as well. Just ignore the extra data. This is one reason why XML Schema based systems are so fragile, they take too literal of a view about the shape of the data.  Of course, you could just limit the data you request, such as is possible with a good data access language like SQL.  XPath is a good stab at a declarative language for XML, but it still has some growing up to do.

4.      As a corollary to #3, it is also not necessarily an error when non-critical data is missing from the stream. If you can do a little extra work to make the operation succeed with missing data, do the work. Um yeah, it’s called an outer join in the set-based world.  And the DBMS does all the heavy lifting for you.

5.      The network can and will disappear. Take advantage of the clean data packet XML provides to queue the request for later execution. Anytime there's no user sitting waiting for the data it's probably a good idea to use this kind of asynchronous architecture.   Transaction Logs are wonderful things!

6.      Your data is going to change, that's why you're using XML to begin with. Do NOT lead your clients down a path that leads them to overly coupling their systems to your changing data. I.e. do not encourage them to use some kind of XML data binding tool that's strongly coupled to a particular schema. This also means you need to be very selective in what you consider required data. Adding additional required data to an existing operation is going to cause breakage. If data is not absolutely 100% for certain required then just make it optional. Adding additional optional data should never break the system. Another reason XML Schema based systems are so horribly brittle is they tend to break on any schema change, not just required data changes.  This is another area where relational/SQL-based systems are still superior.  Changing data structures is easy and fast, and tends not to break well-written applications.

7.      Do not require more data than necessary in a request. Yet another limitation of schema based systems is that you can be forced to carry a bunch of context data with a request just so that it passes validation, even when the data is not truly required for the request.  See thoughts on point 3.

8.      Use XML tools as much as possible. XPath and XSL-T are enormously powerful tools for working with XML data, traditional programming languages are generally very poor at working with XML. Trying to bend traditional programming language techniques to working with XML data is what leads to horrible systems like W3C XML Schema and static binding of programming language objects to schemas. If you want to build truly robust systems do not fall into this trap.  Yes these are powerful tools, but they have some catching up to do with SQL.  As a declarative language there are few things I CAN’T do to request, transform, and manipulate data.  Not all SQL implementations are good performers at everything that I need to do, but it’s possible and not difficult.

9.      Build software that bends. Think in terms of malleable data and disappearing services. In general, visualize the system as an evolving, changing entity that cannot be constrained. Do NOT try to constrain it, if you do you'll just end up with a system that falls over at the smallest inconsistency. Validate what you need, pass through anything you don't need.  Again, if you don’t constrain it, data quality will suffer.  If you are talking about programs that just pass the data through to an application, that has to be validated, too.  Oftentimes I see people just shredding XML documents and stuffing them into relational databases, or <gasp> adding a LOB field and sticking the XML document in it.  If XML isn’t capable of constraining the data and there is no validation before sticking it into the database, problems will arise.  Your application might not break, but how about the next guy down the line, or the business user who has to report on that data?  I pity the ETL developer or data steward in the Warehouse who has to cleanse that data.

10. Maximize flexibility, do not fall into the trap of believing that a fully specified and constrained system will be more robust. It won't be, it will be brittle and prone to constant operational failure in the real world. Accept that the system and the environment where it's going to operate is going to change and build for that eventuality.   Flexibility is often given up in exchange for scalability.  However, relational databases tend to give you a happy medium.
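
To make a few of my comments above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customer and phone tables, their columns, and the sample rows are all hypothetical, invented purely for illustration. It is only meant to show the general shape of the argument: constraints reject bad data (points 2 and 9), a query asks only for the columns it needs and an outer join tolerates missing optional data (points 3 and 4), and adding a column leaves an existing query untouched (point 6).

```python
import sqlite3

# In-memory database; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE customer (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT NOT NULL UNIQUE
    );
    CREATE TABLE phone (
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        number      TEXT NOT NULL CHECK (length(number) >= 7)
    );
""")
conn.execute("INSERT INTO customer (id, name, email) VALUES (1, 'Ann', 'ann@example.com')")
conn.execute("INSERT INTO customer (id, name, email) VALUES (2, 'Bob', 'bob@example.com')")
conn.execute("INSERT INTO phone (customer_id, number) VALUES (1, '555-0100')")


def try_insert(sql, params):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# Points 2 and 9: the database itself refuses data that breaks the rules.
bad_phone_ok = try_insert(
    "INSERT INTO phone (customer_id, number) VALUES (?, ?)", (2, "123"))      # too short
orphan_ok = try_insert(
    "INSERT INTO phone (customer_id, number) VALUES (?, ?)", (99, "555-0199"))  # no such customer

# Points 3 and 4: name only the columns you need, and let a LEFT JOIN
# return the customer even when the optional phone row is missing.
QUERY = """
    SELECT c.name, p.number
    FROM customer AS c
    LEFT JOIN phone AS p ON p.customer_id = c.id
    ORDER BY c.id
"""
before = conn.execute(QUERY).fetchall()

# Point 6: adding an optional column does not disturb a query that
# names its columns explicitly.
conn.execute("ALTER TABLE customer ADD COLUMN loyalty_tier TEXT")
after = conn.execute(QUERY).fetchall()

print(bad_phone_ok, orphan_ok)
print(before)
print(after == before)
```

The query names its columns rather than using SELECT *, which is the habit I have in mind when I say "well written applications"; an application that depends on column position or count would not fare as well after the schema change.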

 

All this isn’t to say that I’m anti-XML.  On the contrary, I think it’s one of the most important technologies of the past decade.  It just isn’t a hammer and the world isn’t a nail.  I wouldn't use SQL to develop GUIs, would I (could I?...hmmmm.... - :)  ) Let’s use XML for what it was designed for – messaging and inter-process communications.  As an industry we ditched hierarchical and network databases many years ago (see IMS, IDMS, et al), and for good reason.   
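
As a rough illustration of that division of labor, the sketch below treats XML as the message and the relational database as the place where the rules live. It uses Python's standard xml.etree.ElementTree and sqlite3 modules. The order message, its element names, the orders table, and the helper functions read_order and store_order are all hypothetical, made up for this example. The reader takes only the fields it needs and ignores everything else in the message, and the table constraints do the validating before anything is stored.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical message: it carries extra, namespaced data we do not care about.
MESSAGE = """
<order id="1001" xmlns:x="urn:example:extra" x:trace="abc">
  <customer>Ann</customer>
  <quantity>3</quantity>
  <x:giftwrap>true</x:giftwrap>
</order>
"""


def read_order(xml_text):
    """Pull out only the fields we need; unknown elements and attributes are ignored."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": int(root.get("id")),
        "customer": root.findtext("customer"),
        "quantity": int(root.findtext("quantity")),
        "note": root.findtext("note"),  # optional; None when absent
    }


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id       INTEGER PRIMARY KEY,
        customer TEXT    NOT NULL,
        quantity INTEGER NOT NULL CHECK (quantity > 0),
        note     TEXT
    )
""")


def store_order(order):
    """Insert one order; return False if a table constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (id, customer, quantity, note) "
                "VALUES (:id, :customer, :quantity, :note)",
                order,
            )
        return True
    except sqlite3.IntegrityError:
        return False


good = read_order(MESSAGE)
stored = store_order(good)

# A message that parses fine but violates a business rule (quantity must be > 0).
bad = dict(good, id=1002, quantity=0)
rejected = not store_order(bad)

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(good, stored, rejected, row_count)
```

The extra attribute and the gift-wrap element never cause a failure, which is the flexibility the original post is after, while the zero-quantity order never reaches the table, which is the data quality I keep going on about.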

 

Jon