Well Adam, thanks for pointing both George Ou and John Carrol out. How fair and balanced of you. In the recent past, Adam has asked me if there was anything that XML could do that CSV couldn't. I'm been working on a longer post about that, but George managed to set me off big about this topic, so I've inlined those ideas here.
Here's my answer to George's FUD. Have you, personally, ever tried to shred and reassemble an MS word document from its shredded form? If you had, I'd love to know the following:
- How did you choose to normalize it and to what degree?
- How long did you spend writing the logic to shred it to normal form?
- How long did you spend writing the logic to pull it back out?
- How good is your fidelity to the original?
Hum, not so easy, is it? So now I'll take a turn answering those questions in the form of what SQL Server 2005's XML Data Type can help you do.
How did you choose to normalize it and to what degree?
I wholly disagree with some people (whom have very much earned by respect because of their research and results in this topic space) that normalization is over-rated. Very clearly its the best tool we have going today for managing tabular data with relationships. Documents aren't that in and of themselves, so normalizing them makes no sense to me in the first place. Documents, although they may have many datum within them, are semantically atomic. That is, the more you try to break them up into smaller parts, the harder you make it on both the management process and the end-user to re-establish their meaning. In this sense, storing any document regardless of its format passes the first normal form test of no duplicate data. You do need to have, of course, a primary key value for the document to be related to at the row level. That is hard if you're going to insist you can only use actual datum values as the primary key value since non-scalar types don't given much in terms efficiency. However, if you're willing to have surrogate key -- like you'd find in most data warehouses -- that issue is easily solved.
By the way, this also passes muster for the second normal form since all of the attributes are functionally dependent on the primary key. It also passes for third normal form since, at least order to preserve the meaning of the document, values not dependent on the primary key have been eliminated. Most folks that I've dealt with have said that third normal form is good enough for general use.
So fine, you ask, why is preserving meaning so important? Consider the following example, then think about how you might choose to store and retrieve it.
Data [1]: <a>Operations Manual for Machine X-01</a><warning>Do <bold>not</bold> push the Red button on Console tagged CMD-1. The machine will <bold>explode.</bold></warning>
Query [2]: //warning
You certainly wouldn't want the following coming back out, right?
Result [3]: <warning>Do push the Red button on Console tagged CMD-1. The machine will <bold>not</bold><bold>explode.</bold></warning>
This is where XML technologies actually provides something at a lower cost than many other choices. For both XPath and XQuery, the default return order of a sequence is document order. That means that if I store [1] and do a default query [2] I get back results that preserve the semantics of the data.
How long did you spend writing the logic to shred it to normal form?
For the simple example shown in [1], writing a parser that gives me each word and its index relative to the start of the start of the document isn't that hard. I could likely do one in Perl in a few minutes of coding and testing. But then we're not talking limiting ourselves to documents that are just that simple, we're talking about Word documents that have styles and formatting and TOC entries and a literally dozens of other variances that make writing not such a trivial exercise. Of course, it could be easier if the Word APIs were exposed to Query Processor so you could leverage the work already done by Microsoft. But that assumes that you'd be willing to embrace SQLCLR to do that. In any case, you're going to have at least some time spend doing analysis, some time spent doing design, time spent coding and testing. And let's not forget that you'll need to keep that code up to date. Better plan at least a few hours of work in case. And don't forget that when your parser is done, you still need to write the logic that loads the data and uses it to store the data.
And how much time do I have to spend writing a parser to feed an XML formatted Word Document into an XML typed column. Zero. True, I do have to write the logic load the document. That takes one T-SQL statement for me [4].
Query [4]: insert into namespace.table_name(column_name) select * from openrowset(bulk N'path_to_file',SINGLE_BLOB) as document
How long did you spend writing the logic to pull it back out?
I really have to wonder if George or anybody else that's on his bandwagon really has ever tried doing this. And if they have, would they choose to do it again? I know I have, and I know I'd not choose to it again. Getting the words out in the right order? Easy, since my design had each word and each bit of punctuation as elements in a list, and I had a column that stored the index relative to the beginning of the document, and other column that stored the index relative to the first word of a paragraph. I also had a column that referred to another table that described formatting codes. There was yet another table, with about a dozen rows per document, that I stored properties in.
Initially, my goal was to make easy on myself to generate a traditional index of the document. Pretty easy task, since I knew exactly where every word was and could mathematically figure out what page it was on. Sure, I had to filter off a lot of noise words, but it was pretty accurate. But giving the user back the whole document... not so easy, nor was it terribly performant, since as you can imagine, there was a lot of joins to be done just to get the formatting back. I don't recall having every addressed how to store trivial things like pictures, or tables, let alone complex things like charts. I'm also pretty sure that it was this project that pretty much got me started drinking straight Whiskey regularly.
So how does using XML as the document format make this easier? In and of itself, it doesn't. But the fact that I have stored as XML means I can use XPath and XQuery to access the whole document or any part of it much like I would using T-SQL queries over a table but without the coding and upkeep complexities of those joins. It also means that I don't have to care as much about what the objects are in the document, provided they are also serialized to inline XML (e.g, SVG for charts) With a good understanding of XQuery and a well thought XML Indexing strategy, I have reasonably performing queries at a fraction of the development cost that I spent doing this in T-SQL.
How good is your fidelity to the original?
One of the biggest problems I had my solution was that while I could pull the right words in the right order with the right formatting out as a series of paragraphs, the documents I produced never compared correctly to the original (thus always leaving some doubt in others minds that my solution really worked), nor would it always load right into word. The major problems I had to deal with is that everybody working on the project seemed to have a hard time formatting their inputs correctly. Some people would double space after a period. Others would indent with a tab while others used spaces. So on and so on and so on. Toward the end, my parsing program had more exceptions to the rules it implemented than it had implemented rules. But it was specifically because I had to normalize the text to a normalized database that, in the end, I was never able to achieve 100% output fidelity with the original. Reinventing the data model to fit every exception is rare luxury indeed.
I wish I could say that the XML Data Type in SQL Server solved this problem. It doesn't. Like I had to do, it also intentionally discards some of the information from the document in favor of its data model, the XML Infoset. The good news is that what typically gets discarded is far less likely to cause a noticeable fidelity issue that you could probably achieve rolling your own solution. For example, XML InfoSet normally discards insignificant white space in node values: e.g, <a> four score and seven years ago </a> becomes <a>four score and seven years ago</a>. There's switch to override this specific behavior, too.
Issues like the trimming of insignificant white space shouldn't be downplayed though. There are many use cases, such as where one is storing the typical memo-from-the-management type of document that insignificant white space removal isn't an issue at all. However, if you're storing the poetry of e. e. Cummings, you'd better be more careful since he is well known for using "insignificant" white spacing for its semantic effect. Doubtful, you say? Go look at "i have found what you are like."
For me, there's essentially five tests that a scenario needs to significantly pass in order to be a "good" use case of storing XML documents in a database in the first place:
- Is the document worth saving in the first place? It never ceases to amaze me how much information gets through into any system simply because its there rather than because its a right thing to do. How many organizations created Intranets just as pretty file storage? How much stuff get dumped into SharePoint sites simply to never see the light of day again? Again, don't get me wrong, of course there's value to being able to find information when you need it, but one thing I will agree with George about is that, as a culture, XML is getting used to resell caskets for the data graveyard. That's not the technologies fault though. The crux of this test is what business value is generated by committing the document to database. Committing data that never gets used is a waste regardless of its format.
- Do any of the other benefits of SQL Server storage get something don't have or have as well elsewhere else? This shouldn't be hard test to administer: Does SQL Server give a better backup and restore scenario than you have now? Do you get some new or extra security from having the data in SQL Server rather than elsewhere? The more or the more strongly you answer these questions and questions like them yes, the better choice storing the document in SQL Server becomes.
- Do you need to integrate information stored in document into other query results? To me this speaks directly towards the value of using XML over many other formats. I believe you'd be hard pressed to find a technology supported by SQL Server that more effectively and efficiently allows developers and users to leverage the information in a document as part of a query than XML and XQuery. Sure, SQL Full Text Search works great but it can't do the kind of path matching that XPath does. Sure, you use the string functions of Text BLOBS, but that's typical more code that less efficient than what you can achieve with XQuery, I think. Two warnings though: first, the efficiency of XQuery depends greatly on having at least a primary XML index in place and benefits from having the right kind of secondary XML index in place as well. In effect, your trading disk space for CPU time (which is true of any index). Second: because XML is a composite of many scalar values -- especially when you're speaking of storing documents as XML -- processing queries against are simply going to be slower than processing queries against scalar types. In other words, even with lighting fast disks, huge amounts of RAM and all the indexes you could want, never expect a query against XML typed data to be better than you get against any other data type. In most cases, it will be at least an order or magnitude slower at least.
- How much or little DML do you perform against the document? Given the issue I just covered, I wouldn't advise storing documents in a database and expecting them perform well at all if you are perform any significant amount of inserts, updates and deletes against at the node level using XQuery. Although the performance of such operations has dramatically improved since the early betas of SQL Server 2005, the "one-change-at-time" and the locking issues (both the hosting row and the indexes for the instance are locking during XMLDML operations) can become performance issues quickly. My suggestion is that if you have a lot of volatile information you want to incorporate into documents, don't store the documents as XML. Rather, try to generate them using FOR XML Path queries against scalar data in normalized tables.
- Do the benefits of storing it as XML exceed the costs of storing of XML relative to those factors for a BLOB? As powerful and as attractive as XQuery and XML Indexing makes the use of XML documents stored in the database, there's a few gotchas. First, because text data is stored internal as UTF-16, its not uncommon to see a "doubling" of the documents size when stored. Of course, that effective is regional and if you're storing your documents encoding in UTF-16 on the file system anyway, chances are, you won't notice this. But if you're like me, you might stunned to see disk space go by 180k when you store a 100k document in the database. What's worse is that each document added to an XML column covered by an primary XML index typically grows by 1.5x to 2.5x the size of the document. So its not hard to have a 100k document taking up 400k or 500k of disk space when stored. Of course, you do get a fair amount of extra functionality over say, storing the document in XML Format as a VarChar(max). However, if disk space is a concern, like it is for some SKUs of SQL Server 2005, you might be better off not using it, but rather, storing the XML as large text blob, casting it to XML and then processing it. You'll lose the performance help of the Indexing, of course.
Don't get me wrong though. These answers apply specifically and only to document-centric XML as generated by Microsoft Word. Do some of these apply to other use cases? Sure, but not all of them. And I'd certainly say that if you're storing tabular data in Word document, you've got serious misunderstandings of the appropriate use cases between a Document and Database.
So armed with this knowledge, let's see if we can answer some of Georges questions and objections.
Well that's all fine and dandy but how is this a good thing if I've got a business to run?
I hope you're trying to be funny George, because if actually got a business to run and your down at this trying to get down to this level of depth on a technical issues, either your doing a crappy job running your business and don't have anything else to do or your business is selling relational databases that don't have XML supporting features. Yes, I know, you simply rattled this list of to get our attention and help spread the anti-XML FUD.
What does XML do for me that I can't do with existing binary formats and existing RDBMS formats?
- In the context of SQL Server 2005, it allows me to avoid much of the coding work of loading that data into a processing system.
- It allows me to efficient and effective query documents (either on the whole or by parts) and join those results into other queries.
- It allows me to, in platform and vendor neutral way, share documents, techniques and technologies. Sure, SQL is standard. How many vendors have bothered to actually implemented their systems to support it? How many years does it take the average IT professional to become equally Expert in T-SQL and PL/SQL? At least XQuery looks to be "learn once, use many."
- It allows me to embed schema information inline that can be used to perform ordinal, cardinal and referential integrity checking prior to use and processing. Let's see CSV do that. In fact, I am looking for any information about any IT industry-wide body that's produced a standard for CSV period.
- In the SQL Server 2005 context at least, much of that same schema can be carried forward into the database to make sure the XML remains valid. Note that, however, the currently published XML schemas for office documents aren't supported by SQL Server 2005. Hopefully this can be quickly corrected.
What kind of ROI (return on investment) can possibly justify the massive retrofit of my entire IT infrastructure to be "XML-friendly"?
My oh my, you must really be scare of something to paint such a plainly wrong and misleading picture of reality. The fact is there's absolutely nothing caused by XML in and of itself that has driven the rapid consumption of disk space, increases demand for bandwidth anything else on your list of horrors. I suspect that if you go into any good sized shop, the drivers for increased disk storage is the increasing replacement of non-electronic files for electronic files. XML didn't cause that. Bandwidth consumption has gone up because the Internet is a valuable business tool. XML didn't cause that either. The need for XML compression gear is hardly proven for even for major enterprises, yet you make it sounds like the Mom and Pop gas station is going to have to decide between a $25,000 gateway or providing medical insurance for their employees. FUD, FUD, FUD and more FUD.
Probably the best ROI that most of us will ever get of XML, or any technology for that matter, is a couple of books and the time spent working with the it.
Oh, but it's human-readable! Umm, has anyone ever tried to read a raw XML document before they espouse the human-readability of XML? Maybe some Vi guru can call it human readable but I sure can't and I don't know of normal person who would. A simple CSV file that is at least 10 times smaller is infinitely more human-readable than XML.
Too easy. Which of these tells you more about the data represented?
Paragraph
Oh, but it's human-readable! Umm, has anyone ever tried to read a raw XML document before they espouse the human-readability of XML? Maybe some Vi guru can call it human readable but I sure can't and I don't know of normal person who would. A simple CSV file that is at least 10 times smaller is infinitely more human-readable than XML.
vs.
<paragraph>Oh, but it's human-readable! Umm, has anyone ever tried to read a raw XML document before they espouse the human-readability of XML? Maybe some Vi guru can call it human readable but I sure can't and I don't know of normal person who would. A simple CSV file that is at least 10 times smaller is infinitely more human-readable than XML.</paragraph>
Remember, you're the one that started by asking why XML was better for documents. And here's a question for you. How many paragraphs are there in the text? If CSV means "common separated values" there's two. One that starts with "Oh" and one that starts with "but." Where's the improved readability again?
By the way, the XML version is just over 4.14% larger than the CSV. 4.14% is a long ways from 10 times larger, isn't it?
No normal person is interested in reading raw XML even if they're guilty of spreading the myth of XML human-readability.
Sure, I'll grant you that reading XML isn't as much fun as reading Ted Kooser's poetry or James Bean's XML for Data Architects. But then, neither is reading FUD like this.
What then is achieved by taking a massive detour with bloated XML documents when they ultimately have to be processed into something else before they can be parsed by the CPU or human?
- The ability to de-couple the data from the presentation. If you don't see that value in that then how can you possibly justify even using database to begin with? Or CSV for that matter? XML just happens to have the benefit being designed from the ground up to support Unicode and mark-up, two things that make the de-coupling process lots easier to cope with in the 21st century.
- The efficiency of writing the programs that leverage existing, universal pattern based frameworks. That I can have a Saxon and DOM Java Programmer be productive a few days with C#, XmlDocument and XmlReader/XMLWriter has obvious business value. It is too bad that nobody decided to push for these aspects for CSV as a real standard. Its also sad that either Sun or Microsoft has added first-class CSV APIs to the frameworks. What's the sense in crying about that now. Oh, that's right, it helps you spread your FUD.
- The effectiveness and efficiency of having data expressed in a format that more than one program can use, that more than one version of the program can use and that the same programs running on different platforms with different CPU byte structurings can use. The duh factor on this should be high: if the data isn't sustainable, the applications aren't going to be either.
Oh and just for the record, George, I don't believe that XML caused Hurricane Katrina, is a terrorist plot to gunk up our networks or is part of a conspiracy by the CIA to condition us for alien assimilation either. Harder to prove those things either way. But -- and just in case -- please don't take that as advice from me that its safe to take your Aluminum hat off. I might be wrong.
And just for the record, I don't we should call XML self-describing unless there's both a schema reference or in-line and and there's OWL instance documents to back it up.