Microsoft has just announced Office 2010. Surprisingly enough, it has genuinely interesting new features, most of them revolving around SharePoint and support for collaboration.
And, of course, The Cloud, where, to Microsoft’s credit, its product stands alone among those of the major software vendors in making serious use of the PC’s processing power instead of limiting the PC’s role to running a browser and a Citrix client.
But the Office/SharePoint combo is missing something essential: A design methodology for the unstructured data stores it helps you manage.
For structured data, the rules came out of IBM in the early 1970s via two of its researchers, Edgar Codd and Chris Date. While IBM’s enormous IMS customer base made it a bit slow on the uptake compared to rivals like Oracle (IMS, in case you aren’t familiar with it, was/is IBM’s hierarchical DBMS), it’s fair to say DBMS vendors took responsibility for developing the methodologies needed to make their products useful.
When it comes to unstructured data, though, we don’t even have a precise definition to clarify what we mean when we use the term.
Here’s mine, for whatever it’s worth: Unstructured data is data whose meaning isn’t of interest to the computer programs that process it.
If that isn’t clear: A relational data table might have a “Date of Birth” field. The data in this field has computer-processable meaning … for example, by subtracting it from the current date a computer program can calculate the age of each individual in the database with a non-null entry, and can then average those ages if anyone wants it to.
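To make that concrete, here's a minimal sketch of the calculation in Python. The rows, the names in them, and the "as of" date are all made up for illustration, and None stands in for a null entry; a real system would do this in the database rather than in application code.

```python
from datetime import date
from statistics import mean

# Hypothetical rows from a table with a "Date of Birth" field.
# None plays the role of a NULL entry.
rows = [
    {"name": "Ana", "date_of_birth": date(1970, 6, 15)},
    {"name": "Ben", "date_of_birth": None},
    {"name": "Cy", "date_of_birth": date(1990, 1, 1)},
]

def age_on(born, today):
    """Whole years elapsed between born and today."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)

def average_age(rows, today):
    """Average age across rows with a non-null date of birth, or None if there are none."""
    ages = [
        age_on(r["date_of_birth"], today)
        for r in rows
        if r["date_of_birth"] is not None
    ]
    return mean(ages) if ages else None

# A fixed "current date" keeps the result reproducible.
print(average_age(rows, date(2010, 1, 1)))  # prints 29.5
```

The point isn't the arithmetic. It's that the program can do arithmetic at all, because it knows the field holds a date.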
Compare that to the previous paragraph, or this entire tirade. Microsoft Word can store, manipulate, format, and render the text; likewise programs like Photoshop, Final Cut and WavePad for other forms of unstructured data. They can make unstructured data presentable and searchable, but they don’t interpret, analyze, or summarize it in any way parallel to what we can do with what’s in a relational database.
And the closest we have to an RDBMS for organizing our unstructured data is technology that encourages us to create a folder tree. That’s it. And since the history of the database management system suggests that fixed hierarchical definitions should give way to dynamically linkable ones, it’s a good bet single fixed folder trees are something we should be trying to retire.
Who ought to be solving this problem? You’d sure think the Knowledge Management industry would have considered it Job One.
You’d think so, but it didn’t, and if it hasn’t by now, there’s no reason to imagine it will figure it out any time soon.
So I guess I’ll have to do it. If I knew anything about the subject, it would look positively daunting. Luckily, I don’t. So here goes.
The tricky part will be moving beyond the venerable folder tree. The good news: We have all the tools we need, in the form of an ability to create multiple sets of categories and sub-categories, and to assign an unlimited number of categories and sub-categories to any file.
Look at a list of categories and subcategories sidewise and it turns into multiple, parallel folder trees. Assign multiple subcategories to a document and voila! You can find it in multiple subfolders.
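For the skeptical, here's a rough sketch of the idea in Python. Everything in it is hypothetical: the two category trees ("Department" and "Project"), the file names, and the helper functions are invented for illustration, and this isn't how SharePoint or any other product stores its categories.

```python
from collections import defaultdict

# Hypothetical documents, each assigned subcategories from more than one tree.
# A category path reads (tree, category, subcategory).
documents = {
    "q3-forecast.docx": [
        ("Department", "Finance", "Budgeting"),
        ("Project", "Office Upgrade", "Planning"),
    ],
    "rollout-plan.docx": [
        ("Department", "IT", "Desktop"),
        ("Project", "Office Upgrade", "Planning"),
    ],
}

def build_index(documents):
    """Map each category path, and each prefix of it, to the documents filed under it."""
    index = defaultdict(set)
    for name, paths in documents.items():
        for path in paths:
            for depth in range(1, len(path) + 1):
                index[path[:depth]].add(name)
    return index

def browse(index, *path):
    """List the documents in the virtual folder at path, sorted by name."""
    return sorted(index.get(tuple(path), set()))

index = build_index(documents)
print(browse(index, "Project", "Office Upgrade", "Planning"))
# prints ['q3-forecast.docx', 'rollout-plan.docx']
print(browse(index, "Department", "Finance"))
# prints ['q3-forecast.docx']
```

Each document is stored once and shows up wherever its categories say it should, which is the whole trick: the "folders" are just views over the assignments.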
It’s just what we need. Now we just have to define and use the category trees — an exercise I’ll cheerfully leave to those who classify human knowledge domains.
What’s left is who.
For enterprise databases, professional DBAs do everything except entering the data. But there’s an enormous difference between enterprise databases and enterprise unstructured data: applied to unstructured data, that model wouldn’t work at all.
First of all, while the volume of unstructured data is enormous, an unknown but significant fraction is ad hoc, short-lived, unimportant personal flotsam for which any enterprise-level management effort would be more stifling than helpful.
It’s at the other end of the scale that enterprise attention makes sense: for example, documents with multiple authors and reviewers, which most organizations currently manage so ineffectively that at any given moment nobody is certain which version is the current source of truth (if any is), let alone where to find it.
But we sure don’t want to create a class of document definition professionals with DBA-like authority … professionals who would be the only people in the enterprise allowed to create a new document, define its table of contents, and assign the appropriate categories.
It’s establishing the category trees that should be an enterprise responsibility. Call them the Chart of Accounts for unstructured data. Control that carefully and provide ample education. Muddling through the rest will probably work pretty well.
Better, at least, than the alternatives.