Unstructured data design – the missing methodology


Microsoft has just announced Office 2010. Surprisingly enough, it has genuinely interesting new features, most of them revolving around SharePoint and support for collaboration.

And, of course, The Cloud. To Microsoft’s credit, its product is alone among those of the major software vendors in making serious use of the PC’s processing power instead of limiting the PC’s role to running a browser and a Citrix client.

But the Office/SharePoint combo is missing something essential: A design methodology for the unstructured data stores it helps you manage.

For structured data, the rules came out of IBM in the early 1970s via two of its researchers, Edgar Codd and Chris Date. While IBM’s enormous IMS customer base made it a bit slow on the uptake compared to rivals like Oracle (IMS, in case you aren’t familiar with it, was/is IBM’s hierarchical DBMS), it’s fair to say DBMS vendors took responsibility for developing the methodologies needed to make their products useful.

When it comes to unstructured data, though, we don’t even have a precise definition to clarify what we mean when we use the term.

Here’s mine, for whatever it’s worth: Unstructured data is data whose meaning isn’t of interest to the computer programs that process it.

If that isn’t clear: A relational data table might have a “Date of Birth” field. The data in this field has computer-processable meaning … for example, by subtracting it from the current date a computer program can calculate the age of each individual in the database with a non-null entry, and can then average those ages if anyone wants it to.
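To make the Date of Birth example concrete, here is a minimal sketch in Python. The sample rows and the function names `age_on` and `average_age` are assumptions made up for illustration; a real database would do the same arithmetic in a query.

```python
from datetime import date
from typing import Optional


def age_on(birth: date, today: date) -> int:
    """Whole years elapsed between birth and today."""
    years = today.year - birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years


def average_age(births: list[Optional[date]], today: date) -> Optional[float]:
    """Average age over the non-null entries; None if there are none."""
    ages = [age_on(b, today) for b in births if b is not None]
    return sum(ages) / len(ages) if ages else None


# Hypothetical rows: the second person has no date of birth on record.
rows = [date(1970, 6, 15), None, date(1990, 1, 1)]
print(average_age(rows, today=date(2010, 6, 1)))  # prints 29.5
```

The program can do this only because it knows what the field means. Nothing comparable is possible with a paragraph of text.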

Compare that to the previous paragraph, or this entire tirade. Microsoft Word can store, manipulate, format, and render the text; likewise programs like Photoshop, Final Cut and WavePad for other forms of unstructured data. They can make unstructured data presentable and searchable, but they don’t interpret, analyze, or summarize it in any way parallel to what we can do with what’s in a relational database.

And the closest we have to an RDBMS for organizing our unstructured data is technology that encourages us to create a folder tree. That’s it. And since the history of the database management system suggests that fixed hierarchical definitions should give way to dynamically linkable ones, it’s a good bet single fixed folder trees are something we should be trying to retire.

Who ought to be solving this problem? You’d sure think the Knowledge Management industry would have considered it Job One.

You’d think so, but it didn’t, and if it hasn’t by now, there’s no reason to imagine it will figure it out any time soon.

So I guess I’ll have to do it. If I knew anything about the subject, it would look positively daunting. Luckily, I don’t. So here goes.

The tricky part will be moving beyond the venerable folder tree. The good news: We have all the tools we need, in the form of an ability to create multiple sets of categories and sub-categories, and to assign an unlimited number of categories and sub-categories to any file.

Look at a list of categories and subcategories sidewise and it turns into multiple, parallel folder trees. Assign multiple subcategories to a document and voila! You can find it in multiple subfolders.
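A rough sketch of the idea, in Python, assuming each category is stored as a path within one of several independent trees. The class name `CategoryIndex`, the tree names, and the file names are all invented for illustration and don't correspond to any particular product.

```python
from collections import defaultdict

# A category is a path within one tree, e.g. ("Customer", "Acme", "Contracts").
Category = tuple[str, ...]


class CategoryIndex:
    """Maps categories to files so one file can live in many 'folders'."""

    def __init__(self) -> None:
        self._files_by_category: dict[Category, set[str]] = defaultdict(set)

    def assign(self, filename: str, *categories: Category) -> None:
        """Attach any number of categories to a file."""
        for category in categories:
            self._files_by_category[category].add(filename)

    def browse(self, prefix: Category) -> list[str]:
        """List files at or below prefix, like opening a folder and its subfolders."""
        found: set[str] = set()
        for category, files in self._files_by_category.items():
            if category[:len(prefix)] == prefix:
                found |= files
        return sorted(found)


index = CategoryIndex()
index.assign("acme-renewal.doc",
             ("Customer", "Acme", "Contracts"),
             ("Department", "Legal"),
             ("Year", "2009"))
index.assign("legal-budget.xls", ("Department", "Legal"), ("Year", "2009"))

print(index.browse(("Customer", "Acme")))     # ['acme-renewal.doc']
print(index.browse(("Department", "Legal")))  # ['acme-renewal.doc', 'legal-budget.xls']
```

The same document turns up under the customer tree, the department tree, and the year tree without being copied into three places.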

It’s just what we need. Now we just have to define and use the category trees — an exercise I’ll cheerfully leave to those who classify human knowledge domains.

What’s left is who.

For enterprise databases, professional DBAs do everything except entering the data. But there’s an enormous difference between enterprise databases and enterprise unstructured data: For unstructured data, this wouldn’t work at all.

First of all, while the volume of unstructured data is enormous, an unknown but significant fraction is ad hoc, short-lived, unimportant personal flotsam for which any enterprise-level management effort would be more stifling than helpful.

It’s at the other end of the scale that enterprise attention makes sense — for example, documents with multiple authors and reviewers that right now are managed so ineffectively in most organizations that at any given moment, nobody is even certain which version is the current source of truth — if one is — let alone where to find it.

But we sure don’t want to create a class of document definition professionals with DBA-like authority … professionals who would be the only people in the enterprise allowed to create a new document, define its table of contents, and assign the appropriate categories.

It’s establishing the category trees that should be an enterprise responsibility. Call them the Chart of Accounts for unstructured data. Control that carefully and provide ample education. Muddling through the remaining work will probably work pretty well.
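One way to picture that division of labor, as a hedged sketch in Python: the enterprise owns the trees, and anyone may assign a category to a file so long as the category exists in a tree. The tree contents and the names `CATEGORY_TREES`, `is_valid`, and `assign` are assumptions for illustration only.

```python
# Enterprise-controlled category trees (hypothetical contents).
CATEGORY_TREES = {
    "Department": {"Legal": {}, "Finance": {"Payroll": {}}},
    "DocumentType": {"Contract": {}, "Memo": {}},
}


def is_valid(path: tuple[str, ...]) -> bool:
    """True if path names a node in one of the controlled trees."""
    node = CATEGORY_TREES
    for name in path:
        if name not in node:
            return False
        node = node[name]
    return len(path) > 0


def assign(assignments: dict[str, set], filename: str, path: tuple[str, ...]) -> None:
    """Record a category for a file, rejecting anything outside the trees."""
    if not is_valid(path):
        raise ValueError(f"unknown category: {'/'.join(path)}")
    assignments.setdefault(filename, set()).add(path)


assignments: dict[str, set] = {}
# Any user can do this; no document professional is involved.
assign(assignments, "q3-payroll.xls", ("Department", "Finance", "Payroll"))
assign(assignments, "q3-payroll.xls", ("DocumentType", "Memo"))
```

Only changes to `CATEGORY_TREES` would need central control. Creating documents and assigning categories stays with the people doing the work.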

Better, at least, than the alternatives.

Comments (29)

  • Bob:

    I think you are missing a tremendous amount of research, with accompanying software programs. Think of all those programs to “index your hard drive” — Microsoft, Google, Yahoo, Copernicus, et al. Also, think of all those programs that purport to “index” the web — Google, Yahoo, Bing, et al. Both make major strides in and among “unstructured data.”
    Finally, think of efforts like Wolfram, that even tries to work with such data sources to produce spreadsheets, charts, and other analytical output, beyond just retrieval.

    I think there is a fair amount of work on handling “unstructured data.”

    • Just an opinion: Search isn’t the same as effective data management. In fact, its importance is probably inversely proportional to effective data management: The more poorly data are organized, the more you need search.

      What Wolfram is doing is interesting stuff, but I don’t think it offers much help to an organization that’s drowning in uncontrolled documents, film clips, photos and so on.

  • Bob,
    I don’t understand what’s inadequate about the concept of metadata that can be associated with any “file” containing unstructured data. The unstructured data problem is that we can’t anticipate how (or if) someone else might wish to refer to our “carefully considered and artfully expressed musings.” Thus, search engines.

    Within certain disciplines there is often some pre-definition of certain metadata elements. For documents, the Dublin Core is a reasonable starting point. Of course, this requires enough discipline on the part of the author to actually provide the metadata, although there are some engines capable of recognizing and extracting obvious topics. It’s not perfect, but a structured approach to categorizing unstructured information is not going to make any sense. (There’s something comforting in the phrase “structured unstructured information” that suggests that the world cannot be subdivided into strict Cartesian subsets.) There’s a corresponding piece that is also missing — the ability of the author(s) to control who may access the information and to control whether the information is even visible in search results to someone who isn’t authorized to view it.

    Incidentally, there are active efforts underway within parts of the U.S. government to provide metadata definitions for certain domains. Please refer to NIEM and UCORE for examples.

    What am I missing in your appeal for a design paradigm for unstructured data?


    • I don’t claim to have deep knowledge of this domain. What I do know is that as a practical matter, organizations don’t have anything remotely corresponding to third-normal-form design for document storage.

      To the best of my admittedly limited knowledge, most of the work that’s been done in this area is in the internal structuring of documents, not in the nuts and bolts of storing and retrieving it in an organized way.

      Put it differently: Every company I’ve seen considers its unstructured data to be an unmanageable mess, and none of them even know where to look for guidance on how to get their arms around it. That tells me the problems that have been solved probably aren’t the problems companies have.

  • I’m a bit biased because my company is a dealer for the product, but something like Brightech’s Mediabeacon seems like an answer to your unstructured data problem. It’s a digital asset management app. There are others such as Web Native, Mediabin and Cumulus. The key for most of these is multiple levels of taxonomy and tags tied to a number of XML schemes, notably XMP. Combined with tags and folder structure, the level of information that can be attached to a single file is huge. Plus it supports versioning and most of the standard doc management features.

    The software is out there. Very few companies know how to value it internally or charge for it externally. It takes effort on the part of the asset creator to append meaningful information to the file as well as a group to manage the fit within the taxonomy. Most systems I’ve seen are not terribly difficult to install but they require constant attention to detail and maintaining tags and schema.

    Next time you wander through Minneapolis, you might want to stop in for a demo of Mediabeacon. It might change your mind on managing the chaos of “unstructured” data.

  • The best piece of software I’ve seen for managing unstructured data is here: http://www.thebrain.com It’s a commercial product primarily aimed at, I think, sales professionals. In its current state it is aimed at only small bits of unstructured data from a single visibility level, so I’m not thinking it solves the problem, but it is an interesting approach and might be worth a look.


  • Hi Bob,

    I definitely concur. I led a project a few years ago to implement an OLAP database for planning data, and it occurred to me that it would be really powerful to store files in a similar structure. A company could define dimensions, have the relevant entities in the organization own the dimensions, and just go to town.

    As Roger says, you still have to have the discipline to actually populate the data, but without some reasonable structure nobody will populate it.

    Chicken or egg?


  • Why didn’t you mention the fact that we can already locate information hidden in unstructured data (documents) regardless of where they are physically stored? Consider that there isn’t a need to store resume data in a database or folder structure. Thanks for thinking outside the folder – Mark

  • The key to any data management system is actually managing the data. Regular relational inventory systems are usually a mess because of poor design, a lack of data management (updating), or both.

    A well-thought search could collect meaningful counts of certain types of knowledge but only if the directory or domain it searched was restricted to ‘useful’ documents.

    I think a well-organized collection of directories of files-that-are-important along with a search guru and private google could serve up useful insight.

    Note that managing the files is not the same as managing the work. If your work itself is unstructured, such as no one knowing which is the final version of truth, then you have other issues that will not be solved by any sort of database.

    Writing a new inventory system doesn’t fix the old one if the only problem with the old one is that no one keeps anything updated. All implementing a new inventory dbs does is force you to get an accurate count at implementation time.

    Unstructured data within files is a different problem than ungrouped, unorganized, unknown collections of files.

  • From a content management perspective, integration with Sharepoint is beginning to develop. EMC/Documentum is offering a couple of Sharepoint facing products which allow users to work in the familiar and more accessible Sharepoint UI but take advantage of the increased metadata and relationships resources of a content management system. It’s still not Knowledge Management, but it’s a step toward it in that continuum. Still has limitations in needing an enterprise level back end with attendant technical support.

  • Unstructured data is a bogeyman.

    There can be no such thing. No structure means chaos, maximal entropy, and concomitant lack of any meaning. Anything that has any meaning has at least one “structure”.

    Your post has structure.

    In fact, it has many overlapping structures. It has an intro-body-conclusion structure, it has a title-paragraph-sentence structure, it has a successive-nested-concepts structure.

    That’s the real problem; figuring out which structure to use, or being able to use multiple structures at the same time.

    This goes hand-in-hand with the industry-wide lack of understanding of the Relational Model. That model can represent any structure. What is a bit of an issue is the current implementations, the baggage they carry and the limitations they impose, since none fully implements Relational theory, as far as I know.

    Actually, the real problem is the people (a.k.a. vendors) who keep pushing the idea of unstructured storage. 🙂

  • Hi Bob – I’ve been working with one of the DM/ECM products that wraps categories, subcategories and other metadata around unstructured data objects for many years now. I’ve come to the conclusion that this, plus some motivation to categorize your data, plus a good search engine, is about the best we’re going to get short of a breakthrough in artificial intelligence. Building structure in an automated way from unstructured data would require creating and maintaining context- and domain-specific rulesets (among others) way beyond the skill of mere mortals like myself.

  • The “chart of accounts” analogy is a bit off–the concept you are driving toward is tagging. The key difference is that “categories” are themselves often too structured. Tags, on the other hand, are usually completely free-form.

    There is still the need for education to produce some form of consistency, but the beauty of tags is that so long as people err on the side of “over-tagging”, then everyone else will be able to find what they are looking for.

    My personal document management solution involves tagging everything, then using a tag navigator (as a Mac user, I prefer Yep! and Leap!).

    • We appear to have irreconcilable differences. The Chart of Accounts approach is exactly what I want. Here’s what happens with free form tags: One user applies the free-form tag “Happy,” a second user applies the free-form tag “Glad,” and the system classifies these two clearly related pieces of unstructured content as having nothing to do with each other.

      Avoiding this problem is what the Chart of Accounts does for accounting transactions – it makes sure like transactions all fall in the same place.

      • I used to have exactly the solution you are seeking. She was called a “secretary” and she had a bank of file cabinets. Her special talent was that she could find anything in that mess (and she kept everything, too). Of course, the world was smaller, we didn’t have email or word processors, it was a lot of work to write a memo and you carefully considered whether it was necessary.

        So, would you accept a chart of accounts in which a single document could appear in multiple accounts? (That seems to be what we want, i.e., one document appearing in multiple folders.) What it seems like you’re seeking is a way to force the “tags” into some common structure. Right?

        Simplistic approach: start with the company organization chart, the employee list, the customer list, the supplier list, the product list, etc. Create your initial chart(s) of accounts, in hierarchical form, from these. Whenever a user saves a document (email, Word, Excel, Powerpoint, …), invoke a utility that presents those lists as checkboxes. Some simple semantic analysis could precheck some of those “accounts,” e.g., by simply searching for those words (this would be most effective if it occurred as the document was being written/edited, like spellcheck). Allow new “accounts” to be added by anyone. It will soon be a mess, but it will reasonably capture the topics of interest to the organization. Trim the (display) tree regularly based on actual usage, i.e., number of documents tagged with that account. Among other things, this will identify orphans. Older documents won’t be tagged with newer accounts, but it’s a start.

        Ummm, that’s as close as I can get to replicating whatever magic that secretary had…

      • I’m mostly with Bob on this. In IT, we are advising users on how to create categories (Chart of Accounts) for tags/metadata within SharePoint. That allows us to build views of the documents that address the needs of multiple user roles.

        While we can’t possibly anticipate every need, we’ve found that even this simple exercise addresses 90% of the need. Compared to unmanaged resources like shared drives, simple knowledge management can quickly and cheaply provide enormous benefits.

        What’s surprising to me is that we build client-facing systems like this for clients in the Fortune 500 and their research departments don’t have these systems and are asking us for expertise and consulting.

  • I’ve worked on tackling the issue of managing unstructured data several times in my career. Enterprise Content Management systems (there are several commercial products and open source projects) attempt to address the need with varying approaches, but most use a metadata approach. This method is also in many cases supplemented by some automated tooling that provides inspection into known file types and attempts to infer and populate metadata from the file.

    However my experience has been that the greatest challenge comes from defining an appropriate taxonomy for the metadata. And additionally how to evolve the taxonomy over time without losing the content already captured and classified.

    What I think seems to work best is a combination approach: having a defined and managed taxonomy with some mandatory and many optional (but still predefined) elements; and also allowing for a purely user-defined, unbounded set of “tags”.

    Now I understand that this is primarily seen as a mechanism to support search, but before I’m accused of misunderstanding the gist of your column, I believe that once there is a critical mass of content that is properly classified and tagged, analysis of the metadata itself can lead to many interesting and informative observations. Distributions can lead to a better understanding of resource and effort allocation in the organization. User managed tags analysed against the time dimension can lead to an understanding of interest and trends. And ultimately, humans consuming the actual content that the metadata leads them to is the primary value.

    Having said all that, I’ve yet to see an enterprise that exhibits the discipline to implement and leverage a metadata-driven ECM approach universally. There must be some out there; maybe someone here has done it and reaped the benefits (or proved my assumptions wrong!).

  • There are some things that aren’t “manageable” in the sense that you mean. All you can really do is kind of contain them.

    I work in a library. You’d think that a box of Librarians could agree on how to organize and store all of their unstructured data, but the truth is unless you have some Directive From Above that We Will Use $Metadata_format, it won’t happen. And even then, items that don’t fit will still be generated and stored elsewhere.

    The organizing of unorganizable data schemes that I’ve seen revolve around training software or training people. We’ve kind of done both here, not so much with document standards, but by creating a culture of documentation. We don’t mandate, but we’ve trained folks to store their important stuff on the network. (Some have learned the hard way)

    We are just now moving all of our disparate “intranet” storage to one Sharepoint site. We had a couple of WSS version 2 sites, a Frontpage Frankenstein, and a mess of shared drives.

    The shared drive cannot completely go away. There are file types that Sharepoint doesn’t handle, and some that the data center hosting our sharepoint won’t let us upload.

    Also, there’s just not always a need to have everything structured just so. A temporary folder for a few folks doing research is sufficient for their purposes, and if it turns into a production deal, the information can move to its own site or folder.

    The only thing keeping unstructured data from getting completely out of hand is an agreement that whoever owns the data will cull it every so often. We try to get every data owner to answer a series of questions about their data and what they’re going to do with it. This helps focus people on what they’re doing and why.

    For orphaned stuff, usually a manager or IT will make an inquiry about files more than X years old and either nuke it or move it elsewhere.

    Here’s some overly structured data:


    • Wow. A 26 page recipe for oatmeal cookies. I bet they taste horrible. Of course, I had a similar problem when I attempted to document my grandmother’s recipe for oatmeal cookies — throw a bunch of oatmeal in a bowl, add some salt, sprinkle with flour, toss in some nuts (whatever kind you have, but walnuts are good, or substitute raisins), then … Her cookies were fantastic and mine taste like cardboard 🙂

  • This is something I’ve been thinking about for awhile. The best thing I can come up with is something like Nepomuk ( http://nepomuk.semanticdesktop.org/xwiki/bin/view/Main1 ) using a Directory (the old hammer-nail effect) to act as the utility to hold the attributes. Hook it into something like the Semantic Web (yeah, yeah, I know) and it may become useful. However, the real issue is defining the constraints for the knowledge space, that’s going to take some degree of consensus and cooperation.

    Fat chance if humans are involved.

  • In 1988, Lotus Development released an amazing product for managing unstructured data called “Agenda.” It was marketed as a “Personal Information Manager” but was actually capable of doing far more complex tasks than parsing “a week from Thursday” into an actual date you could sort.

    I was asked to develop some sample applications for the launch to demonstrate some of its more advanced analytical capabilities. I used Agenda to create a model of the United States Constitutional Convention that found “The Great Compromise of 1787” solving Congressional representation. I also used it to analyze published stock market analysis in the 1920’s that predicted the Crash of 1929.

    Agenda ran on a PC with 640k!! of RAM under DOS 2.0. Its success was limited by the hardware available at the time and the utter novelty of what it could do. Lotus soon abandoned it for a more conventional PIM and eventually released it to the public domain. You can still download it and try it yourself. Agenda is the most impressive piece of computer software I have encountered and should be a model for any future Unstructured Information Manager.

  • Actually, librarians have been working on this since the invention of the computer. They’ve invented all sorts of searching and categorization tools (and even international standards) for everything.

    It’s a shame that IT people are reinventing the wheel when it’s available at the nearest public library…

    • 🙂 See my above on the box of Librarians. It’s true they’ve been working on it, but between Dublin Core, MARC, and half a dozen other schemes, there’s always going to be something that won’t fit just so, and the Catalogers I know are all about making things fit just so.

      Now since this article came out, I’ve been looking up “chart of accounts” and all that goes with it. This guy’s thing:


      looks for all the world like designing an RDBMS. All this time we’ve been doing accounting? 🙂

      Anyway, rereading the second half of the article, talking about “who”, “when”, and “what”, it sounds like really what an organization needs is a communications plan.

      Multiple editors of a document agree beforehand that they’ll share it in email, lock it in Sharepoint before editing, or use a naming convention whenever they edit.

      Scaling something like that to an enterprise level is not worth the time it would take, really. Not everyone communicates the same way. Getting a group larger than three to agree on any plan is an accomplishment. At most, you might get a department of 50 to do things in a general way: naming structure, locking files, or email.

      (And a lot of times, that “agreement” is “the boss says so”)

      On the Chart of Accounts vs tags or even semi-structured tagging: there’s still the problem of miscategorization, and who is it that does the education? It doesn’t sound far removed from the professional staff.

      At least with semi-structured tagging, you have a list and can use a close match algorithm to find “glad” and “happy” things.

  • The tricky part will be moving beyond the venerable folder tree. The good news: We have all the tools we need, in the form of an ability to create multiple sets of categories and sub-categories, and to assign an unlimited number of categories and sub-categories to any file.

    My first thought is that Gmail has already done this, using labels instead of folders. It took a little getting used to but is quite powerful. Now, if they would only add sub-labels.

  • The meta-data process you describe is adequate for much “unstructured” data. The folder-tree metaphor, however, should not be abandoned. It provides a working model that is familiar to users, as long as they help define the category/subcategory/identifier elements (3-parts, all required) that comprise the meta-data. In fact, the meta-data should be tapped to build the tree during the “browse the tree” process.

    We faced this problem 15 years ago when implementing a document management system for compliance with chemical plant process safety. Of course, almost immediately, the need to support ISO 9000 appeared and added a new layer of meta-data requirements. We call this our “controlled document” environment, as opposed to the free-for-all on network file shares. We also have a team to manage the filing of new items. Local updaters can add versions to previously filed items.

    A word of caution. Adding 5 different ways to tag a document with the 3-part meta-data elements does not add 5 times the value to the business. Be realistic. Our business rule is that a document should have one primary 3-part assignment to support the tree (just as only one filing cabinet, drawer, and folder can hold the paper equivalent), while secondary meta-data collections support the advanced search tool. This puts clear scope on the programming and keeps us from trying to be all things to all people. It also provides low-level users a simple way to find a required document. High-level users can use advanced search for meta-data, and content-based retrieval for anything else.

    Also, what you said about “search” not being meta-data, I add “what he said, times two”. Leave “search” to the plaintiff lawyers; organize your knowledge to fit the normal business needs.
