Unstructured data design – a modest proposal

If there’s one certainty in our business, it’s that useful, lightweight frameworks turn into bloated, productivity-destroying methodologies.

And so it was with considerable trepidation last week that I suggested we need another methodology, to do for Content Management Systems (CMSs — the technologies we use to manage unstructured information) what normalization and related techniques do for relational database management systems (“Unstructured data design — the missing methodology,” KJR, 5/17/2010).

But we do. As evidence, I offer many of the comments and e-mails I received suggesting we don’t: Most pointed to the existence of well-developed tools that allow us to attach metadata to “unstructured content objects” (documents, spreadsheets, presentations, digital photos, videos and such, which we might as well acronymize now and have done with it: For our purposes they’re now officially “UCOs”).

Many more pointed out that search obviates the need for categorization.

Let’s handle search first, because it’s easier: Search is Google listing 16,345,321,568 UCOs that might have what you’re looking for. It’s also what leads to (for example) searches for “globe,” “sphere,” “orb,” and “ball” yielding entirely different results.

Search is what you do when you don’t have useful categories.

And then there’s metadata — the subject that proves we don’t have what we need. Because while we have the ability to attach metadata to UCOs, we have only ad hoc methods for deciding what that metadata should be.

Some readers suggested this might be a solved problem. Books are UCOs, and librarians have been categorizing them for centuries. Between the Dewey Decimal System and the Library of Congress Classification, surely there’s a sound basis on which to build.

Maybe there is. I’m skeptical but not knowledgeable enough to state with confidence they won’t work. I’m skeptical because their primary purpose is to place books in known locations in the library so they can be readily found, which means they’re probably similar to the single folder trees we need to move beyond.

What we need, that is, is the ability to place one UCO in as many different locations as anyone might logically expect to find it. To use one of my own books as an example, Leading IT: The Toughest Job in the World would fit in at least these categories: Leadership, Information Technology, Staffing, Decision-making, Motivation, Culture change, and Communication skills.
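To make the idea concrete, here is a minimal Python sketch of filing one UCO under many categories at once. It is an illustration under stated assumptions, not a description of any particular CMS: the `CategoryIndex` class and its `file_under`, `find`, and `categories_of` methods are hypothetical names invented for this example.

```python
from collections import defaultdict


class CategoryIndex:
    """Hypothetical many-to-many index between UCOs and categories."""

    def __init__(self):
        self._by_category = defaultdict(set)  # category -> UCOs filed there
        self._by_uco = defaultdict(set)       # UCO -> categories it lives in

    def file_under(self, uco, *categories):
        """File one UCO in every category where someone might look for it."""
        for category in categories:
            self._by_category[category].add(uco)
            self._by_uco[uco].add(category)

    def find(self, category):
        """Return the UCOs filed under a category, sorted for stable output."""
        return sorted(self._by_category.get(category, ()))

    def categories_of(self, uco):
        """Return every category a UCO has been filed under."""
        return sorted(self._by_uco.get(uco, ()))


index = CategoryIndex()
index.file_under(
    "Leading IT: The Toughest Job in the World",
    "Leadership", "Information Technology", "Staffing", "Decision-making",
    "Motivation", "Culture change", "Communication skills",
)

print(index.find("Staffing"))
print(index.categories_of("Leading IT: The Toughest Job in the World"))
```

The mechanics are trivial, which is rather the point: the tool is easy, and deciding what the categories should be is the part nobody has a method for.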

The official name for the discipline of defining knowledge domains … what we’re trying to do … is ontology. It’s an active area of development, including the creation of standards (such as OWL, which puzzlingly stands for “Web Ontology Language” instead of “Ontology Web Language,” but let it pass).

From what I’ve been able to determine, though, it appears everything being developed thus far falls under the heading of tools, with a useful methodology nowhere in sight. This is, perhaps, unsurprising as it appears philosophers have been discussing the subject at least since Aristotle first introduced it 2,350 years ago or so, without yet arriving at a consensus.

It could be a while.

Of course, philosophers are obliged to develop systems so universal they apply, not only to our universe, but to all possible universes. Such is the nature of universal truth.

We don’t need to be quite so ambitious. We merely need to categorize information about our businesses. To get the ball rolling, I’ll offer up the framework we’ve synthesized at IT Catalysts. It enumerates ten topics that together completely describe any business — five internal and five external. They are:

Internal

  • People: The individual human beings who staff a business.
  • Processes: How people do their work.
  • Technologies: The tools people use to perform the roles they play in business processes.
  • Structure: Organizational structures, facilities, governance, accounting, and compensation — how the business is put together and interconnected.
  • Culture: The learned behavior people exhibit in response to their environment, and the shared attitudes that underlie it.

External

  • Products: Whatever the business sells to generate profitable revenue.
  • Customers: Whoever makes or influences buying decisions about the products a business sells.
  • Pricing: What the business charges for its products, terms and conditions of purchase, and the underlying principles that lead to them.
  • Marketplace: The business “ecosystem” in which the company exists, including customer groupings, competitors, partners, and suppliers.
  • Messages: How and what the business communicates with its marketplace.

There you go — a free gift, if you’ll forgive the redundancy. Just break these topics down into sub-topics and sub-sub-topics. The result should be a workable classification scheme.

Let me know when you’re done.
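For anyone inclined to start, the breakdown might be sketched along these lines in Python. The ten top-level topics come from the framework above. Everything else is an assumption made for illustration: the function names `build_scheme`, `add_subtopic`, and `all_paths`, and the sample sub-topics such as "Security".

```python
# The ten topics from the framework above, grouped as internal and external.
TOP_LEVEL = {
    "Internal": ["People", "Processes", "Technologies", "Structure", "Culture"],
    "External": ["Products", "Customers", "Pricing", "Marketplace", "Messages"],
}


def build_scheme(top_level):
    """Return a nested dict: side -> topic -> (initially empty) sub-topic tree."""
    return {side: {topic: {} for topic in topics}
            for side, topics in top_level.items()}


def add_subtopic(scheme, path):
    """Add a sub-topic by full path, e.g. ("Internal", "People", "Skills").

    The first two elements must name an existing side and topic; anything
    else raises KeyError, so the ten top-level topics stay fixed.
    """
    side, topic, *rest = path
    node = scheme[side][topic]
    for name in rest:
        node = node.setdefault(name, {})
    return node


def all_paths(scheme, prefix=()):
    """Yield every node in the scheme as a tuple of names, parents first."""
    for name, children in scheme.items():
        path = prefix + (name,)
        yield path
        yield from all_paths(children, path)


scheme = build_scheme(TOP_LEVEL)
# Hypothetical sub-topics: the same subject can appear under several trees.
add_subtopic(scheme, ("Internal", "Processes", "Security", "Access reviews"))
add_subtopic(scheme, ("Internal", "Technologies", "Security", "Firewalls"))

for path in all_paths(scheme):
    print(" > ".join(path))
```

Nothing here says which sub-topics are the right ones. The sketch only shows that the same subject can sit in more than one tree without breaking the scheme.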

Comments (12)

  • Bob,

    Where would security fall within your classification scheme? By that, I mean protecting the company from loss (data loss, IP loss, financial loss, customer loss)? Often security is done as an afterthought. If it were designed into the system in the first place, it might be less costly to implement and operate. This is especially true in IT Security where software security flaws are patched by placing a firewall between the vulnerable systems and the world instead of fixing the flaw. Besides, I am sure corporations as well as the military and government classify data according to how sensitive it is, yet many businesses don’t really track who sees their sensitive documents or where they are stored. Software repositories are often not locked down (e.g., Google and China). There are currently no IT Security products that can alert you if your database has been hacked via SQL injection. An IDS/IPS may or may not catch a malicious SELECT statement if it has the signature, but there is no product currently that can tell you that the database is compromised. It gives a new meaning to your UCO acronym.

    John

    • In our scheme, “security” ends up being a sub-topic in multiple trees – all five internal dimensions and at least two external ones.

      Internally, it has obvious Process and Technical aspects, a Structural component (that is, an organizational home, facilities consequences and so on); a knowledge-and-skills component (People), and depends a great deal on Culture.

      Externally, security applies to both the Customer and Marketplace domains as well, both because extranets are now commonplace, and because the Marketplace is where intruders reside.

  • Bob – I don’t think we need a new ontology methodology, because there already is one. Ontology is just the classification of things (objects, ideas, etc.). For example, a man is an adult male human, so the classification of man is just the intersection of the classification of adults, the classification of males and the classification of humans. Thus, all we need to define ontologies is to apply set theory. Or am I oversimplifying?

    • My opinion? When you say, “… all we need …” you’re oversimplifying, because “applying set theory” skips the hard work of defining the sets … the exact problem I’m suggesting we need to solve.
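The exchange above can be illustrated with a few lines of Python. The set members are made-up sample data (the names and who belongs where are assumptions for this sketch only).

```python
# Hypothetical membership data. Deciding what belongs in each set is the
# hard part; the intersection itself is one line.
adults = {"Alice", "Bob", "Rex"}
males = {"Bob", "Rex", "Timmy"}
humans = {"Alice", "Bob", "Timmy"}

# "A man is an adult male human": the intersection of the three sets.
men = adults & males & humans

print(men)
```

The set operation is the easy step. All the effort goes into the three assignments above it, which is the problem a methodology would have to address.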

  • Bob,

    The Dewey Decimal/Library of Congress systems aren’t necessarily bad starting points, because those systems have half of the puzzle defined for books; the other half is worked out in implementations specific to particular libraries.

    Look at it this way: a LOC number or Dewey Decimal Number does convey the location of the book in the library; but then any UCO system is going to have to identify where the object “is” (on what computer on the network, etc.). The DD/LOC systems assign those numbers based on the book’s primary topic, but that doesn’t mean other topics (categories, in your book example) can’t be included in the system. That’s where the implementation comes in.

    In the old days of card catalogs, each book typically had a minimum of three cards: One filed by title, a second filed by author, and a third filed by the primary topic of the book. Books often had multiple topic cards, akin to your example, if the library in question had room to store the cards (and the books!). On a network, space is a trivial issue, so your book could be cross-referenced under as many categories or topics as you desired – which is why online library catalogs are often more extensive than the card-based ones they replaced.

    For the people who say “search” covers it all – well, that’s fine for full-text indexing, although even that doesn’t always return all useful results and it often returns useless ones mixed in – hence SE and SEO company efforts to tweak the algorithms used and the information used in those algorithms. Further, as you noted, we’d want to catalog images, and an image file often contains no metadata at all, or only limited metadata (pixel depth, size, and date, perhaps; the file has no concept of what it’s an image OF). Google does a good job of constructing that information based on surrounding text in a web page, especially from ALT tags, but what if you’re cataloging 40,000 random images? Something has to tag them for content, and that has to be a person.

  • Bob:

    On the internal side, I would at least add “supplies” or inputs — some sense of what is transformed by the firm to produce its “products”. In some cases (say a firm that cuts/shapes metal) the inputs would be fairly simple; in other cases, such as most farms, they might be quite complex — water, air, soil, seed, fertilizer, et al.

    • In our scheme, suppliers are a subcategory of Marketplace. Raw materials are part of Product. Other supplies (office supplies, repair parts and so on) are part of Structure.

      It doesn’t have to be that way, of course. Categorization schemes aren’t right or wrong. They’re complete or incomplete; consistent or inconsistent; clear or fuzzy; useful or useless; but they aren’t right or wrong, so if establishing a separate top-level classification of Supplies works for you, that’s what matters.

  • Thanks :) I’ve been puzzling over how to add flexible lookup capability to a document imaging customization I am doing.

    The immediate, simple solution is to give users a big text field in the database where they can drop in multiple words that identify the document’s characteristics for later searching, grouping, and retrieval by category. (Note: in this app, all docs are tied to an individual by default, so all related characteristics of the individual in the db are imputed to the doc, and we track the origin: where it came from and who created it.)

    One big problem is that you can’t rely on accurate or consistent classification of any document’s characteristics, or on the terms used. It all depends on who thinks what about the doc when they define it.

    A refinement would be to predefine the allowable search terms (by category?), so that free-form entry is minimized but still allowed and most categories and terms are predefined.

    A further refinement might be to provide some ontological bucket fields as well. That would speed up retrieval by classification without preventing a global term search.

    But I think we need a methodology in which any UCO carries around its own defining characteristics in a readable form. That way you have less chance of getting a “nekked” UCO.

    So, in this implementation, I’m thinking of adding an XML descriptor block to the end of each UCO, followed by the XML length in a single 4-byte field. Any UCO-only reader can then reconstitute both the ontology and the content of the UCO, and recreate the entire package at the other end.
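A rough Python sketch of the trailer layout described in this comment follows. Several details are assumptions the comment does not specify: the big-endian unsigned encoding of the 4-byte length, the `<uco>` and `<field>` element names, and the function names `pack_uco` and `unpack_uco`.

```python
import struct
import xml.etree.ElementTree as ET

# Assumed encoding: 4-byte unsigned integer, big-endian. The comment only
# says "a single 4 byte field", so the byte order is a guess.
LENGTH_FORMAT = ">I"
LENGTH_SIZE = struct.calcsize(LENGTH_FORMAT)  # 4 bytes


def pack_uco(content: bytes, metadata: dict) -> bytes:
    """Append an XML descriptor block and its length to the raw content."""
    root = ET.Element("uco")  # hypothetical element names
    for key, value in metadata.items():
        ET.SubElement(root, "field", name=key).text = value
    xml_block = ET.tostring(root, encoding="utf-8")
    return content + xml_block + struct.pack(LENGTH_FORMAT, len(xml_block))


def unpack_uco(package: bytes):
    """Split a package back into (content, metadata) using the trailer."""
    if len(package) < LENGTH_SIZE:
        raise ValueError("package too short to hold a length field")
    (xml_length,) = struct.unpack(LENGTH_FORMAT, package[-LENGTH_SIZE:])
    xml_start = len(package) - LENGTH_SIZE - xml_length
    if xml_start < 0:
        raise ValueError("length field exceeds package size")
    root = ET.fromstring(package[xml_start:-LENGTH_SIZE])
    metadata = {f.get("name"): f.text or "" for f in root.findall("field")}
    return package[:xml_start], metadata


original = b"...raw bytes of a scanned document..."
package = pack_uco(original, {"category": "Leadership", "origin": "scanner-3"})
content, metadata = unpack_uco(package)
print(content == original, metadata)
```

Putting the length at the very end means a reader can find the descriptor without knowing anything about the content's own format. One caveat to check before relying on this: software that opens the file without knowing about the trailer will see extra bytes at the end, and whether it tolerates them depends on the file format.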

  • Bob:

    Just for clarification: the Library of Congress and Dewey are two examples of classification systems…to assist with the organization and retrieval of physical materials.

    However, librarians also catalog materials according to their subject(s) by using other tools such as the Library of Congress Subject Headings and Sears List of Subject Headings.

    By assigning multiple subject headings to an item, the number of “points of access” is increased when using finding aids (such as the old card catalog, now replaced by online catalogs). These catalogs, in essence databases themselves, allow for the all-powerful keyword search. Searchers are no longer required to be familiar with the controlled vocabulary of a formal classification system.

    The flaw in keyword searching is akin to your example of searching Google. You get plenty of results, but are they the results you wanted or expected? Show me the metadata!

    It’s my understanding that ontologies form a key element in the semantic web, which to me, takes us back to machine language. I’m not sure that making searching faster, or leaving categorization up to the machines makes for better results, but I am sure of the lesson learned from early automation: garbage in, garbage out.

    Bob, I hope you’re not saying that useful, lightweight frameworks turn into bloated, productivity-destroying methodologies as a result of the development of consensus about standards for metadata. Let the philosophers argue; talk to the librarians!

    • Not at all. I’m saying useful lightweight frameworks of all kinds always seem to expand over time to become ends in and of themselves. It’s much the same evolution that seems to afflict those who write policy manuals.

  • Hi Bob – I know of two other ways of looking at the universe, and they both involve six “domains,” if you will.

    Zachman is one at: http://www.zachmaninternational.com/index.php/the-zachman-framework

    The other is used at CSC (Computer Sciences Corporation) as part of their Catalyst methodology – Business Process, Organization, Location, Application, Technology and Data.

  • Aha, now I see what you’re getting at with your chart of accounts concept. We have something similar in our SLA: Global Mission, Local Mission, Business Operations.

    Mostly we use that to determine the priority of work orders as they come in. We’ve started expanding its use for project and program management.

    That’s internal to my department and only really enforced in my unit. About the only thing we’ve been able to do with other departments is offer suggestions on how people can think about their data.
