I went to Las Vegas and lost ten dollars – on a Coke machine.—Garry Shandling
Year: 2024
Data Warehouse Renaissance
Data warehouse technology is going through a bit of a renaissance, with more and better options for hosting, performance, accessibility, and data transformation and processing, at lower cost and with fewer headaches. Is it perfect? No, but at this point, the gaps are more about us humans than the technology.
Call it data warehousing’s third wave. The first wave relied on rigidly structured data warehouses, sometimes packaged into multiple smaller “data marts.” The second wave used hyperscale architectures to support “schema on demand” analytics without the first wave’s up-front detailed planning.
The third wave hasn’t fully arrived, but we can anticipate it: using artificial intelligence to automate the schema-on-demand data-structuring and filtering process, greatly increasing data warehousing agility.
Assume, for the sake of argument, that you’re partway between the second and third waves. Let’s talk about some reasonable planning considerations to make adoption and the transition easier.
Governance—We are still talking about an enterprise system that everybody will be using and benefiting from. More so: your collective data warehouses have, in the aggregate, a broader array of stakeholders than anything else in your portfolios. So, you are going to be dealing with your colleagues a lot, and there may be a lot of reasonable (and unreasonable) questions and requests.
The best thing to do is to get ahead of it and come up with a governance mechanism that works for your organization. This governance body needs to be in reasonable agreement as to the budget, scope, serviceability and needs. Bonus points if you already have this worked out in your organization! You are well ahead of the problem. More bonus points if your governance solution recognizes the distinction between committees and councils. More bonus points if you have a single governance solution for each architectural grouping, as opposed to separate governance for each application, suite, and platform family.
Data ownership – One of the topics we alluded to last week is data confidentiality, along with user roles and rules within this shared system. All the leaders need to respect each other’s need to publish their data to the rest of the company WHEN AGREED UPON, but not before. Nothing destroys trust like using half-completed data to make assumptions about your colleagues’ work. Strict management of user roles and access is incredibly important for building and maintaining trust.
The tricky part is finding the right balance between supporting “ownership” (or, better, “stewardship”) and embracing the dysfunction of organizational silos.
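To make the "when agreed upon, but not before" rule concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the Dataset fields, the can_read function, and the team and role names are hypothetical and do not come from any particular warehouse platform, which will have its own access-control features.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical catalog entry for one shared dataset."""
    name: str
    steward_team: str            # the team that stewards (and publishes) the data
    published: bool = False      # flipped only when the steward agrees to release
    reader_roles: set = field(default_factory=set)  # roles allowed to read after release


def can_read(dataset, user_team, user_roles):
    """Return True if this user may read the dataset.

    Stewards can always see their own data. Everyone else is blocked
    until the dataset is published, and after that needs a permitted role.
    """
    if user_team == dataset.steward_team:
        return True
    if not dataset.published:
        return False
    return bool(dataset.reader_roles & set(user_roles))


# Usage: a forecast that finance is still working on.
forecast = Dataset("q3_forecast", steward_team="finance", reader_roles={"analyst"})
sales_sees_draft = can_read(forecast, "sales", ["analyst"])      # blocked while unpublished
forecast.published = True
sales_sees_release = can_read(forecast, "sales", ["analyst"])    # allowed after release
```

The design choice worth noticing is that publication is a separate switch from role membership, so the steward decides when the data is ready and the roles decide who sees it afterward.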
Data quality – We cannot build a trusted single source of truth with garbage data coming in. We must insist that every system that contributes data provides data that are:
Clean (passes bounds checks, makes logical sense, and reflects the data in the transactional system)
Complete (the warehouse is getting all of the data from the transactional system)
Documented enough (so that it can be found and used appropriately, and not outside of its own limits)
Statistically legitimate (tested for flaws such as autocorrelation, heteroskedasticity, and insufficient sample size)
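As a rough illustration of how some of these checks might be automated, here is a minimal sketch in Python using only the standard library. The function names, the row layout (a list of dictionaries), and the min_rows threshold are all assumptions made for the example. It covers bounds, completeness, agreement with the source, a crude sample-size floor, and a lag-1 autocorrelation estimate; it does not test for heteroskedasticity, and it says nothing about documentation.

```python
from statistics import mean


def check_batch(source_rows, loaded_rows, key, field, low, high, min_rows=30):
    """Compare rows loaded into the warehouse against the transactional source.

    Returns a dict of findings; empty lists and False mean the check passed.
    """
    loaded = {row[key]: row for row in loaded_rows}
    return {
        # Complete: every source record made it into the warehouse.
        "missing": [r[key] for r in source_rows if r[key] not in loaded],
        # Clean: loaded values still match the transactional system.
        "mismatched": [
            r[key] for r in source_rows
            if r[key] in loaded and loaded[r[key]][field] != r[field]
        ],
        # Clean: values fall inside the agreed bounds.
        "out_of_bounds": [r[key] for r in loaded_rows if not low <= r[field] <= high],
        # Statistically legitimate: a crude floor on sample size.
        "too_few_rows": len(loaded_rows) < min_rows,
    }


def lag1_autocorrelation(values):
    """Lag-1 sample autocorrelation; returns 0.0 for a constant series."""
    m = mean(values)
    denom = sum((v - m) ** 2 for v in values)
    if denom == 0:
        return 0.0
    num = sum((values[i] - m) * (values[i + 1] - m) for i in range(len(values) - 1))
    return num / denom


# Usage with made-up order data: record 3 never arrived, record 2 was corrupted.
source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}, {"id": 3, "amount": 30.0}]
loaded = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": -5.0}]
report = check_batch(source, loaded, key="id", field="amount", low=0, high=1000, min_rows=2)
```

What counts as "in bounds" or "enough rows" is a business decision, which is one more reason to settle it in governance rather than in code.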
Metadata – We live in 2024. If you have to think too hard about how to manage metadata, you have the wrong platform. That doesn’t mean you’re actually managing metadata, though. That’s a process question. Please consider the need for metadata management to be your technical tip of the week.
Platform and hosting – There are great cloud options, managed by companies valued in the billions. There are also really creative solutions like Doris that are open source and can be hosted wherever. I think the real question is how mature your organization’s ability to support this mission-critical system is. Look at yourself in the mirror for a moment and ask whether your team can support your data warehousing platforms internally, whether you have a good Enterprise Managed Services partner, or whether you feel you can work with a provider that truly manages this mission for you as a service.
Machine learning and generative AI – Opinion: while these will become your data warehouse’s most important consumers sometime in the indefinite future, they won’t be ready for prime time until explainable AI has matured.
Build with the end in mind – Perhaps the most important consideration is to know what kinds of decisions people (leaders as well as staff) are trying to make. If you know what kinds of decisions need to be made, you can start to offer options regarding the presentation of the data and, more importantly, the synthesized information that best helps people make those decisions.
Once you know these points, you can make intelligent decisions about the schema and performance optimizations, how often the upstream systems need to update your data warehouse, which data elements are necessary and where they come from, and, finally, how to test the quality of the data coming in.
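One way to picture this working-backwards step is as a small derivation: list the decisions, note which data elements each one needs and how stale that data can be, then roll that up into what each upstream system has to deliver. The Python sketch below does only that. The Decision class, the derive_requirements function, and the example decisions and system names are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """A decision someone needs to make, and the data it depends on."""
    name: str
    elements: dict              # data element -> upstream system that supplies it
    max_staleness_hours: int    # how old the data can be and still be useful


def derive_requirements(decisions):
    """Roll decisions up into per-system feed requirements.

    For each upstream system, collect every element any decision needs
    and keep the tightest (smallest) refresh interval requested.
    """
    feeds = {}
    for decision in decisions:
        for element, system in decision.elements.items():
            feed = feeds.setdefault(
                system,
                {"elements": set(), "refresh_hours": decision.max_staleness_hours},
            )
            feed["elements"].add(element)
            feed["refresh_hours"] = min(feed["refresh_hours"], decision.max_staleness_hours)
    return feeds


# Usage with two made-up decisions.
decisions = [
    Decision("reorder stock", {"on_hand_qty": "inventory", "daily_sales": "pos"}, 4),
    Decision("set quarterly budget", {"daily_sales": "pos", "headcount": "hr"}, 168),
]
requirements = derive_requirements(decisions)
```

In this made-up case the point-of-sale feed ends up on the four-hour cadence because the reorder decision needs it, even though the budget decision alone would tolerate weekly updates.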
Working backwards (in this case) turns out to be going forwards, actually.