It’s hard to make data easy to analyze
It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.
Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly difficult. Major claims, and some technologies that make them, include:
- “We get data into a form in which it can be analyzed.” This is the story behind, among others:
- Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
- Many things that purport to be “analytic applications” or data warehouse “quick starts”.
- “Data reduction” use cases in event processing.*
- Text analytics tools.
- Splunk.
- “Forget all that transformation foofaraw — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
- Relational DBMS, according to their inventor E. F. Codd.
- MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
- Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
- Newer kinds of analytic DBMS that are faster than older kinds.
- The “data mart spin-out” feature of certain analytic DBMS.
- In-memory analytic data stores.
- Hadoop.
- NoSQL DBMS that have a few analytic features.
- TokuDB, similarly.
- Electronic spreadsheets, from VisiCalc to Datameer.
- Splunk.
- “Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
- The business intelligence industry.
- The predictive analytics industry.
- Algorithmic trading use cases in complex event processing.*
- Some analytic applications.
- Splunk.
*Complex event/stream processing terminology is always problematic.
My thoughts on all this start:
- There are many possibilities for the “right” way to manage analytic data. Generally, these are not the same as the “right” way to write the data, as that choice needs to be optimized for user experience (including performance), reliability, and of course cost.
- I.e., it is usually best to move data from where you write it to where you (at least in part) analyze it.
- Vendors who suggest they have a complete solution for getting data ready to be analyzed are … optimists.
- This specifically includes “magic data stores”, such as fast analytic RDBMS (on which I’m very bullish) or in-memory analytic DBMS (about which I’m more skeptical). They’re great starting points, but they’re not the whole enchilada.
- There are many ways to help with preparing data for analysis. Some of them are well-served by the industry. Some, however, are not.
Further:
1. There are many terms for all this. I once titled a post “Data that is derived, augmented, enhanced, adjusted, or cooked”. “Data munging” and “data wrangling” are in the mix too. And I’ve heard the term data preparation used several different ways.
2. Microsoft told me last week that the leading paid-for data products in their data-for-sale business are for data cleaning. (I.e., authoritative data to help with the matching/cleaning of both physical and email addresses.) Salesforce.com/data.com told me something similar a while back. This underscores the importance of data cleaning/data quality, and more generally of master data management. (A minimal sketch of what such matching can involve appears after this list.)
Yes, I just said that data cleaning is part of master data management. Not coincidentally, I buy into the view that MDM is an attitude and a process, not just a specific technology.
3. Everybody knows that Hadoop usage involves long-ish workflows, in which data keeps getting massaged and written back to the data store. But that point is not as central to how people think about Hadoop as it probably should be.
4. One thing people have no trouble recalling is that Hadoop is a great place to dump stuff and get it out later. Depending on exactly what you have in mind, there are various metaphors for this, most of which have something to do with liquids. “Big bit bucket” is the most famous, but “data refinery”, “data lake”, and “data reservoir” have also been used.
5. For years, DBMS and Hadoop vendors have bundled low-end text analytics capabilities rather than costlier state-of-the-art ones. I think that may be changing, however, mainly in the form of Attensity partnerships.
Truth be told, I’m not wholly current on text mining vendors — but when I last was, Attensity was indeed the best choice for such partnerships. And I’m not aware of any subsequent developments that would change that conclusion.
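To make point 2 above slightly more concrete, here is a minimal sketch of what email/address matching and cleaning can involve, using only the Python standard library. The reference list, normalization rules, and similarity threshold are purely illustrative assumptions; commercial data cleaning products and the authoritative reference datasets mentioned above go far beyond this.

```python
# Minimal, illustrative sketch of the kind of matching/cleaning referred to in
# point 2 above: normalize email addresses and fuzzily match postal addresses
# against a (hypothetical) authoritative reference list.
import re
from difflib import SequenceMatcher

def normalize_email(raw: str) -> str:
    """Lowercase, strip whitespace, and drop a '+tag' suffix in the local part."""
    local, _, domain = raw.strip().lower().partition("@")
    local = local.split("+", 1)[0]
    return f"{local}@{domain}"

def normalize_address(raw: str) -> str:
    """Crude postal-address canonicalization: casing, punctuation, common abbreviations."""
    s = re.sub(r"[.,]", "", raw.strip().lower())
    s = re.sub(r"\s+", " ", s)
    abbreviations = {"street": "st", "avenue": "ave", "road": "rd", "suite": "ste"}
    return " ".join(abbreviations.get(word, word) for word in s.split())

def best_reference_match(address: str, reference: list, threshold: float = 0.85):
    """Return the closest entry in an authoritative reference list, if it is close enough."""
    cleaned = normalize_address(address)
    scored = [(SequenceMatcher(None, cleaned, normalize_address(r)).ratio(), r) for r in reference]
    score, match = max(scored)
    return match if score >= threshold else None

if __name__ == "__main__":
    print(normalize_email("  Jane.Doe+newsletter@Example.COM "))  # jane.doe@example.com
    reference = ["123 Main St, Springfield", "45 Oak Avenue, Shelbyville"]
    print(best_reference_match("123 Main Street, Springfield", reference))
```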
Related links:
- Merv Adrian’s contrast between Hadoop and data integration tours some of the components of ETL suites. (February, 2013)
- Part of why analytic applications are usually incomplete is the set of issues discussed in this post.
- De-anonymization is an important — albeit privacy-threatening — way of making data more analyzable. (January, 2011)
- I updated my thoughts on Gartner’s Logical Data Warehouse concept earlier this month.
6 Responses to “It’s hard to make data easy to analyze”
Great post Curt, thanks for this. I hear exactly the same thing from organizations and system integrators. I put a comment out on Merv Adrian’s blog post you referenced above about some of these problems.
Joining data, especially where both sides of the join are large, is very difficult in a distributed environment like Hadoop. I hear this time and time again, where MR developers have had to write hundreds of lines of complex Java code to get it to work.
As mentioned in my response to that blog post (and I remember we discussed this some time ago), ETL is a very common use case in Hadoop. You state above that “Hadoop is a great place to dump stuff”, but if you want to operate on it and/or extract it, you need to perform ETL (filtering data, joining data, etc.).
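For concreteness, the pattern behind much of that hand-written Java is the classic reduce-side join: mappers tag each record with its source table, and the framework’s shuffle groups records by join key so a reducer can emit the joined rows. Below is a minimal sketch that simulates the pattern in plain Python; the table names, fields, and explicit sort/group step are illustrative assumptions, not anyone’s production code.

```python
# Illustrative simulation of a MapReduce "reduce-side join" in plain Python.
# In Hadoop, map() and reduce() would run as separate tasks, and the sort/group
# step below would be performed by the framework's shuffle phase.
from itertools import groupby
from operator import itemgetter

orders = [("cust1", "order42"), ("cust2", "order43")]   # (customer_id, order_id)
customers = [("cust1", "Alice"), ("cust2", "Bob")]      # (customer_id, name)

def map_phase():
    """Tag each record with its source so the reducer can tell the sides apart."""
    for cust_id, order_id in orders:
        yield (cust_id, ("order", order_id))
    for cust_id, name in customers:
        yield (cust_id, ("customer", name))

def reduce_phase(mapped):
    """Group by join key (normally done by Hadoop's shuffle), then emit joined rows."""
    for cust_id, group in groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0)):
        values = [v for _, v in group]
        names = [v for tag, v in values if tag == "customer"]
        order_ids = [v for tag, v in values if tag == "order"]
        for name in names:
            for order_id in order_ids:
                yield (cust_id, name, order_id)

if __name__ == "__main__":
    for row in reduce_phase(map_phase()):
        print(row)  # e.g. ('cust1', 'Alice', 'order42')
```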
[…] be analyzed, it must first be prepared (consolidated, cleansed, munged, etc.). Curt Monash wrote a great post about this […]
Making data available in the right format at the right time for analysis is a great topic, and one that deserves more discussion. Thanks for pointing it out.
Just an observation – I agree MDM is an attitude and a process, and I believe that data quality is both a prerequisite to MDM (standardization of reference data to enable master data matching) and is enabled by MDM (using master data as a reference validation to avoid the introduction and propagation of poor data quality).
However, data quality is not a subset of MDM – transaction data that doesn’t fall into the domain of MDM can and should conform to data quality standards and processes, and may or may not be enhanced by MDM. Sometimes the MDM vendors are overselling MDM as a panacea for all enterprise data quality and transformation needs when there are more cost-effective, data-quality-specific or ETL solutions.
Ideally MDM should be taking place on the front-end of operational systems so that data quality is enforced before it enters the warehouse or analytical stream, not as a “fix” in preparation for analysis.
[…] ETL reduction/elimination (Extract/Transform/Load) is a major need. […]
[…] acknowledge that the right data to analyze may not be exactly what was initially stored. Data munging/wrangling/cleaning/preparation is often a big deal. Complicated forms of derived data can be important […]
[…] Monash asserts, “It’s hard to make data easy to analyze.” [DBMS2, 13 February 2013] He continues, “There are many ways to help with preparing […]