It’s hard to make data easy to analyze
It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.
Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly difficult. Major claims, and some technologies that make them, include:
- “We get data into a form in which it can be analyzed.” This is the story behind, among others:
- Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
- Many things that purport to be “analytic applications” or data warehouse “quick starts”.
- “Data reduction” use cases in event processing.*
- Text analytics tools.
- Splunk.
- “Forget all that transformation foofaraw — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
- Relational DBMS, according to their inventor E. F. Codd.
- MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
- Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
- Newer kinds of analytic DBMS that are faster than older kinds.
- The “data mart spin-out” feature of certain analytic DBMS.
- In-memory analytic data stores.
- Hadoop.
- NoSQL DBMS that have a few analytic features.
- TokuDB, similarly.
- Electronic spreadsheets, from VisiCalc to Datameer.
- Splunk.
- “Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
- The business intelligence industry.
- The predictive analytics industry.
- Algorithmic trading use cases in complex event processing.*
- Some analytic applications.
- Splunk.
*Complex event/stream processing terminology is always problematic.
My thoughts on all this start:
- There are many possibilities for the “right” way to manage analytic data. Generally, these are not the same as the “right” way to write the data, as that choice needs to be optimized for user experience (including performance), reliability, and of course cost.
- I.e., it is usually best to move data from where you write it to where you (at least in part) analyze it.
- Vendors who suggest they have a complete solution for getting data ready to be analyzed are … optimists.
- This specifically includes “magic data stores”, such as fast analytic RDBMS (on which I’m very bullish) or in-memory analytic DBMS (about which I’m more skeptical). They’re great starting points, but they’re not the whole enchilada.
- There are many ways to help with preparing data for analysis. Some of them are well-served by the industry. Some, however, are not.
Further:
1. There are many terms for all this. I once titled a post “Data that is derived, augmented, enhanced, adjusted, or cooked”. “Data munging” and “data wrangling” are in the mix too. And I’ve heard the term data preparation used several different ways.
2. Microsoft told me last week that the leading paid-for data products in their data-for-sale business are for data cleaning. (I.e., authoritative data to help with the matching/cleaning of both physical and email addresses.) Salesforce.com/data.com told me something similar a while back. This underscores the importance of data cleaning/data quality, and more generally of master data management. (A minimal sketch of what such matching can involve appears after this list.)
Yes, I just said that data cleaning is part of master data management. Not coincidentally, I buy into the view that MDM is an attitude and a process, not just a specific technology.
3. Everybody knows that Hadoop usage involves long-ish workflows, in which data keeps getting massaged and written back to the data store. But that point is not as central to how people think about Hadoop as it probably should be.
4. One thing people have no trouble recalling is that Hadoop is a great place to dump stuff and get it out later. Depending on exactly what you have in mind, there are various metaphors for this, most of which have something to do with liquids. “Big bit bucket” is the most famous, but “data refinery”, “data lake”, and “data reservoir” have also been used.
5. For years, DBMS and Hadoop vendors have bundled low-end text analytics capabilities rather than costlier state-of-the-art ones. I think that may be changing, however, mainly in the form of Attensity partnerships.
Truth be told, I’m not wholly current on text mining vendors — but when I last was, Attensity was indeed the best choice for such partnerships. And I’m not aware of any subsequent developments that would change that conclusion.
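To make point 2 above slightly more concrete, here is a minimal sketch of what email/address matching and cleaning can involve, using only the Python standard library. The reference list, normalization rules, and similarity threshold are purely illustrative assumptions; commercial data cleaning products and the authoritative reference datasets mentioned above go far beyond this.

```python
# Minimal, illustrative sketch of the kind of matching/cleaning referred to in
# point 2 above: normalize email addresses and fuzzily match postal addresses
# against a (hypothetical) authoritative reference list.
import re
from difflib import SequenceMatcher

def normalize_email(raw: str) -> str:
    """Lowercase, strip whitespace, and drop a '+tag' suffix in the local part."""
    local, _, domain = raw.strip().lower().partition("@")
    local = local.split("+", 1)[0]
    return f"{local}@{domain}"

def normalize_address(raw: str) -> str:
    """Crude postal-address canonicalization: casing, punctuation, common abbreviations."""
    s = re.sub(r"[.,]", "", raw.strip().lower())
    s = re.sub(r"\s+", " ", s)
    abbreviations = {"street": "st", "avenue": "ave", "road": "rd", "suite": "ste"}
    return " ".join(abbreviations.get(word, word) for word in s.split())

def best_reference_match(address: str, reference: list, threshold: float = 0.85):
    """Return the closest entry in an authoritative reference list, if it is close enough."""
    cleaned = normalize_address(address)
    scored = [(SequenceMatcher(None, cleaned, normalize_address(r)).ratio(), r) for r in reference]
    score, match = max(scored)
    return match if score >= threshold else None

if __name__ == "__main__":
    print(normalize_email("  Jane.Doe+newsletter@Example.COM "))  # jane.doe@example.com
    reference = ["123 Main St, Springfield", "45 Oak Avenue, Shelbyville"]
    print(best_reference_match("123 Main Street, Springfield", reference))
```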
Related links:
- Merv Adrian’s contrast between Hadoop and data integration tours some of the components of ETL suites. (February, 2013)
- Part of why analytic applications are usually incomplete is the set of issues discussed in this post.
- De-anonymization is an important — albeit privacy-threatening — way of making data more analyzable. (January, 2011)
- I updated my thoughts on Gartner’s Logical Data Warehouse concept earlier this month.
6 Responses to “It’s hard to make data easy to analyze”
Great post Curt, thanks for this. I hear exactly the same thing from organizations and system integrators. I put a comment out on Merv Adrian’s blog post you referenced above about some of these problems.
Joining data, especially where both sides of the join are large, is very difficult in a distributed environment like Hadoop. I hear this time and time again, where MR developers have had to write hundreds of lines of complex Java code to get it to work.
As mentioned in my response to that blog post (and I remember we discussed this some time ago), ETL is a very common use case in Hadoop. You state above that “Hadoop is a great place to dump stuff”, but if you want to operate on it and/or extract it, you need to perform ETL (filtering data, joining data, etc.).
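For concreteness, the pattern behind much of that hand-written Java is the classic reduce-side join: mappers tag each record with its source table, and the framework’s shuffle groups records by join key so a reducer can emit the joined rows. Below is a minimal sketch that simulates the pattern in plain Python; the table names, fields, and explicit sort/group step are illustrative assumptions, not anyone’s production code.

```python
# Illustrative simulation of a MapReduce "reduce-side join" in plain Python.
# In Hadoop, map() and reduce() would run as separate tasks, and the sort/group
# step below would be performed by the framework's shuffle phase.
from itertools import groupby
from operator import itemgetter

orders = [("cust1", "order42"), ("cust2", "order43")]   # (customer_id, order_id)
customers = [("cust1", "Alice"), ("cust2", "Bob")]      # (customer_id, name)

def map_phase():
    """Tag each record with its source so the reducer can tell the sides apart."""
    for cust_id, order_id in orders:
        yield (cust_id, ("order", order_id))
    for cust_id, name in customers:
        yield (cust_id, ("customer", name))

def reduce_phase(mapped):
    """Group by join key (normally done by Hadoop's shuffle), then emit joined rows."""
    for cust_id, group in groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0)):
        values = [v for _, v in group]
        names = [v for tag, v in values if tag == "customer"]
        order_ids = [v for tag, v in values if tag == "order"]
        for name in names:
            for order_id in order_ids:
                yield (cust_id, name, order_id)

if __name__ == "__main__":
    for row in reduce_phase(map_phase()):
        print(row)  # e.g. ('cust1', 'Alice', 'order42')
```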
[…] be analyzed, it must first be prepared (consolidated, cleansed, munged, etc.). Curt Monash wrote a great post about this […]
Making data available in the right format at the right time for analysis is a great topic, and one that deserves more discussion. Thanks for pointing it out.
Just an observation – I agree MDM is an attitude and a process, and I believe that data quality is both a prerequisite to MDM (standardization of reference data to enable master data matching) and is enabled by MDM (using master data as a reference validation to avoid the introduction and propagation of poor data quality).
However, data quality is not a subset of MDM – transaction data that doesn’t fall into the domain of MDM can and should conform to data quality standards and processes, and may or may not be enhanced by MDM. Sometimes the MDM vendors are overselling MDM as a panacea for all enterprise data quality and transformation needs when there are more cost-effective, data-quality-specific or ETL solutions.
Ideally MDM should be taking place on the front-end of operational systems so that data quality is enforced before it enters the warehouse or analytical stream, not as a “fix” in preparation for analysis.
[…] ETL reduction/elimination (Extract/Transform/Load) is a major need. […]
[…] acknowledge that the right data to analyze may not be exactly what was initially stored. Data munging/wrangling/cleaning/preparation is often a big deal. Complicated forms of derived data can be important […]
[…] Monash asserts, “It’s hard to make data easy to analyze.” [DBMS2, 13 February 2013] He continues, “There are many ways to help with preparing […]