Analyzing the right data
0. A huge fraction of what’s important in analytics amounts to making sure that you are analyzing the right data. To a large extent, “the right data” means “the right subset of your data”.
1. In line with that theme:
- Relational query languages, at their core, subset data. Yes, they all also do arithmetic, and many do more math or other processing than just that. But it all starts with the set theory. (A toy example follows this list.)
- Underscoring the power of this approach, other data architectures over which analytics is done usually wind up with SQL or “SQL-like” language access as well.
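To make the first bullet concrete, here’s a toy sketch using Python’s built-in sqlite3 and a made-up orders table. The point is simply that the WHERE clause does the set-theoretic work of picking the subset before any arithmetic is applied.

```python
# A toy illustration, not from any real system: the WHERE clause subsets,
# then SUM does arithmetic over that subset. Table and column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "west", 40.0), (2, "east", 75.0), (3, "west", 120.0)],
)

# Subset first (set theory), arithmetic second.
total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE region = 'west' AND amount > 50"
).fetchone()[0]
print(total)  # 120.0
```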
2. Business intelligence interfaces today don’t look that different from what we had in the 1980s or 1990s. The biggest visible* changes, in my opinion, have been in the realm of better drilldown, à la QlikView and then Tableau. Drilldown, of course, is the main UI for business analysts and end users to subset data themselves (a toy illustration follows the footnote).
*I used the word “visible” on purpose. The advances at the back end have been enormous, and much of that redounds to the benefit of BI.
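If you want to picture drilldown mechanically, it’s just progressive subsetting and re-aggregation. A toy pandas sketch, with made-up region/product/revenue columns:

```python
# Drilldown as progressive subsetting: aggregate broadly, then filter to one
# slice and re-aggregate at a finer grain. All names and figures are made up.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["west", "west", "east", "east"],
    "product": ["a", "b", "a", "b"],
    "revenue": [100.0, 60.0, 80.0, 90.0],
})

# Top-level view: totals by region.
by_region = sales.groupby("region")["revenue"].sum()

# "Drilling into" the west region = restricting to that subset,
# then regrouping by a finer dimension.
west_by_product = (
    sales[sales["region"] == "west"]
    .groupby("product")["revenue"].sum()
)
print(by_region, west_by_product, sep="\n")
```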
3. I wrote 2 1/2 years ago that sophisticated predictive modeling commonly fit the template:
- Divide your data into clusters.
- Model each cluster separately.
That continues to be tough work. Attempts to productize shortcuts have not caught fire.
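To make the template concrete, here’s a minimal scikit-learn sketch on synthetic data; the hard part in real life is, of course, choosing clusters and features that make business sense.

```python
# A minimal sketch of "cluster, then model each cluster" on synthetic data.
# X and y are stand-ins for real features and a real target.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + rng.normal(size=300)

# Step 1: divide the data into clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: model each cluster separately.
models = {
    k: LinearRegression().fit(X[labels == k], y[labels == k])
    for k in np.unique(labels)
}

# Scoring a new row then means routing it to its cluster's model first.
```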
4. In an example of the previous point, anomaly management technology can, in theory, help shortcut any type of analytics, in that it tries to identify what parts of your data to focus on (and why). But it’s in its early days; none of the approaches to general anomaly management has gained much traction.
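For illustration only, here is one common building block (flagging which rows deserve attention), sketched with scikit-learn’s IsolationForest on made-up data. Full anomaly management would also have to explain, prioritize and route what it finds.

```python
# Flag which rows of a (made-up) metrics matrix look anomalous.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
X[:5] += 8.0  # plant a few obvious outliers

flags = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)
suspicious_rows = np.where(flags == -1)[0]  # -1 marks likely anomalies
print(suspicious_rows)
```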
5. Marketers have vast amounts of information about us. It starts with every credit card transaction line item and a whole lot of web clicks. But it’s not clear how many of those (10s of) thousands of columns of data they actually use.
6. In some cases, the “right” amount of data to use may actually be tiny. Indeed, some statisticians claim that fewer than 10 data points may be enough to get a good model. I’m skeptical, at least as to the practical significance of such extreme figures. But on the more plausible side — if you’re hunting bad guys, it may not take very many separate facts before you have good evidence of collusion or fraud.
Internet fraud excepted, of course. Identifying that usually involves sifting through a lot of log entries.
7. All the needle-hunting in the world won’t help you unless what you seek is in the haystack somewhere.
- Often, enterprises explicitly invest in getting more data.
- Keeping everything you already generate is the obvious choice for most categories of data, but some of the lowest-value-per-bit logs may forever be thrown away.
8. Google is famously in the camp that there’s no such thing as too much data to analyze. For example, it uses >500 “signals” in judging the quality of potential search results. I don’t know how many separate data sources those signals are informed by, but surely there are a lot.
9. Few predictive modeling users demonstrate a need for vast data scaling. My support for that claim is a lot of anecdata. In particular:
- Some predictive modeling techniques scale well. Some scale poorly. The level of pain around the “scale poorly” aspects of that seems to be fairly light (or “moderate” at worst). For example:
  - In the previous technology generation, analytic DBMS and data warehouse appliance vendors tried hard to make statistical packages scale across their systems. Success was limited. Nobody seemed terribly upset.
  - Cloudera’s Data Science Workbench messaging isn’t really scaling-centric.
  - Spark’s success in machine learning is rather rarely portrayed as centering on scaling. And even when it is, Spark basically runs in memory, so no single Spark node is processing all that much data.
10. Somewhere in this post — i.e. right here 🙂 — let’s acknowledge that the right data to analyze may not be exactly what was initially stored. Data munging/wrangling/cleaning/preparation is often a big deal. Complicated forms of derived data can be important too.
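A trivial pandas example of what that preparation can look like, with made-up columns and rules: dedupe, parse dates, fill gaps, then derive a new field.

```python
# Made-up munging example: the rules here are illustrative, not prescriptive.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "signup_date": ["2017-01-02", "2017-01-02", "2017-02-10", None],
    "spend":       [100.0, 100.0, None, 40.0],
})

clean = (
    raw.drop_duplicates()                                            # drop exact duplicate rows
       .assign(signup_date=lambda d: pd.to_datetime(d["signup_date"]))
       .fillna({"spend": 0.0})                                       # one possible missing-value rule
)

# A simple piece of derived data: customer tenure as of a fixed reference date.
clean["tenure_days"] = (pd.Timestamp("2017-04-01") - clean["signup_date"]).dt.days
```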
11. Let’s also mention data marts. Basically, data marts subset and copy data, either because the data will be easier to analyze in its copied form, or to separate workloads between the original and copied data stores. (The basic mechanics are sketched after this list.)
- If the data is on spinning disks or even flash, the need for that strategy declined long ago.
- Suppose you want to keep data entirely in memory? Then you might indeed want to subset-and-copy it. But with so many memory-centric systems doing decent jobs of persistent storage too, there’s often a viable whole-dataset management alternative.
But notwithstanding the foregoing:
- Security/access control can be a good reason for subset-and-copy.
- So can other kinds of administrative simplification.
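The subset-and-copy mechanics themselves are simple; here they are sketched in SQLite with made-up names (a real mart would of course usually live in its own store).

```python
# Subset-and-copy into a "mart" table that can be queried (or secured)
# independently of the source. Names and data are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, dept TEXT, ts TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [(1, "marketing", "2017-03-01", "a"), (2, "finance", "2017-03-02", "b")],
)

conn.execute("""
    CREATE TABLE marketing_mart AS
    SELECT id, ts, payload
    FROM events
    WHERE dept = 'marketing' AND ts >= '2017-01-01'
""")
print(conn.execute("SELECT COUNT(*) FROM marketing_mart").fetchone()[0])  # 1
```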
12. So what does this all suggest going forward? I believe:
- Drilldown is and will remain central to BI. If your BI doesn’t support robust drilldown, you’re doing it wrong. “Real-time” use cases are not exceptions to this rule.
- In a strong overlap with the previous point, drilldown is and will remain central to monitoring. Whatever monitoring means to you, the ability to pinpoint the specific source of interesting signals is crucial.
- The previous point can be recast as saying that it’s crucial to identify, isolate and explain anomalies. Some version(s) of anomaly management will become a big deal.
- SQL and “SQL-like” languages will remain integral to analytic processing for a long time.
- Memory-centric analytic frameworks such as Spark will continue to win. The data size constraints imposed by memory-centric processing will rarely cause difficulties.
Related links
- Other recent “unifying-theme” posts focused on monitoring and coordination.
- My 2013 post on what matters in investigative analytics still holds up pretty well.