Predictive modeling and advanced analytics
Discussion of technologies and vendors in the overlapping areas of predictive analytics, predictive modeling, data mining, machine learning, Monte Carlo analysis, and other “advanced” analytics.
Some stuff that’s always on my mind
I have a LOT of partially-written blog posts, but am struggling to get any of them finished (obviously). Much of the problem is that they have so many dependencies on each other. Clearly, then, I should consider refactoring my writing plans. 🙂
So let’s start with this. Here, in no particular order, is a list of some things that I’ve said in the past, and which I still think are or should be of interest today. It’s meant to be background for numerous posts I hope to write in the near future, and indeed a few hooks for such posts are included below.
1. Data(base) management technology is progressing pretty much as I expected.
- Vendors generally recognize that maturing a data store is an important, many-years-long process.
- Multiple kinds of data model are viable …
- … but it’s usually helpful to be able to do some kind of JOIN.
- To deal with the variety of hardware/network/storage arrangements out there, layering/tiering is on the rise. (An amazing number of vendors each seem to think they invented the idea.)
2. Rightly or wrongly, enterprises are often quite sloppy about analytic accuracy.
- My two central examples have long been inaccurate metrics and false-positive alerts.
- In predictive analytics, it’s straightforward to quantify how much additional value you’re leaving on the table because of imperfect accuracy. (A toy calculation is sketched just after this list.)
- Enterprise search and other text technologies are still often terrible.
- After years of “real-time” overhype, organizations have seemingly swung to under-valuing real-time analytics.
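To make the accuracy point concrete, here’s a toy calculation with entirely invented numbers (my illustration, not any vendor’s methodology), comparing the profit from an imperfect targeting model against a hypothetical perfect one:

```python
# Toy calculation: profit from an imperfect targeting model vs. a
# hypothetical perfect one. All figures are invented for the example.

def expected_profit(true_pos, false_pos, value_per_hit=100.0, cost_per_contact=5.0):
    # Each correctly targeted prospect earns value_per_hit; every
    # contact, right or wrong, costs cost_per_contact.
    return true_pos * value_per_hit - (true_pos + false_pos) * cost_per_contact

actual_positives = 1_000            # responders hidden among 10,000 prospects

# An imperfect model: 80% recall, 25% precision.
tp = int(actual_positives * 0.80)   # 800 responders found
fp = int(tp / 0.25) - tp            # 2,400 wasted contacts

imperfect = expected_profit(tp, fp)
perfect = expected_profit(actual_positives, 0)

print(f"Imperfect model profit: ${imperfect:,.0f}")
print(f"Perfect-model ceiling:  ${perfect:,.0f}")
print(f"Left on the table:      ${perfect - imperfect:,.0f}")
```

The gap between the last two figures is exactly the value left on the table; improving precision or recall narrows it, and the arithmetic tells you by how much.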
Notes on artificial intelligence, December 2017
Most of my comments about artificial intelligence from December 2015 still hold true. But there are a few points I’d like to add, reiterate or amplify.
1. As I wrote back then in a post about the connection between machine learning and the rest of AI,
It is my opinion that most things called “intelligence” — natural and artificial alike — have a great deal to do with pattern recognition and response.
2. Accordingly, it can be reasonable to equate machine learning and AI.
- AI based on machine learning frequently works, on more than a toy level. (Examples: Various projects by Google)
- AI based on knowledge representation usually doesn’t. (Examples: IBM Watson, 1980s expert systems)
- “AI” can be the sexier marketing or fund-raising term.
3. Similarly, it can be reasonable to equate AI and pattern recognition. Glitzy applications of AI include:
- Understanding or translation of language (written or spoken as the case may be).
- Machine vision or autonomous vehicles.
- Facial recognition.
- Disease diagnosis via radiology interpretation.
4. The importance of AI and of recent AI advances differs greatly according to application or data category.
Imanis Data
I talked recently with the folks at Imanis Data. For starters:
- The point of Imanis is to make copies of your databases, for purposes such as backup/restore, test/analysis, or compliance-driven archiving. (That’s in declining order of current customer activity.) Another use is migration via restoring to a different cluster than the one that created the data in the first place.
- The data can come from NoSQL database managers, from Hadoop, or from Vertica. (Again, that’s in declining order.)
- As you might imagine, Imanis makes incremental backups; the only full backup is the first one you do for a given database. (A generic sketch of the incremental idea follows this list.)
- “Imanis” is a new name; the previous name was “Talena”.
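The incremental part is the textbook mechanism. Here’s a minimal sketch of the general idea, assuming hash-based change detection; it’s my own illustration, not a description of how Imanis actually works:

```python
# Generic incremental-backup sketch: the first run copies everything;
# later runs copy only files whose content hash changed. This is the
# textbook idea, NOT a description of Imanis's actual implementation.
import hashlib
import os
import shutil

def backup(source_dir, dest_dir, manifest):
    """Copy new or changed files from source_dir into dest_dir.
    `manifest` maps relative paths to content hashes from the previous
    run; passing an empty dict makes this a full backup."""
    new_manifest = {}
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, source_dir)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            new_manifest[rel] = digest
            if manifest.get(rel) != digest:      # new or changed file
                target = os.path.join(dest_dir, rel)
                os.makedirs(os.path.dirname(target), exist_ok=True)
                shutil.copy2(path, target)
    return new_manifest  # feed this into the next incremental run

# First run (full):         m = backup("data", "backups/run1", {})
# Later runs (incremental): m = backup("data", "backups/run2", m)
```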
Analytics on the edge?
There’s a theory going around to the effect that:
- Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
- Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
- Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.
There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.
1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.
2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.
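If Spark-style streaming did run unattended on such devices, the sort of local analytics I have in mind might look like this minimal PySpark Structured Streaming sketch, with Spark’s built-in rate source standing in for a real sensor feed:

```python
# Minimal PySpark Structured Streaming sketch of on-appliance
# operational analytics. The built-in "rate" test source stands in
# for a real sensor feed (columns: timestamp, value).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("appliance-analytics").getOrCreate()

readings = (spark.readStream
            .format("rate")
            .option("rowsPerSecond", 10)
            .load())

# One-minute rolling aggregates: the kind of summary an appliance
# could compute locally, forwarding only the aggregates upstream.
summary = (readings
           .withWatermark("timestamp", "2 minutes")
           .groupBy(F.window("timestamp", "1 minute"))
           .agg(F.avg("value").alias("avg_value"),
                F.max("value").alias("max_value")))

query = (summary.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```

Computing summaries locally and shipping only the aggregates is precisely the answer to the bandwidth problem described in the bullets above.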
3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration.
Analyzing the right data
0. A huge fraction of what’s important in analytics amounts to making sure that you are analyzing the right data. To a large extent, “the right data” means “the right subset of your data”.
1. In line with that theme:
- Relational query languages, at their core, subset data. Yes, they all also do arithmetic, and many do more math or other processing than just that. But it all starts with set theory. (A trivial illustration follows this list.)
- Underscoring the power of this approach, other data architectures over which analytics is done usually wind up with SQL or “SQL-like” language access as well.
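A trivial illustration of the subset-then-compute point, using Python’s built-in sqlite3 with a toy table:

```python
# Trivial illustration: a relational query subsets data (WHERE),
# then layers arithmetic on top (SUM). Toy table, toy figures.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("East", 120.0), ("West", 80.0), ("East", 45.0)])

total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE region = ?", ("East",)
).fetchone()[0]
print(total)  # 165.0 -- the arithmetic runs over the subset only
```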
2. Business intelligence interfaces today don’t look that different from what we had in the 1980s or 1990s. The biggest visible* changes, in my opinion, have been in the realm of better drilldown, à la QlikView and then Tableau. Drilldown, of course, is the main UI for business analysts and end users to subset data themselves.
*I used the word “visible” on purpose. The advances at the back end have been enormous, and much of that redounds to the benefit of BI.
3. I wrote 2 1/2 years ago that sophisticated predictive modeling commonly fit the template:
- Divide your data into clusters.
- Model each cluster separately.
That continues to be tough work. Attempts to productize shortcuts have not caught fire.
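For readers who haven’t seen the template in action, here’s a bare-bones sketch on synthetic data, using scikit-learn. The hard work that the shortcuts fail to automate, such as choosing the clustering and validating each per-cluster model, is conspicuously absent:

```python
# Bare-bones cluster-then-model sketch: partition with k-means,
# then fit a separate regression per cluster. Synthetic data only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

models = {}
for c in np.unique(clusters):
    mask = clusters == c
    models[c] = LinearRegression().fit(X[mask], y[mask])  # one model per cluster

# Scoring a new point means assigning it to a cluster first, then
# applying that cluster's model.
```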
Monitoring
A huge fraction of analytics is about monitoring. People rarely want to frame things in those terms; evidently they think “monitoring” sounds boring or uncool. One cost of that silence is that it’s hard to get good discussions going about how monitoring should be done. But I’m going to try anyway, yet again. 🙂
Business intelligence is largely about monitoring, and the same was true of predecessor technologies such as green-bar paper reports or even pre-computer techniques. Two of the top uses of reporting technology can be squarely described as monitoring, namely:
- Watching whether trends are continuing or not.
- Seeing if there are any events — actual or impending as the case may be — that call for response, in areas such as:
  - Machine breakages (computers and physical machinery alike).
  - Resource shortfalls (e.g. various senses of “inventory”).
Yes, monitoring-oriented BI needs investigative drilldown, or else it can be rather lame. Yes, purely investigative BI is very important too. But monitoring is still the heart of most BI desktop installations.
Predictive modeling is often about monitoring too. It is common to use statistics or machine learning to help you detect and diagnose problems, and many such applications have a strong monitoring element.
I.e., you’re predicting trouble before it happens, when there’s still time to head it off.
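As a toy example of that (the disk-usage figures are invented): fit a trend to a metric and estimate when it will cross a threshold, while there’s still time to act.

```python
# Toy monitoring-flavored prediction: extrapolate a metric's trend
# and alert if it is projected to breach a threshold soon.
# The disk-usage figures are synthetic.
import numpy as np

hours = np.arange(48)                       # last 48 hourly samples
rng = np.random.default_rng(1)
disk_pct = 60 + 0.5 * hours + rng.normal(scale=1.0, size=48)

slope, intercept = np.polyfit(hours, disk_pct, 1)  # linear trend fit
threshold = 95.0
if slope > 0:
    current = slope * hours[-1] + intercept
    hours_to_breach = (threshold - current) / slope
    if hours_to_breach < 24:
        print(f"ALERT: disk projected to hit {threshold}% "
              f"in about {hours_to_breach:.0f} hours")
```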
As for incident response, in areas such as security: any incident you respond to has to be noticed first. Often, it’s noticed through analytic monitoring.
Hopefully, that’s enough of a reminder to establish the great importance of analytics-based monitoring. So how can the practice be improved? At least three ways come to mind, and only one of those three is getting enough current attention.
Cloudera’s Data Science Workbench
0. Matt Brandwein of Cloudera briefed me on the new Cloudera Data Science Workbench. The problem it purports to solve is:
- One way to do data science is to repeatedly jump through the hoops of working with a properly-secured Hadoop cluster. This is difficult.
- Another way is to extract data from a Hadoop cluster onto your personal machine. This is insecure (once the data arrives) and not very parallelized.
- A third way is needed.
Cloudera’s idea for a third way is:
- You don’t run anything on your desktop/laptop machine except a browser.
- The browser connects you to a Docker container that holds (and isolates) a kind of virtual desktop for you.
- The Docker container runs on your Cloudera cluster, so connectivity-to-Hadoop and security are handled rather automagically.
In theory, that’s pure goodness … assuming that the automagic works sufficiently well. I gather that Cloudera Data Science Workbench has been beta-tested by 5 large organizations and many tens of users. We’ll see what is or isn’t missing as more customers take it for a spin.
Coordination, the underused “C” word
I’d like to argue that a single frame can be used to view a lot of the issues that we think about. Specifically, I’m referring to coordination, which I think is a clearer way of characterizing much of what we commonly call communication or collaboration.
It’s easy to argue that computing, to an overwhelming extent, is really about communication. Most obviously:
- Data is constantly moving around — across wide area networks, across local networks, within individual boxes, or even within particular chips.
- Many major developments are almost purely about communication. The most important computing device today may be a telephone. The World Wide Web is essentially a publishing platform. Social media are huge. Etc.
Indeed, it’s reasonable to claim:
- When technology creates new information, it’s either analytics or just raw measurement.
- Everything else is just moving information around, and that’s communication.
A little less obvious is that much of this communication could alternatively be described as coordination. Some communication has pure consumer value, such as when we talk/email/Facebook/Snapchat/FaceTime with loved ones. But much of the rest is for the purpose of coordinating business or technical processes.
Among the technical categories that boil down to coordination are:
- Operating systems.
- Anything to do with distributed computing.
- Anything to do with system or cluster management.
- Anything that’s called “collaboration”.
That’s a lot of the value in “platform” IT right there.
Notes on anomaly management
Then felt I like some watcher of the skies
When a new planet swims into his ken
— John Keats, “On First Looking Into Chapman’s Homer”
1. In June I wrote about why anomaly management is hard. Well, not only is it hard to do; it’s hard to talk about as well. One reason, I think, is that it’s hard to define what an anomaly is. And that’s a structural problem, not just a semantic one — if something is well enough understood to be easily described, then how much of an anomaly is it after all?
Artificial intelligence is famously hard to define for similar reasons.
“Anomaly management” and similar terms are not yet in the software marketing mainstream, and may never be. But naming aside, the actual subject matter is important.
2. Anomaly analysis is clearly at the heart of several sectors, including:
- IT operations
- Factory and other physical-plant operations
- Security
- Anti-fraud
- Anti-terrorism
Each of those areas operates on one or both of these assumptions:
- Surprises are likely to be bad.
- Coincidences are likely to be suspicious.
So if you want to identify, understand, avert and/or remediate bad stuff, data anomalies are the first place to look.
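As a minimal illustration of “surprises are likely to be bad”, here’s a crude z-score detector on synthetic data. Real anomaly management also has to cope with seasonality, novel patterns, and alert fatigue, which is much of why it’s hard:

```python
# Crude anomaly flagging: report points far outside the series'
# overall distribution. Synthetic data with one injected surprise.
import numpy as np

rng = np.random.default_rng(2)
series = rng.normal(loc=100.0, scale=5.0, size=500)
series[321] = 180.0                         # inject a surprise

z = np.abs(series - series.mean()) / series.std()
for i in np.flatnonzero(z > 4):             # 4 sigma = very surprising
    print(f"anomaly at t={i}: value {series[i]:.1f} (z={z[i]:.1f})")
```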
3. The “insights” promised by many analytics vendors — especially those who sell to marketing departments — are also often heralded by anomalies. Already in the 1970s, Walmart observed that red clothing sold particularly well in Omaha, while orange flew off the shelves in Syracuse. And so, in large college towns, they stocked their stores to the gills with clothing in the colors of the local football team. They also noticed that fancy dresses for little girls sold especially well in Hispanic communities … specifically for girls at the age of First Communion.
“Real-time” is getting real
I’ve been an analyst for 35 years, and debates about “real-time” technology have run through my whole career. Some of those debates are by now pretty much settled. In particular:
- Yes, interactive computer response is crucial.
  - Into the 1980s, many apps were batch-only. Demand for such apps dried up.
  - Business intelligence should occur at interactive speeds, which is a major reason that there’s a market for high-performance analytic RDBMS.
- Theoretical arguments about “true” real-time vs. near-real-time are often pointless.
  - What matters in most cases is human users’ perceptions of speed.
  - Most of the exceptions to that rule occur when machines race other machines, for example in automated bidding (high frequency trading or otherwise) or in network security.
A big issue that does remain open is: How fresh does data need to be? My preferred summary answer is: As fresh as is needed to support the best decision-making. I think that formulation starts with several advantages:
- It respects the obvious point that different use cases require different levels of data freshness.
- It cautions against people who think they need fresh information but aren’t in a position to use it. (Such users have driven much bogus “real-time” demand in the past.)
- It covers cases of both human and automated decision-making.
Straightforward applications of this principle include: …