Challenges in anomaly management
As I observed yet again last week, much of analytics is concerned with anomaly detection, analysis and response. I don’t think anybody understands the full consequences of that fact,* but let’s start with some basics.
*me included
An anomaly, for our purposes, is a data point (or, more likely, a data aggregate) that is notably different from the trend or norm. If I may oversimplify, there are three kinds of anomalies:
- Important signals. Something is going on, and it matters. Somebody — or perhaps just an automated system — needs to know about it. Time may be of the essence.
- Unimportant signals. Something is going on, but so what?
- Pure noise. Even a fair coin flip can have long streaks of coming up “heads”.
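To make the "pure noise" point concrete, here is a tiny simulation (my own illustration, nothing more) of how long a streak a fair coin typically produces:

```python
import random

def longest_streak(flips):
    """Length of the longest run of identical outcomes."""
    best = run = 1
    for prev, cur in zip(flips, flips[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

random.seed(0)
trials = [longest_streak([random.choice("HT") for _ in range(200)]) for _ in range(1000)]
print("median longest streak in 200 fair flips:", sorted(trials)[len(trials) // 2])
# Usually around 7 or 8 in a row -- purely by chance, yet long enough to look like a signal.
```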
Two major considerations are:
- Whether the recipient of a signal can do something valuable with the information.
- How “costly” it is for the recipient to receive an unimportant signal or other false positive.
What I mean by the latter point is:
- Something that sets a cell phone buzzing had better be important, to the phone’s owner personally.
- But it may be OK if something unimportant changes one small part of a busy screen display.
Anyhow, the Holy Grail* of anomaly management is a system that sends the right alerts to the right people, and never sends them wrong ones. And the quest seems about as hard as that for the Holy Grail, although this one uses more venture capital and fewer horses.
*The Holy Grail, in legend, was found by 1-3 knights: Sir Galahad (in most stories), Sir Percival (in many), and Sir Bors (in some). Leading vendors right now are perhaps around the level of Sir Kay.
Difficulties in anomaly management technology include:
- Performance is a major challenge. Ideally, you’re running statistical tests on all data — at least on all fresh data — at all times. (A sketch of one way to keep that affordable appears after this list.)
- User experiences are held to high standards.
- False negatives are very bad.
- False positives can be very annoying.
- Robust role-based alert selection is often needed.
- So are robust visualization and drilldown.
- Data quality problems can look like anomalies. In some cases, bad data screws up anomaly detection, by causing false positives. In others, it’s just another kind of anomaly to detect.
- Anomalies are inherently surprising. We don’t know in advance what they’ll be.
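On the performance bullet above: one standard way to keep "statistical tests on all fresh data, at all times" affordable is to maintain statistics incrementally, so each new point costs O(1) rather than a rescan of history. A minimal sketch, with made-up values and an arbitrary threshold, assuming Welford's online algorithm is an acceptable stand-in for whatever tests a real system runs:

```python
import math

class RunningZScore:
    """Welford's online algorithm: O(1) per point, no rescan of history."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x):
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return 0.0 if std == 0 else (x - self.mean) / std

detector = RunningZScore()
for value in [10, 11, 9, 10, 12, 10, 11, 95]:   # 95 is the injected anomaly
    if abs(detector.zscore(value)) > 3:          # arbitrary threshold
        print("possible anomaly:", value)
    detector.update(value)
```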
Consequences of that last point (that anomalies are inherently surprising) include:
- It’s hard to tune performance when one doesn’t know exactly how the system will be used.
- It’s hard to set up role-based alerting if one doesn’t know exactly what kinds of alerts there will be.
- It’s hard to choose models for the machine learning part of the system.
Donald Rumsfeld’s distinction between “known unknowns” and “unknown unknowns” is relevant here, although it feels wrong to mention Rumsfeld and Sir Galahad in the same post.
And so a reasonable summary of my views might be:
Anomaly management is an important and difficult problem. So far, vendors have done a questionable job of solving it.
But there’s a lot of activity, which I look forward to writing about in considerable detail.
Related links
- The most directly relevant companies I’ve written about are probably Rocana and Splunk.
Comments
Interesting Post
I wonder if you could speak to the issues of:
– timeliness of response, as I would think that is part of the “holy grail.” This is part of the argument for complex event processing: when the anomalies are in streaming data, you use CEP to detect and act quickly.
– also, the extent to which machine learning helps systems improve their ability to 1) detect anomalies and 2) distinguish between signal and noise.
– Is there not a class of anomalies where we indeed know what they’ll be, but we don’t know when they’ll happen? (Hence your Rumsfeld reference.)
Machine learning is very important, since we can have a lot of different subsets of data with different behavior. We have to “learn” that behavior, rather than configure it by hand.
Finding anomalies that we are seeing for the first time can also be addressed by all kinds of “single class” classification algorithms. I tend to see that this is hard to do with ML, but I believe it is even harder to solve without it.
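For what it’s worth, one widely available instance of the “single class” idea is scikit-learn’s OneClassSVM, which is fit only on data presumed normal and then labels new points as inliers or outliers. A minimal sketch; the data, the nu setting, and the scenario are my assumptions, not anything the commenter specified:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal_latencies = rng.normal(loc=100, scale=10, size=(1000, 1))  # training data: "normal" only

model = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale")  # nu ~ tolerated fraction of outliers
model.fit(normal_latencies)                                # learns a boundary around normal behavior

new_points = np.array([[103.0], [97.0], [260.0]])
print(model.predict(new_points))                           # +1 = looks normal, -1 = flagged
```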
Re timeliness:
I’d say that the detection should be as timely as possible. But what’s possible depends on, for example, whether the anomaly is a single event or a deviation from a “typical” number of events per minute.
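For the second case, a deviation from a “typical” number of events per minute can be checked with nothing fancier than per-minute counts compared against a baseline. A minimal sketch; the timestamps, the baseline of 3 events per minute, and the tolerance are all invented for illustration:

```python
from collections import Counter

# Hypothetical event timestamps, in seconds since some epoch.
event_times = [3, 15, 42, 61, 70, 95, 118, 121, 122, 123, 124, 125, 126, 127, 140]

per_minute = Counter(int(t // 60) for t in event_times)   # bucket events by minute
baseline, tolerance = 3.0, 2.0                            # assumed "typical" rate and allowed deviation

for minute, count in sorted(per_minute.items()):
    if abs(count - baseline) > tolerance:
        print(f"minute {minute}: {count} events (expected about {baseline:.0f})")
```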
I would rephrase that: detection should happen as soon as there are statistical (or other) grounds for it.
The problem with machine learning in general, and anomaly detection specifically, is the necessary training/re-training cycle. The end result of training an ML algorithm on the data is a static (or batch) model. Once new data arrives, the model has to be re-trained. Lather, rinse, and repeat. There is an obvious disconnect between data pouring in and a detection system that is trained on “old” data.
This is also the reason why ML cannot be integrated into enterprise software systems. You cannot stop the ERP system while the ML system is re-trained.
I completely agree that the practical application of ML is challenging, but it is still possible for anomaly detection. One way is to build models general enough that they work well even if they are a bit outdated.
A better way is to have daily, hourly, and more fine-grained models, and to aggregate and combine them “on demand”.
That makes particular sense when there is a strong bias toward specific time frames, such as weekdays or rush hours.
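As one deliberately simplified sketch of that idea: keep separate baselines keyed by (day-of-week, hour) with a coarser hour-only fallback, and pick between them at scoring time. The keys, the 30-observation cutoff, and the incremental means are my own assumptions rather than the commenter’s design:

```python
from collections import defaultdict
from datetime import datetime

# Running means keyed by (weekday, hour), plus a coarser hour-only fallback.
fine_models = defaultdict(lambda: [0, 0.0])    # (weekday, hour) -> [count, mean]
hourly_models = defaultdict(lambda: [0, 0.0])  # hour -> [count, mean]

def train(ts: datetime, value: float) -> None:
    """Fold one observation into both the fine-grained and the fallback model."""
    for model in (fine_models[(ts.weekday(), ts.hour)], hourly_models[ts.hour]):
        model[0] += 1
        model[1] += (value - model[1]) / model[0]   # incremental mean update

def expected(ts: datetime) -> float:
    """Use the fine-grained baseline once it has seen enough data; otherwise fall back."""
    fine = fine_models[(ts.weekday(), ts.hour)]
    return fine[1] if fine[0] >= 30 else hourly_models[ts.hour][1]
```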
[…] In June I wrote about why anomaly management is hard. Well, not only is it hard to do; it’s hard to talk about as well. One reason, I think, is that […]