Analytics on the edge?
There’s a theory going around to the effect that:
- Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
- Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
- Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.
There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.
1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.
2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.
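Roughly what I have in mind, purely as a sketch: a local-mode Spark Structured Streaming job that keeps rolling aggregates on the appliance itself, with checkpointing so it can restart unattended after a crash or reboot. The paths, schema and window sizes below are made up for illustration.

```python
# A sketch only: paths, schema, and window sizes are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .master("local[2]")                      # run on the appliance itself
         .appName("unattended-edge-analytics")
         .getOrCreate())

# Hypothetical drop directory where on-board apps write raw sensor readings.
readings = (spark.readStream
            .schema("sensor STRING, ts TIMESTAMP, value DOUBLE")
            .csv("/var/edge/incoming"))

# Rolling per-sensor aggregates over one-minute windows.
stats = (readings
         .withWatermark("ts", "5 minutes")
         .groupBy("sensor", F.window("ts", "1 minute"))
         .agg(F.avg("value").alias("avg_value"),
              F.max("value").alias("max_value")))

# Checkpointing lets the job pick up where it left off after a restart,
# which is most of what "unattended operation" means in practice.
query = (stats.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "/var/edge/aggregates")
         .option("checkpointLocation", "/var/edge/_checkpoints")
         .start())

query.awaitTermination()
```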
3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration.
There are any number of situations in which decisions are made on or about remote systems, based on models or rules that should be improved over time. For example, such decisions might be made in:
- Machine vision or other “recognition”-oriented areas of AI.
- Detection or prediction of malfunctions.
- Choices as to what data is significant enough to ship back upstream.
In the canonical case, we might envision a system in which:
- Huge amounts of data are collected and are used to make real-time decisions.
- The models are trained centrally, and updated remotely over time as they are improved.
- The remote systems can only ship back selected or aggregated data to help train the models.
This all seems like an awkward fit for any common computing architecture I can think of.
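Purely as a sketch, the loop on each remote system might look something like the following. The endpoints, the model format and the read_sensor() function are hypothetical stand-ins.

```python
# A sketch only: URLs, model format, and read_sensor() are hypothetical.
import json
import time
import urllib.request

MODEL_URL = "https://central.example.com/models/latest"    # hypothetical
UPLOAD_URL = "https://central.example.com/observations"    # hypothetical

def fetch_model():
    """Pull the latest centrally trained model."""
    with urllib.request.urlopen(MODEL_URL) as resp:
        return json.load(resp)

def score(model, reading):
    """Apply the model locally; a trivial threshold stands in for real scoring."""
    return reading["value"] > model.get("threshold", 0.5)

def run(read_sensor, refresh_every=3600, flush_every=60):
    model = fetch_model()
    last_refresh = last_flush = time.time()
    significant = []
    while True:
        reading = read_sensor()                    # huge local data volume
        if score(model, reading):                  # real-time decision
            significant.append(reading)            # keep only what matters
        now = time.time()
        if significant and now - last_flush > flush_every:
            # Ship back selected/aggregated data to help retrain the model.
            body = json.dumps(significant).encode()
            req = urllib.request.Request(
                UPLOAD_URL, data=body,
                headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)
            significant, last_flush = [], now
        if now - last_refresh > refresh_every:
            model, last_refresh = fetch_model(), now   # remote model update
```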
But it’s hard to pin down important examples of that “canonical” case. The story implicitly assumes:
- A model is widely deployed.
- The model does a decent job but not a perfect one.
- Based on its successes and failures, the model gets improved.
And that raises a huge question: What exactly keeps score as to when the model succeeds and when it fails? Mathematically speaking, I can’t imagine what a general answer would look like.
4. So when it comes to predictive models executed on real-world appliances, I think analytic workflows will:
- Differ for different categories of applications.
- Rely in most cases on simple patterns of data movement, such as:
  - Stream everything to central servers and sort it out there, or if that’s not workable …
  - … instrument a limited number of test nodes to store everything, and recover the data in batch for analysis (sketched just below).
- Update models only on the same timeframe as a full app update/refresh.
And with that, much of the apparent need for fancy distributed analytic architectures evaporates.
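The second of those simple patterns, instrumenting a few test nodes to store everything for later batch recovery, might be no more complicated than this sketch; the spool directory and record format are assumptions.

```python
# A sketch only: the spool directory and record format are assumptions.
import gzip
import json
import os
import time

SPOOL_DIR = "/var/spool/edge-capture"

def spool(record, max_bytes=64 * 1024 * 1024):
    """Append one JSON record to an hourly, size-capped, gzipped spool file."""
    os.makedirs(SPOOL_DIR, exist_ok=True)
    hour = time.strftime("%Y%m%d%H")
    path = os.path.join(SPOOL_DIR, f"capture-{hour}.jsonl.gz")
    if os.path.exists(path) and os.path.getsize(path) > max_bytes:
        return                      # cap disk usage; quietly drop the overflow
    with gzip.open(path, "at") as f:
        f.write(json.dumps(record) + "\n")
    # The files are later pulled off the node in batch and analyzed centrally.
```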
5. Finally, and notwithstanding the previous point: Across many use cases, there’s some kind of remote log data being shipped back to a central location. It may be the complete log. It may be periodic aggregates. It may be only what the edge nodes regard as significant events. But something is getting shipped home.
The architectures for shipping, receiving and analyzing such data are in many cases immature. That’s obvious if there’s any kind of streaming involved, or if analysis is done in Spark. Ditto if there’s anything we might call “non-tabular business intelligence”. As this stuff matures, it will in many cases fit very well with today’s cloud thinking. But in any case — it needs to mature.
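For what it’s worth, the receiving side often reduces to something like the following sketch: a Spark Structured Streaming job that reads shipped events off Kafka and maintains aggregates. The broker, topic, schema and sink are assumptions, and the job would need the spark-sql-kafka connector on its classpath.

```python
# A sketch only: broker address, topic, schema, and sink are assumptions.
# Requires the spark-sql-kafka connector package at launch time.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("edge-log-ingest").getOrCreate()

event_schema = (StructType()
                .add("device", StringType())
                .add("ts", TimestampType())
                .add("metric", StringType())
                .add("value", DoubleType()))

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "edge-events")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", event_schema).alias("e"))
          .select("e.*"))

# Per-device, per-metric counts over five-minute windows.
summary = (events
           .withWatermark("ts", "10 minutes")
           .groupBy("device", "metric", F.window("ts", "5 minutes"))
           .count())

(summary.writeStream
 .outputMode("append")
 .format("parquet")
 .option("path", "/warehouse/edge_summaries")
 .option("checkpointLocation", "/warehouse/_chk/edge_summaries")
 .start()
 .awaitTermination())
```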
Truth be told, even the relational case is immature, in that it can easily rely on what I called:
data warehouses (perhaps really data marts) that are updated in human real-time
That quote is from a recent post about Kudu, which:
- Is designed for exactly that use case.
- Went GA early this year.
As always, technology is in flux.
Related links
- Interana is another example of very new technology that seems applicable to these use cases.
- My 2013 post on the future of IT architectures still rings true.
Comments
Interesting to note (in regard to the cloud) that Amazon already has Amazon Greengrass for edge computing. https://aws.amazon.com/greengrass/