Is analytic data management finally headed for the cloud?
It seems reasonable to wonder whether analytic data management is headed for the cloud. In no particular order:
- Amazon Redshift appears to be prospering.
- So are some SaaS (Software as a Service) business intelligence vendors.
- Amazon Elastic MapReduce is still around.
- Snowflake Computing launched with a cloud strategy.
- Cazena, with vague intentions for cloud data warehousing, destealthed.*
- Cloudera made various cloud-related announcements.
- Data is increasingly machine-generated, and machine-generated data commonly originates off-premises.
- The general argument for cloud-or-at-least-colocation has compelling aspects.
- Analytic workloads can be “bursty”, and so could benefit from true cloud elasticity.
Also — although the specifics on this are generally vague and/or confidential — I sense a narrowing of the gap between:
- The hardware + networking required for performant analytic data management.
- The hardware + networking available in the cloud.
*Cazena is proud of its team of advisors. However, the only person yet announced for a Cazena operating role is Prat Moghe, and his time period in Netezza’s mainstream happens not to have been one in which Netezza had much technical or market accomplishment.
On the other hand:
- If you have processing power very close to the data, then you can avoid a lot of I/O or data movement. Many cloud configurations do not support this.
- Many optimizations depend upon controlling or at least knowing the hardware and networking set-up. Public clouds rarely offer that level of control.
And so I’m still more confident in SaaS/colocation analytic data management, or in Redshift, than I am in true arm’s-length cloud-based systems.
Comments
3 Responses to “Is analytic data management finally headed for the cloud?”
In my experience, machine-generated data is creating a strong economic driver toward fine-grained geo-federated data management rather than cloud per se. It makes sense to push a lot of the analytics all the way out to the edge, very close to the generating source, such that there is no meaningful cloud deployment model.
The driver is insufficient or expensive bandwidth between the location where the data is generated and a sufficiently large data center, cloud or otherwise, where it can be aggregated in the raw and analyzed. Even with aggressive filtering at the edge, a petabyte per day of rich, operational data is not uncommon. There is a desire to analyze the data in place as part of a federated analysis rather than culling and backhauling, which loses too much context.
We work with a lot of very fast, very large machine-generated data sources, primarily for the purposes of spatial analytics. Much of this is in the cloud on big commodity clusters. One of the most common customer requests is whether our platform can be deployed as a fine-grained geo-federated system, so that many of the analytics can run adjacent to the collection platform to reduce bandwidth requirements.
For machine-generated (“IoT”), sensor, and other spatially organized data models this actually makes a lot of sense, because the topology of reality matches the topology of the spatial analysis algorithms. Spatial joins and aggregates across various data sources would have a lot of physical network locality when organized this way, which reduces analysis latency while increasing the effective bandwidth and practically supportable data volumes.
In many cases, reserved cluster instances are used for big data analytics in the Amazon cloud.
I would argue that this is something in the middle between cloud and colocation, and even closer to colocation.
Why it is similar to colocation:
– It is predefined hardware.
– The network is guaranteed (10 Gbps), but only when all servers are allocated together in one placement group.
– It is too expensive to take on-demand, so it is used as reserved (less elastic).
Why it still resembles cloud:
– API for provisioning is available.
– Cloud services, like S3 and EBS are available.
Network bandwidth is still the biggest obstacle to wider adoption of cloud analytics. Moving the very large volumes of data associated with analytics is all but impossible for most enterprises under current network infrastructure limitations. As Curt and Andrew said above, machine-generated (cloud-generated) data is the best fit for cloud analytics: the data is already out there, so it can either be processed in situ or moved to a cloud processing site.
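To put rough numbers on the bandwidth concern raised in these comments, here is a back-of-the-envelope sketch in Python. It takes the "petabyte per day" figure mentioned above and works out the sustained link rate that implies, plus how long a one-time move of a petabyte would take. The function names, the decimal definition of a petabyte (10^15 bytes), the `efficiency` parameter, and the 1 and 10 Gbps link speeds are all illustrative assumptions, not figures from any commenter.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PETABYTE = 1e15  # assumption: decimal petabyte (10^15 bytes)


def sustained_gbps(bytes_per_day: float) -> float:
    """Average link rate, in gigabits per second, needed to ship a
    daily data volume as fast as it is produced."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e9


def transfer_days(total_bytes: float, link_gbps: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move total_bytes over a link of link_gbps.

    efficiency is a hypothetical fudge factor (0 < efficiency <= 1)
    for protocol overhead and contention; 1.0 means a perfect link.
    """
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return total_bytes * 8 / usable_bits_per_second / SECONDS_PER_DAY


# Keeping up with one petabyte per day takes roughly 93 Gbps, sustained.
print(f"1 PB/day needs about {sustained_gbps(PETABYTE):.1f} Gbps")

# A one-time move of one petabyte over illustrative link speeds.
for gbps in (1, 10):
    days = transfer_days(PETABYTE, gbps)
    print(f"1 PB over a perfect {gbps} Gbps link: about {days:.1f} days")
```

Even with a perfect 10 Gbps link, a single petabyte takes more than nine days to move, which is one way to see why analyzing data where it is generated keeps coming up.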