Growth in machine-generated data
In one of my favorite posts, namely When I am a VC Overlord, I wrote:
I will not fund any entrepreneur who mentions “market projections” in other than ironic terms. Nobody who talks of market projections with a straight face should be trusted.
Even so, today I got talked into putting on the record a prediction that machine-generated data will grow at more than 40% per year for a while.
My reasons for this opinion are little more than:
- Moore’s Law suggests that the same expenditure will buy 40% or so more machine-generated data each year.
- Budgets spent on producing machine-generated data seem to be going up.
I was referring to the creation of such data, but the growth rates of new creation and of persistent storage are likely, at least at this back-of-the-envelope level, to be similar.
Anecdotal evidence actually suggests 50-60%+ growth rates, so >40% seemed like a responsible claim.
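For concreteness, the compounding behind those figures can be spelled out in a few lines of Python. This is just an illustrative sketch: the five-year horizon is arbitrary, and a 40% annual rate is roughly the Moore's Law cadence of doubling about every two years.

```python
import math

# Back-of-envelope arithmetic for the growth figures above.
# 40%/year is roughly "doubling every two years"; 50-60%/year compounds much faster.

def doubling_time(annual_rate):
    """Years needed to double at a given compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_rate)

def growth_factor(annual_rate, years):
    """Total multiple after compounding for the given number of years."""
    return (1 + annual_rate) ** years

for rate in (0.40, 0.50, 0.60):
    print(f"{rate:.0%}/year: doubles every {doubling_time(rate):.1f} years, "
          f"{growth_factor(rate, 5):.1f}x in 5 years")
```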
Related links
- My recent survey of machine-generated data topics started with a list of many different kinds of the stuff.
- My 2009 post on data warehouse volume growth makes similar points, and notes that high growth rates mean we likely can never afford to keep all machine-generated data permanently (the sketch after this list makes that arithmetic concrete).
- My 2011 claim that traditional databases will migrate into RAM is sort of this argument’s flipside.
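To make the affordability argument concrete, here is a minimal Python sketch. The 50% growth and 30% per-year price-decline figures are illustrative assumptions, not numbers from the posts above, and the model naively prices all retained data at the current cost per GB. The point is only that when data creation grows faster than storage prices fall, the bill for keeping everything rises every year.

```python
# Illustrative sketch of the "can't keep it all" argument. The rates are
# assumptions for illustration, not figures from the linked posts:
# new data grows 50%/year while cost per GB falls 30%/year.

DATA_GROWTH = 0.50      # assumed annual growth in newly created data
PRICE_DECLINE = 0.30    # assumed annual decline in storage cost per GB

new_data_gb = 1_000.0   # arbitrary starting volume of new data per year
price_per_gb = 0.10     # arbitrary starting price (dollars per GB)
retained_gb = 0.0       # cumulative data kept so far

for year in range(1, 6):
    retained_gb += new_data_gb
    # Simplification: all retained data is priced at this year's cost per GB.
    annual_bill = retained_gb * price_per_gb
    print(f"year {year}: retain {retained_gb:,.0f} GB, storage bill ${annual_bill:,.0f}")
    new_data_gb *= 1 + DATA_GROWTH
    price_per_gb *= 1 - PRICE_DECLINE
```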
Comments
I am wondering if all this data is valuable enough to be stored. For example, it might be hard to justify storing temperature sensor data at one-minute resolution for more than a few weeks.
In other words, I am not sure that growth in the amount of data produced will be reflected in growth in the amount of data stored and analyzed.
David,
It is very unlikely to all be stored. We couldn't pay for storing it all today. As storage gets cheaper (Moore's Law/Kryder's Law), volumes will increase further (Moore's Law/the subject of this post). So if we can't afford to keep everything now, we also won't be able to afford to do so in the future.
That said, one-minute temperature readings aren't the best example, because those don't really take up much volume.
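To put a rough number on that, here is an illustrative calculation; the 16-byte record size is an assumption, not anything stated in the thread.

```python
# Illustrative back-of-envelope for one-minute temperature readings.
# The 16-byte record size (timestamp + sensor id + value) is an assumption.

READINGS_PER_YEAR = 60 * 24 * 365        # one reading per minute
BYTES_PER_READING = 16                   # assumed uncompressed record size

per_sensor_mb = READINGS_PER_YEAR * BYTES_PER_READING / 1_000_000
print(f"{READINGS_PER_YEAR:,} readings/year ~ {per_sensor_mb:.1f} MB per sensor per year")

# Even a fleet of 100,000 such sensors stays under a terabyte per year.
fleet_tb = per_sensor_mb * 100_000 / 1_000_000
print(f"100,000 sensors ~ {fleet_tb:.2f} TB per year")
```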
I wouldn’t doubt the 40 percent growth rate, considering the number of connected machines now generating data that was not previously collected. I agree with David’s question: if the data isn’t stored for future analysis, how are you leveraging its value? Is it immediately analyzed? Are summaries created and stored from the collected data?
Peter Fretty, IDG blogger posting on behalf of SAS
One factor is affordability. Storage gets cheaper every year, and we need to find a way to utilize it.
Peter,
If the data isn’t all being stored, then summaries, highlights and/or samples surely should be.
Event detection is one term I’ve heard used in that connection. Another is data reduction, which is a different sense of the term than “choose the most useful variables on which to base a predictive model”.
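As a purely hypothetical illustration of what such data reduction might look like in code (the jump threshold and the hourly summary granularity are made-up parameters, not anything from the post): keep a coarse summary of every reading, and keep the raw readings only around threshold-crossing events.

```python
from statistics import mean

# Hypothetical sketch of data reduction for a stream of (minute, value) readings:
# store a coarse summary of everything, and raw detail only around "events".
# The 2.0-degree jump threshold and hourly summaries are made-up parameters.

EVENT_THRESHOLD = 2.0   # assumed: a jump this large between readings is an "event"

def reduce_readings(readings):
    """Return (hourly_summaries, event_readings) for a list of (minute, value) pairs."""
    hourly, events = {}, []
    previous = None
    for minute, value in readings:
        hourly.setdefault(minute // 60, []).append(value)
        if previous is not None and abs(value - previous) >= EVENT_THRESHOLD:
            events.append((minute, value))          # keep the raw reading
        previous = value
    summaries = {hour: (min(vals), mean(vals), max(vals)) for hour, vals in hourly.items()}
    return summaries, events

# Usage: two hours of readings with one sudden jump at minute 90.
readings = [(m, 20.0 + (3.0 if m == 90 else 0.0)) for m in range(120)]
summaries, events = reduce_readings(readings)
print(events)        # [(90, 23.0), (91, 20.0)]: the jump up and back down
print(summaries[1])  # (min, mean, max) for the second hour
```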
It would be nice to be able to define a “value per GB” measure for different types of data. Graphing such values together with storage prices would let us predict what types of data will be stored in the future.