“Big data” has jumped the shark
I frequently observe that no market categorization is ever precise and, in particular, that bad jargon drives out good. But when it comes to “big data” or “big data analytics”, matters are worse yet. The definitive shark-jumping moment may be Forrester Research’s Brian Hopkins’ claim that:
… typical data warehouse appliances, even if they are petascale and parallel, [are] NOT big data solutions.
Nonsense almost as bad can be found in other venues.
Forrester seems to claim that “big data” is characterized by Volume, Velocity, Variety, and Variability. Others, less alliteratively-inclined, might put Complexity in the mix. So far, so good; after all, much of what people call “big data” is collections of disparate data streams, all collected somewhere in a big bit bucket. But when people start defining “big data” to include Variety and/or Variability, they’ve gone too far.
Up to that point, Hopkins — while wrong — is far from alone. The less common part of his error is to further claim that for data to be “big”, it must be stored in a way that violates the C in the CAP Theorem. Yes, the bigger the data set, the more likely that each datum has low individual value, with immediate consistency not being strictly necessary. But there are plenty of big data use cases in which data accuracy turns out to be a good idea.
It actually is reasonable to say that Volume and Velocity of data go together. If you’re storing 5 terabytes of data per day, you have a “big data” kind of problem, whether you then keep it for 30 days or 3000. It also is reasonable to say that Variety and Variability go together; indeed, I’d guess that what Forrester means by those terms corresponds to multi-structured and poly-structured respectively, and using one of those terms is generally plenty.
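To make that arithmetic concrete, here is a minimal sketch in Python. The 5 terabytes per day and the 30- and 3000-day retention periods come from the paragraph above; the function name and the decimal conversion of 1000 TB per PB are assumptions made only for illustration.

```python
def retained_terabytes(ingest_tb_per_day: float, retention_days: int) -> float:
    """Steady-state volume kept, assuming a constant daily ingest rate."""
    return ingest_tb_per_day * retention_days


# Same ingest rate, two very different retention policies.
for days in (30, 3000):
    total_tb = retained_terabytes(5, days)
    print(f"{days} days retained: {total_tb:,.0f} TB ({total_tb / 1000:,.2f} PB)")
```

The retained totals differ by a factor of 100, yet the daily ingest rate is identical in both cases, which is why Volume and Velocity are hard to separate.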
But while we can whittle four concepts down to two, the reduction should stop there. I say this because any of four combinations is possible (and not just in edge cases):
- Data can be both big and poly-structured. For example, consider the classic Hadoop log-collection use case, or the bigger of MarkLogic’s databases, or of Splunk’s, or even the dynamic-schema parts of relational data warehouses built by Zynga and eBay. And yes, also consider some of the NoSQL-based short-request systems Hopkins was surely thinking of as well.
- Data can be both big and simply-structured. I think most of Teradata’s and Vertica’s petabyte-scale installations would fit that description, the partial counterexamples at eBay and Zynga notwithstanding.
- Data can be not-so-big and poly-structured. Consider, for example, a typical user of InterSystems Caché.
- Data can be not-so-big and simply-structured. Consider, for example, most of the traditional RDBMS world.
To pretend that those four possibilities are only two — “big data” and otherwise — is a travesty.
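The two-dimensional view can be sketched as a small classifier. This is only an illustration: the 100 TB cutoff for "big", the example names, and their sizes are all hypothetical, since nothing above fixes a numeric threshold.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the post does not say where "big" begins.
BIG_THRESHOLD_TB = 100.0


@dataclass(frozen=True)
class DataSet:
    name: str
    size_tb: float
    poly_structured: bool


def quadrant(ds: DataSet) -> str:
    """Place a data set on both axes instead of collapsing them into one label."""
    size = "big" if ds.size_tb >= BIG_THRESHOLD_TB else "not-so-big"
    shape = "poly-structured" if ds.poly_structured else "simply-structured"
    return f"{size}, {shape}"


# Made-up examples, one per quadrant.
examples = [
    DataSet("log-collection cluster", 2000.0, True),
    DataSet("relational data warehouse", 1500.0, False),
    DataSet("small document-style database", 0.5, True),
    DataSet("departmental RDBMS", 0.2, False),
]

for ds in examples:
    print(f"{ds.name}: {quadrant(ds)}")
```

Keeping size and structure as separate fields is the whole design choice here: a single "big data" flag could not tell the first two examples apart, or the last two.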
If the term “big data” has become useless, then what? Gartner may have switched over to “extreme data,” as reported by my clients at SAND, in honor of the multi-V stuff. That would be an improvement. Better yet would be to stop pretending that a matrix with two dimensions has only one. If what you mean is “huge, poly-structured databases”, then that’s what you should say, or something like it.
If things are bad for “big data”, they’re even worse for “big data analytics”, a term that starts out by inheriting all of big data’s problems and adds more of its own. “Big data analytics” surely means “analytics done on big data” — but nobody’s quite sure what “analytics” are. For example:
- I’m OK with “analytic processing” incorporating all of what might be called business intelligence, visualization (which sometimes now is just the new term for BI), data mining, machine learning, predictive analytics (which for some years has been the term for data mining and machine learning), planning, and yet more. However, …
- … others don’t agree, and contrast “analytics” to “OLAP” and/or to “visualization”, and seem to equate “analytics” to “predictive analytics” or something similar.
- The latter is what most people have in mind when they say “big data analytics”, but …
- … vendors who can only lay claim to the “analytics” term in its most expansive sense claim to be doing “big data analytics” as well.
Nonsense even worse than Forrester’s ensues.
So here’s what I propose.
- Nobody should ever again say that “big data” doesn’t include big relational data warehouses.
- If your definition of “big data” goes beyond Volume and perhaps Velocity to include Variety, Variability, or Complexity — please call it something else instead. “Extreme data” sounds like a snowboarding competition or something, but at least it’s not as totally erroneous as “big”.
- Never, ever use the phrase “big data analytics” unless you have modifiers near it, to show what kind of big data analytics you’re talking about, or at least to describe the special value you think you bring to the big data analytics process.
Edit: Merv Adrian of Gartner Group has a more reasonable — and wittier! — take than Forrester’s:
You won’t see us telling people “That’s not #bigdata. This is big data.” That’s Crocodile Dundee’s job.
Comments
39 Responses to ““Big data” has jumped the shark”
Curt, while initially I agreed with your thoughts here entirely, in thinking about it more it seems I am on the fence.
Specifically, in looking at the meaning of “big data” we should focus on its origin. My understanding is that big data always meant “processing at a scale beyond the capabilities of a traditional (relational) database.”
The part on RDBMS is important as the first big data solutions were also some flavor of NoSQL. I could be mistaken there, but I’m fairly sure of that.
So I guess what I’m trying to say is: I agree that petascale, parallel data warehouses should be included as ‘big data solutions’, but I am confused as to why the 3 or 4 Vs do not hold as characteristics of BD. Is it not true that the newer EDW have some sort of polystructured capability, even if it’s rudimentary or early stage?
I do agree with your thoughts on ‘analytics’ – wholly marketing spin.
Thanks for doing what you do.
Huh and really?
You refute with no logical counter and your blog lacks coherence. If you would like to compare facts and discuss the logic that I base my conclusions on, I’d be happy to. I’ve spent hundreds of hours and talked to at least 30 companies on the topic of big data and my assertions will stand.
This vitriolic nonsense you dishonored your blog with does not become your credentials.
Brian, when will you and others in the press learn that quantity is NOT a synonym for quality?
Couldn’t care less about your “30 companies”, or if they were 10000. What you said makes no sense. No one mentioned the “companies”.
Joe,
Your understanding is wrong. Try googling on the phrase “big data” in a time period ending some years ago. You’ll find that most of the references are to tasks that a scale-out relational analytic DBMS is meant to handle, and for which it would seem to be the right technology.
Since then I can confirm the same from personal memory.
Over the past few years use of the term “big data” has EXPANDED. But it’s bizarre to say that the term no longer applies to the very situations it was invented (or popularized) to cover.
Whenever you try to define a category, you engage in a dangerous exercise of fuzzy semantics. I’m reminded of the Aboriginal language Dyirbal, which defines a category called “balan” which includes “fire, women and dangerous objects,” but also birds that are not dangerous.
Terms like BigData are boundary objects, artifacts that are shared across different communities but understood differently. Having a shared understanding of red and green traffic lights is good, but of this term BigData, what difference does it make if Curt feels it can encompass relational databases or not?
I don’t understand the abrasive tone of Mr. Hopkins.
Neil,
I gather you’ve read some George Lakoff. 🙂
Brian,
I’m not objecting to your judgment about actual or desirable trends in the use of technology. In the few paragraphs I read, they seem reasonable, whether or not I disagree about the particulars. What I’m objecting to is your irresponsible use of language.
Marketers often see it as their job to deliberately confuse people, by changing the meaning of popular product-category terms in ways that make their product look more appealing. I try to talk my vendor clients out of the more egregious examples of that, telling them that creating so much confusion is usually counterproductive. But at least they have an excuse; it’s what marketers do.
Analysts, however, have no excuse for deliberately misleading their audiences, except in the narrowest of cases. (E.g., when keeping something confidential even at the cost of giving an incorrect impression.)
When you pretend that the term “big data” does not extend to the areas it was pretty much defined for in the first place, you’re being misleading.
I guess unless a term is defined in the dictionary, then it is open for debate and interpretation. BTW, I checked the Webster dictionary but it matched “Big Data” with “Big Daddy”, I checked and the definition partly fitted but the paternalistic authority bit didn’t make sense….
Big Data, to me, just means that. A smeg load of data that is challenging to manage. And what actually a “smeg load” is changes constantly as our methods to manage greater data volumes increases. For me, I don’t see how the term “Big Data” has any technical boundaries associated with it. Perhaps I like to keep things too simple? It took me a while to settle on this simple viewpoint; I used to be more associated with certain technologies, but at some point I realized that Big Data is an ongoing challenge which will remain regardless of current technical solutions. I have probably contradicted myself over the years many times though!
It appears Forrester’s analyst is using “Big Data” to categorize data management technologies other than traditional RDBMS for discussion with enterprise CIOs. Interestingly, some of these are already suffering from over-categorization with NoSQL, NewSQL, PurpleSQL… The discussion with enterprise CIOs is a good thing, as many I know are unfamiliar with these. I can’t argue with the use of a grouping term to keep things simple and define some context for discussion, but proposing “Big Data” as a category definition for a subset of data management technologies to the exclusion of others is something I don’t agree with, well, not today at least.
@Tony:
> “And what actually a “smeg load” is changes constantly as our methods to manage greater data volumes increases. For me, I don’t see how the term “Big Data” has any technical boundaries associated with it.”
Well, I thought a bit about this too. On the one hand, if we now have some volume of data to deal with that exceeds the volume we can comfortably process by a considerable factor, we can say we now have a Big Data problem, which is likely to get solved in the not too long term due to technological advances.
But what if we anticipate that the data volume will grow in the meantime too, at a pace that exceeds the expected pace of data processing and analysis improvements? Then we can say we will still have a Big Data problem in the future.
(There could be multiple causes for this constant growth of data volume – your social network may keep growing, or the devices you use to do your measurements may also improve throughput, precision, sampling rate or whatever over time)
kind regards,
Roland.
Roland,
Of course. We architect for where things are going, not just for the needs identified today. That’s why, for example, I push users towards systems that can comfortably handle their current workloads, not ones that already have to stretch.
Often the biggest changes aren’t just in data volume, but also in depth of analysis and/or speed in returning analytic results. That’s part of the planning process as well. And it kind of justifies the “Velocity” point — crunching the same data 100X as quickly can be just as challenging as crunching 100X more data.
Curt,
All of it I think. And Susan Leigh Star.
Thanks all… This is good dialog. Apologize for earlier tone, but the post that launched this discussion was making errors about what I said and was not written as coherent counter argument.
@curt: Part of your post’s accusation is that I claim somehow big data violates CAP. Guess that comment kind of caused me to write off your whole post because it told me you didn’t really read what I said, but rather were interested in analyst bashing, which we all get enough of already. Understand your point about the origin of the term; I’m looking into it.
@all – regarding not defining things. Apologies but I have the exact opposite view. Our clients pay us to help them make sense of a confusing tech world, so if we shy away from defining new industry buzzwords like big data out of some purist notion that we are contributing to the confusion, then we do them no service. They want to know what we think it means, and we have a duty to take a stab at it, even though we might need to adjust later.
@noons – quantity is important to some extent, else how can one claim thoroughness? If I ask someone what big data means who has spoken with no one and done no research, then ask another who has done extensive research, who is more credible?
Last – the original post of mine that started this was a blog, it was not “research”. I differentiate – our research process is rigorous, and I am subject to lots of scrutiny and internal criticism. Once we release a report, then one can say, “Forrester’s Defn of big data is crap” – even though I’d like us to be more professional than that. Right now, that blog is my current opinion and I’m open to be influenced by discussions such as these.
Cheers all (I really need a keyboard for my iPad, geez)
To paraphrase Justice Potter Stewart, “I can’t define it but I know it when I see it.”
Of course he was referring to pornography but it does seem to apply to Big Data 🙂
Brian,
At no time have I made reference to violating the CAP Theorem; I don’t even know what that would mean, given that it has in fact been proven as a theorem.
Rather, I referred to violating the C in CAP Theorem.
I also don’t know why you’d think I’d engage in “analyst bashing”, given how long I’ve been an analyst. 🙂
As for the rest — tone happens. 😉
so would Brian like to explain where Hadoop falls in the CAP spectrum and whether it qualifies to be big data? I assume we would have to analyze HDFS there. HBase (for which the CAP theorem is probably more relevant) picks CP over AP (unlike Cassandra) so maybe that’s not big data either. Why you would bring CAP into the ‘analysis’ side in big data (as opposed to the OLTP side where all the noise is) escapes me but what do I know.
IMHO – the CP vs AP discussion makes significantly more sense for write-heavy apps where you don’t want to fail writes – ever. For the analysis side, which is basically read-heavy, it seems a little odd that this would be important.
Susan,
I used that line in a PaineWebber report about the artificial intelligence industry in the mid 1980s. The copy editor merely insisted that we identify WHICH justice it was, and did the research himself.
Best copy editor I’ve ever worked with …
As a minimalist I like the definition in the first comment: too big to be reasonably handled by current/traditional technologies.
Most of the time when vendors, at least, try to add something to a definition, they are doing it to add their angle or to define their unique-thing into the definition. Ergo, I am always suspect and usually minimalist. Big data is big data. QED. Big is defined relative to existing technology.
By the way, I must say I’d forgotten about “jumping the shark” as an idiom and loved the Wikipedia entry on it: http://en.wikipedia.org/wiki/Jumping_the_shark
To add to what Tony and Dave have said – it’s hard for the term ‘Big Data’ to keep a constant definition because things are changing. What was once considered Big Data is not Big Data anymore (and some is).
Technology shifts have had a lot to do with this. Commodity 64-bit computers equipped with an Intel i7 can handle monstrous amounts of data compared to 10 years ago.
We are still in the midst of this shift, and I think it will take everyone time to adjust. Vendors, buyers and analysts.
I agree with Dave’s definition. Concise and enduring.
Is this the first time something has jumped the shark prior to crossing the chasm? I always thought it was mainstream users who spoiled the party, not mainstream analysts 😉
I’m happy to see you found my understanding of the definition agreeable, Dave. As you, Elad, and Tony elaborated on: it’s all relative – in this case, to commonplace/accessible technologies.
Curt, I acknowledge that you are correct in the origins of the term, and reiterate that I also agree that EDW should very well be included under the big data umbrella.
I am still somewhat unclear, though, why the 4 Vs ought not be considered valid characteristics of big data solutions, given the evolving/relative definition of the term. In particular, it seems you are comfortable with velocity and volume, but not variety/variability.
WRT variety, what leading big data solutions today do not have some sort of capability for storing and retrieving ‘various’ (un/polystructured?) data? Quickly glossing over what’s out there today, it looks like all of them do.
Could you elaborate on your aversion toward accepting these characteristics? Is it only because the term has expanded from its original definition? If so, that doesn’t necessarily seem to be a bad thing.
I don’t want to get caught up on this, as to me your point about data warehouses seems to be your main one, but I do want to see if the 4 Vs really are something distinctly different than “big data”. If nothing else because I for one will adamantly refuse to use the term “extreme data” out of principle =]
Joe,
My point is that the other two Vs are interestingly common characteristics of “big data”, but they aren’t — or shouldn’t be — relevant to the DEFINITION of the term.
Let’s face it, big data is a marketing buzzword that has gained traction in the marketplace. Once this happens every vendor jumps on the bandwagon and any attempt to create a clear definition for big data is doomed to failure. At the same time, big data does have value because it represents the ability to support applications that were not possible before. I understand where Curt is coming from on the “Vs” but to me the workload is closely tied to the term. Extreme computing may be a better term and was used for a while, but big data has taken on a life of its own. I agree with Curt that big analytics is not a good term, any more than advanced analytics was.
My primary objection in this discussion is Brian Hopkins’ statement that relational appliances have no part in big data. He gives zero justification for this statement other than that he works for Forrester and supposedly has spoken to many people. Having spent 40 years designing and deploying database systems of many different types, I find that this statement blows my mind. I think the statement shows a complete lack of understanding of database technology. This statement confuses the industry and Forrester should disavow it because it is embarrassing.
The “Big Data” conversation is starting to become an exercise in counting angels on a pinhead with diverging interpretations depending on perspective – technical, analyst, marketing, etc.
What I like about the term is its resonance of similar “Bigs” like “Big Oil” and a nod in the direction of IBM with “Big Iron”. I am all in favour of your usage guidelines too.
A very interesting collection of thoughts from all protagonists involved though, and thanks, Curt, for enriching my vocabulary with the concept of the “Shark Jumping Moment”. I had to look it up!
[…] But in any case, I’d expect topics discussed at XLDB to be what even I might willingly label “big data”. Categories: Scientific research Subscribe to our complete […]
[…] Positioning all this as something to do with “big data” (what a shock). […]
This has been a great dialog in terms of promoting the awareness and issues. Wanted to ensure I gave all a tx for participating. Want to clear a few things up, however.
1. The piece that Curt reacted to was not a formally published “Forrester Report”, it was my own opinion on a blog. Please, let’s keep that in mind. The call for retraction that somebody made in the comments is a bit of an over-reaction. Nobody has all the answers; big data is changing too fast at the moment.
2. I didn’t go into my logic for claiming that big data is not MPP EDW, true. Here it is – 1) data is going to get much bigger than it is today, 2) any system that claims to not force CAP tradeoffs isn’t big data because according to the theorem, sooner or later you will run into CAP problems if you scale big enough, 3) see #1 – we will get that big, then what will MPP EDWs do then? IOW, MPP EDWs spend a lot of effort trying to ensure that C, A, and P are all maintained. Sooner or later, scale will not let them.
Last, Forrester’s position on the inclusion or not of MPP EDW into big data will likely come in the form of a Tech Radar that breaks down the component technical capabilities. It will be the result of much internal debate, study and soliciting industry opinion. Further, it likely will include MPP EDWs, despite my previously stated opinion.
Last, what’s more important than my opinion and definition is the advice we are giving our clients regarding the “what is big data” question – you have to come up with something that makes sense for you so you can wade through the hype and make the right decisions about what to do.
Brian,
Thanks for the followup!
I disagree with you, however, on the relevance of CAP to large-scale analytics. It is easy to conceive of use cases in which a short-term problem in writing data wouldn’t be a serious problem for analytic work. In particular, I’d say that’s true for a large fraction of MPP relational analytic DBMS use cases, now and in the future. And to the extent exceptions arise, they’ll typically be for relatively smaller parts of the overall enterprise data inventory.
I’d further question the force of your “databases are getting bigger” argument. After all, computers are getting bigger too. In fact, for human-generated data (as opposed to machine-generated), in many cases hardware’s capability grows FASTER than databases do. So certain problems, if they’re not present today, will likely also not be present in the future.
This is why I’ve hammered so hard on the machine-generated/human-generated distinction. Problems we have with machine-generated data today WILL in all likelihood persist into the future, because the ability to generate the data is growing just as fast as the ability to process it, software improvements aside. (I.e., both sides of the comparison are influenced by Moore’s Law.)
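The growth comparison can be sketched as compound arithmetic. The annual growth multipliers below are purely hypothetical placeholders chosen to illustrate the shape of the argument; they are not measurements of any real workload or hardware trend.

```python
def ratio_after(years: int, data_growth: float, hardware_growth: float) -> float:
    """Data volume relative to hardware capability after `years`, starting at 1.0.

    Both growth arguments are annual multipliers (1.4 means 40% growth per year).
    """
    return (data_growth / hardware_growth) ** years


# Hypothetical rates: human-generated data growing slower than hardware,
# machine-generated data growing at the same pace as hardware.
scenarios = {
    "human-generated": (1.2, 1.4),
    "machine-generated": (1.4, 1.4),
}

for label, (data_rate, hw_rate) in scenarios.items():
    print(f"{label}: ratio after 5 years = {ratio_after(5, data_rate, hw_rate):.2f}")
```

A ratio that falls below 1.0 means the problem gets easier over time; a ratio that stays at 1.0 means today's difficulty persists, which is the distinction drawn above for machine-generated data.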
Great to see the industry finally get on board with the volume-variety-velocity concept I created at Gartner over ten years ago. Tho it’s humorous and somewhat sad to see other research orgs claim the idea as theirs. Anyone interested in the original 2001 article on the 3Vs (“Three Dimensional Data Challenge”), feel free to reach me. -Doug Laney, VP Research, Gartner
For future reference, here’s a copy of the original Gartner (then Meta Group) piece I wrote 11 years ago first positing the 3Vs, “Three Dimensional Data Challenge: Controlling Volume, Velocity and Variety”: https://www.sugarsync.com/pf/D354224_7061872_35276
Re Big Data, my take: Big Data is merely data that’s an order of magnitude greater than you’re accustomed to…Grasshopper. –Doug Laney, VP Research, Gartner
[…] debates about whether a particular name suits its category are rampant. Here is a link to one such argument about the term “big data” from Curt Monash, an analyst whom I respect a great deal. This debate rages in the Twittersphere […]
[…] about what exactly is Big Data, which I’m not going to get into here. Some big thinkers have weighed in on Big Data (worth it for Merv Adrian’s quote at the end) and I had an interesting discussion recently […]
[…] I don’t dislike the term “big analytics” nearly as much as I do the term “big data“. […]
[…] will cement internet search squarely in the world of — for once I approve of the term — big data. Categories: Autonomy, Coveo, Endeca, Enterprise search, FAST, Google, Lucene, Mercado, […]
[…] and the cloud” to at least the same extent DeepDB is. However, if we’re interpreting “big data” to include multi-structured data support — well, only half or so of the NewSQL products and […]
Interesting. Personally what I’m interested in is learning what organizations are doing with their “big data.” Where are they storing it, how is it managed, and how can they improve their data storage operations.
[…] generalizations about “Big Data” are false. “Big Data” is a horrific catch-all term, with many different […]
[…] Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. […]
[…] I dislike the term “big data” but like the Vs that define it — Volume, Velocity, Variety and Variability. […]
[…] in apps of the sort for which Oracle’s RDBMS is not well-suited. Much of “Big Data” fits that […]