January 8, 2012
Big data terminology and positioning
Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions:
- Bigness — Volume, Velocity, size
- Structure — Variety, Variability, Complexity
given that
- High-velocity “big data” problems are usually high-volume as well.*
- Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction.
But the conflation should stop there.
*Low-volume/high-velocity problems are commonly referred to as “event processing” and/or “streaming”.
When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2×2 matrix of possibilities. For want of better alternatives, my suggestions are:
- Relational big data is data of high volume that fits well into a relational DBMS.
- Multi-structured big data is data of high volume that doesn’t fit well into a relational DBMS. Alternative: Poly-structured big data.
- Conventional relational data is data of not-so-high volume that fits well into a relational DBMS. Alternatives: Ordinary/normal/smaller relational data.
- Smaller poly-structured data is data for which dynamic schema capabilities are important, but which doesn’t rise to “big data” volume.
Notes on all this include:
- “Relational big data” is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.
- The paradigmatic example of “multi-structured big data” is log files. Thus, multi-structured big data is commonly what you need a big bit bucket for.
- One might want to equate non-analytic relational big data technology to “NewSQL”. However, I’m struggling to think of a database size range in which the entire NewSQL industry can match Oracle’s market share alone.
- One might want to equate non-analytic multi-structured big data technology to “NoSQL”. However:
- “NoSQL” is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.
- “NoSQL” has non-ACID/low(er)-data-integrity connotations that aren’t appropriate for all non-relational systems.
- Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
- 1-2 terabytes if you’ve never bought anything past Oracle Standard Edition.
- 5-10 terabytes if you’re already paying for Oracle Enterprise Edition.
- A lot higher than that if you actually find Oracle Exadata to be cost-effective.
- Depending on how big one acknowledges as “big”, the market share leader in “big bit bucket” use cases is either Splunk or Hadoop.
- If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.
- It is wrong to say that the large web companies invented “big data” technology. But it is more reasonable to say they invented much of “multi-structured big data” management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase and various predecessors to same.
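The log-file case above is a good illustration of why "dynamic schema" capabilities matter for multi-structured data. As a minimal sketch (the sample lines, field names, and parsing rules are all hypothetical), schema-on-read means extracting whatever fields each record happens to carry, rather than forcing everything into one relational schema up front:

```python
import json
import re

# Hypothetical log sample: lines rarely share one fixed schema.
# Some are key=value pairs, some are JSON, some are free text.
raw_lines = [
    '2012-01-08 12:00:01 INFO user=alice action=login',
    '{"ts": "2012-01-08T12:00:02", "level": "ERROR", "msg": "disk full"}',
    '2012-01-08 12:00:03 WARN cache miss rate high',
]

def parse(line):
    """Schema-on-read: pull out whatever fields each record happens to have."""
    if line.lstrip().startswith('{'):
        return json.loads(line)  # JSON-formatted record
    record = {'raw': line}
    # Harvest any key=value pairs; free-text lines keep only 'raw'.
    record.update(dict(re.findall(r'(\w+)=(\S+)', line)))
    return record

parsed = [parse(line) for line in raw_lines]
# Each record ends up with its own set of fields, with no fixed schema required.
```

This is roughly what a "big bit bucket" plus downstream parsing buys you: you can land the data first and decide on structure per record at read time.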
Categories: Cassandra, Data models and architecture, Data warehousing, Exadata, Facebook, Google, Hadoop, HBase, Log analysis, Market share and customer counts, MarkLogic, NewSQL, NoSQL, Oracle, Splunk, Yahoo
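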
Comments
10 Responses to “Big data terminology and positioning”
Curt, great piece as always, with a good, math-proof-like breakdown of the concepts.
One other dimension I wanted to add is Analytics, which we see as key to defining big data properly. This is important in two separate ways:
a) You need more than SQL to analyze big data. MapReduce is key here. This is something that a lot of “Big Data” vendors don’t really have, and they will often tell you that you don’t need it (because they can’t provide it natively).
b) If you look at the most successful applications of Big Data, it is all about discovery and investigative analytics (http://www.dbms2.com/2011/03/03/investigative-analytics/). It’s about quickly collecting and analyzing different multi-structured data sources (e.g. customer interaction data, mobile data, web data) and discovering where the gold is hidden, without breaking the bank with too much process or people. It also means a “fast-fail” approach: if a combination of data + analytics doesn’t produce what you hypothesize, you quickly move on to explore other opportunities. This is very different from the old-school waterfall BI model, where a well-defined question (e.g. “how many sales do I have by product and region?”) is converted to data, ETL and BI reports through a well-defined process.
I would argue that a system or deployment is not Big Data if it doesn’t have both beyond-SQL capabilities and an investigative/discovery approach on how it performs its Analytics.
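(For readers unfamiliar with the pattern: below is a toy, single-machine sketch of the map and reduce phases, using the canonical word-count example. It is illustrative only, not any vendor's API; real systems such as Hadoop distribute both phases across a cluster.)

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Map: turn one input record into a list of (key, value) pairs."""
    return [(word, 1) for word in record.split()]

def reduce_phase(pairs):
    """Reduce: aggregate all values that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical input records.
records = ["big data", "big analytics"]
counts = reduce_phase(chain.from_iterable(map_phase(r) for r in records))
# counts == {"big": 2, "data": 1, "analytics": 1}
```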
Thanks,
Tasso Argyros
co-president, Teradata Aster
Tasso,
Those are great observations about “dealing with”, “managing”, or “exploiting” Big Data — but what do they have to do with “defining” the term?? 🙂
Best,
CAM
Curt, if you believe that Big Data is only about the data, you’re right. If you think that analytics are a critical component of “Big Data”, they ought to be in the definition.
E.g. I can put multi-structured data in almost any relational database, no problem. It can all be stored in BLOBs. But if I can’t analyze it properly (say, with MapReduce), is it really Big Data? I think not.
Another way to look at this is that I was just elaborating on what “fits well” means in your definition of Big Data. It’s again not about the storage features but the analytical capabilities.
Best,
Tasso
Tasso,
I think analytics are something one does to data. I don’t think they’re a “component” of the data.
More directly, I’ve noted multiple times in the past that Aster makes a case for getting multi-structured data into a relational database sooner than non-Aster alternatives would seem to suggest, and that the difference is based on analytics. But I don’t think that’s a good enough reason to make the terminological mess even worse than it already is.
I agree that the problem of poly-structured data is not new at all, and I think it has little to do with “Big Data”. It’s a problem for even the smallest databases. In a relational model you have to design the database schema with your use cases in mind, and if you later decide you want to store some other data, you have to rework the database.
May I suggest calling it “Hairy Data” instead?