Things I keep needing to say
Some subjects just keep coming up. And so I keep saying things like:
Most generalizations about “Big Data” are false. “Big Data” is a horrific catch-all term, with many different meanings.
Most generalizations about Hadoop are false. Reasons include:
- Hadoop is a collection of disparate things, most particularly data storage and application execution systems.
- The transition from Hadoop 1 to Hadoop 2 will be drastic.
- For key aspects of Hadoop — especially file format and execution engine — there are or will be widely varied options.
Hadoop won’t soon replace relational data warehouses, if indeed it ever does. SQL-on-Hadoop is still very immature. And you can’t replace data warehouses unless you have the power of SQL.
Note: SQL isn’t the only way to provide “the power of SQL”, but alternative approaches are just as immature.
Most generalizations about NoSQL are false. Different NoSQL products are … different. It’s not even accurate to say that all NoSQL systems lack SQL interfaces. (For example, SQL-on-Hadoop often includes SQL-on-HBase.)
“Big Data” doesn’t create rapid IT growth. If we only had traditional kinds of data, IT growth would be drastically negative, since Moore’s Law swamps traditional data growth. Whole new categories of data are always needed to fill the gap. And these days, they’re all categorized as “Big Data”.
The single central database is a myth. Things are never that simple, at least at large enterprises. Hence, in particular, the ideal EDW (Enterprise Data Warehouse) is a myth.
Analytic RDBMS and appliances aren’t necessarily expensive. Deals can be had. Yes, most vendors want at least a few hundred thousand dollars for most sales, but there are plenty of exceptions even to that rule. And at either large or small scales, things get very cheap, for example:
- Various vendors’ free/“community” editions.
- The $2 million/petabyte hardware+software price I published for Vertica.
And Infobright is typically an economical option in between those extremes, if you’re cool with its focus on machine-generated data.
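As a rough illustration of what the $2 million/petabyte figure implies per terabyte, here is a minimal Python sketch. It assumes decimal units (1 PB = 1,000 TB), and the function name is hypothetical; real vendor pricing is negotiated and need not scale this way.

```python
# Illustrative arithmetic only: converts a per-petabyte price to a per-terabyte price.
# The $2 million/petabyte figure is the one quoted in the post.
# Assumption: decimal units, i.e. 1 PB = 1,000 TB.

PRICE_PER_PB_USD = 2_000_000
TB_PER_PB = 1_000  # assumption: decimal petabyte


def price_per_tb(price_per_pb: float, tb_per_pb: int = TB_PER_PB) -> float:
    """Return the per-terabyte price implied by a per-petabyte price."""
    return price_per_pb / tb_per_pb


if __name__ == "__main__":
    print(f"${price_per_tb(PRICE_PER_PB_USD):,.0f} per TB")
```

Under those assumptions the figure works out to $2,000 per terabyte of hardware plus software.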
Columnar relational DBMS are relational. Examples include Sybase IQ, Vertica, ParAccel, Infobright and numerous others.
Yes, that’s a tautology. Even so, distressingly many people forget it, columnar RDBMS vendor employees not excepted.
Amazon Redshift proves very little about ParAccel. Amazon bought some stock in ParAccel, and got a cheap license to a subset of ParAccel’s code, perhaps in the same deal. Big whoop. Yes,
- It is claimed that there are a lot of Redshift users (I presume low-end ones).
- ParAccel is fast.*
But none of that speaks to some profound, ongoing Amazon/ParAccel/Actian relationship.
*I hear that ParAccel is usually faster than Vertica and other alternatives in POCs (Proofs of Concept) and benchmarks. But I also hear that ParAccel’s installation complexity continues to be a POC problem.
New technology in old categories of application will only be adopted as quickly as firms replace their apps. Yes, that’s a tautology too. Even so, it puts an upper bound on, for example, the speed with which on-premises applications will be replaced by cloud alternatives.
SAP HANA is not yet a serious OLTP (OnLine Transaction Processing) DBMS. Yes,
- HANA has in some form been under development for a long time; its major antecedent is BI Accelerator, which shipped back in 2006.
- RAM-centric processing makes sense.
- HANA has a cool-sounding feature list.
- SAP claims lots of HANA sales, and not just in conjunction with a few new SAP apps that require HANA to run.
But the stories of HANA sales and deployment momentum sure seem concentrated on analytic use cases. And by the way — even among analytic DBMS vendors, I don’t hear much emphasis on competing vs. HANA.
Current BI trends reflect 1990s deja vu. The hottest business intelligence products and vendors are adopted by departments, on the strength of their snazzy interfaces and short adoption cycles.* That’s exactly how BI spread in the 1990s, only now the word “visualization” gets used more.
*A common phrase for that is land-and-expand.
And finally,
I’m not impressed that your future products will in some small ways be superior to what your competitors have had in production for over a year.
Comments
10 Responses to “Things I keep needing to say”
Curt, this is a great post. It competently cuts through an enormous amount of rhetoric in the industry that I see every day. Well done.
Cheers!
Thanks!
I’m ever trying to narrow the gap between what I write and what I say in (non-NDA) conversation. My back-to-back panel & video last week helped bring that into particular focus.
Good to see all ‘Curt-isms’ in a single post…
Your point about new technologies in old categories is very important (and often overlooked).
If I were a traditional analytics DBMS vendor, I would be quite concerned about Amazon Redshift. The value proposition is that compelling. The AWS folks often tell a story of meeting with a prospective customer who begins laughing when told that you can resize a cluster in 15 minutes with Redshift, versus the 6+ months it takes with their existing on-premises solution due to hardware procurement, provisioning, re-optimization, etc.
For SQL on Hadoop, you should really take a look at Lingual: http://www.cascading.org/2013/08/07/lingual-an-introduction/
You should do this semi-annually.
Ken,
I’d hardly say those are ALL of them. 🙂
And it doesn’t take long to resize a Netezza cluster. Price comparisons may be a different matter.
Curt, what do you mean by “resize Netezza cluster”?
Netezza has a “buy big new box, throw small old box away” selling model.
Pavel,
I’ll confess to not keeping track of exactly how long you can go between Netezza purchases and still have the boxes work together. I imagine it’s less than Teradata’s “investment protection” is good for. On the other hand, it’s more than zero.