DBMS product categories
Analysis of database management technology in specific product categories.
Couchbase 2.0
My clients at Couchbase checked in.
- After multiple delays, Couchbase 2.0 is well into beta, with general availability being delayed by the holiday season as much as anything else.
- Couchbase (the company) now has >350 subscription customers, almost all for Couchbase (the product) — which is to say for what was known as Membase, which is basically a persistent version of Memcached.
- There also are many users of open source Couchbase, most famously LinkedIn.
- Orbitz is a much-mentioned flagship paying Couchbase customer.
- Couchbase customers mainly seem to be replacing a caching layer, Memcached or otherwise.
- Couchbase headcount is just under 100.
The big changes in Couchbase 2.0 versus the previous (1.8.x) version are:
- JSON storage, including secondary indexes (as sketched right after this list).
- Multi-data-center replication.
- A back-end change from SQLite to a heavily forked version of CouchDB, called Couchstore.
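Given that CouchDB heritage, a secondary index in 2.0 is defined as a JavaScript map function in a design document and queried over REST. Here is a minimal, hypothetical sketch in Python, assuming a 2.0-era node exposing the view API on port 8092 and a bucket named "default"; the design-document, view, and field names are invented for illustration:

```python
import json
import urllib.request

# Assumptions for this sketch: a Couchbase 2.0 node on localhost, the
# CouchDB-style view API on port 8092, a bucket named "default". The
# design-document, view, and field names are invented for illustration.
BASE = "http://localhost:8092/default"

# Define a secondary index as an incremental map/reduce view.
ddoc = {"views": {"by_name": {
    "map": "function (doc, meta) { if (doc.name) emit(doc.name, null); }"
}}}
req = urllib.request.Request(
    BASE + "/_design/users",
    data=json.dumps(ddoc).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
urllib.request.urlopen(req)

# Query the index for documents whose name field is "alice"
# (%22 is a URL-encoded double quote around the JSON key).
with urllib.request.urlopen(
        BASE + '/_design/users/_view/by_name?key=%22alice%22') as resp:
    print(json.load(resp))
```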
Couchbase 2.0 is upwards-compatible with prior versions of Couchbase (and hence with Memcached), but not with CouchDB.
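That Memcached compatibility means an existing caching client should keep working when pointed at Couchbase. As a minimal sketch, here is the raw Memcached text protocol driven from Python's standard library alone; the host, port, key, and document are illustrative:

```python
import socket

# Assumptions: a Couchbase node answering the Memcached text protocol on
# the usual port 11211; the key and JSON document are invented.
sock = socket.create_connection(("localhost", 11211))

doc = b'{"type": "user", "name": "alice"}'
sock.sendall(b"set user::alice 0 0 %d\r\n%s\r\n" % (len(doc), doc))
print(sock.recv(1024))  # expect b"STORED\r\n"

sock.sendall(b"get user::alice\r\n")
print(sock.recv(1024))  # a VALUE header, the JSON document, then END
sock.close()
```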
Technology notes on Couchbase 2.0 include: Read more
More on Cloudera Impala
What I wrote before about Cloudera Impala was quite incomplete. After a followup call, I now feel I have a better handle on the whole thing.
First, some basics:
- Impala is open source code, developed to date entirely by Cloudera people, which adds analytic DBMS capabilities to Hadoop as an alternative to Hive.
- Impala is in public beta, and is targeted for general availability Q1 2013 or so.
- Cloudera plans to get paid for Impala by providing support, and by offering Impala management through its proprietary Cloudera Manager.
- Impala has been under development for about 2 years. A team of 7 or so developers has been mainly in place for over a year. Furthermore, …
- … notwithstanding that it’s best viewed as a Hive alternative, Impala actually reuses a lot of Hive.
The general technical idea of Impala is:
- It’s an additional daemon that runs on each of your Hadoop nodes.
- Thus, Impala is not subject to Hadoop MapReduce’s latency in starting up Java processes or in storing intermediate result sets to disk.
- Impala operates as a distributed parallel analytic DBMS.*
- Impala works with a variety of Hadoop storage options, each with its own implications for latency or performance.
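To make the client-side picture concrete: because Impala reuses Hive's query language and drivers, a query is just HQL sent to an impalad. A hypothetical sketch follows; the DB-API-style connector module is a placeholder (not a real published package), and the host, port, and table names are invented:

```python
import hive_dbapi_driver as db  # placeholder for any Hive/Impala DB-API connector

# Connect to the Impala daemon on any Hadoop node; that impalad
# coordinates the query across the cluster's other daemons.
conn = db.connect(host="hadoop-node-1", port=21000)  # port illustrative
cur = conn.cursor()

# Plain HQL. No MapReduce job is launched, so there is neither JVM
# start-up latency nor spilling of intermediate results to disk.
cur.execute("""
    SELECT page, COUNT(*) AS hits
    FROM weblogs
    WHERE dt = '2012-10-24'
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for page, hits in cur.fetchall():
    print(page, hits)
```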
Notes and comments — October 31, 2012
Time for another catch-all post. First and saddest — one of the earliest great commenters on this blog, and a beloved figure in the Boston-area database community, was Dan Weinreb, whom I had known since some Symbolics briefings in the early 1980s. He passed away recently, much much much too young. Looking back for a couple of examples (even if you've never heard of him before), I see that Dan's 2009 comment on Tokutek is still interesting today, and so is a post on his own blog disagreeing with some of my choices in terminology.
Otherwise, in no particular order:
1. Chris Bird is learning MongoDB. As is common for Chris, his comments are both amusing and enlightening.
2. When I relayed Cloudera’s comments on Hadoop adoption, I left out a couple of categories. One Cloudera called “mobile”; when I probed, that was about HBase, with an example being messaging apps.
The other was “phone home” — i.e., the ingest of machine-generated data from a lot of different devices. This is something that’s obviously been coming for several years — but I’m increasingly getting the sense that it’s actually arrived.
Quick notes on Impala
Edit: There is now a follow-up post on Cloudera Impala with substantially more detail.
In my world it’s possible to have a hasty 2-hour conversation, and that’s exactly what I had with Cloudera last week. We touched on hardware and general adoption, but much of the conversation was about Cloudera Impala, announced today. Like Hive, Impala turns Hadoop into a basic analytic RDBMS, with similar SQL/Hadoop integration benefits to those of Hadapt. In particular:
- Impala is Hive-compatible in query language (HQL, which is a whole lot like SQL), metadata, JDBC/ODBC drivers, etc.
- Unlike Hive, Impala does not work through Hadoop MapReduce.
- Unlike Hadoop MapReduce and hence Hive, Impala does not persist intermediate results to disk. This is good for performance, but on extremely long-running queries it increases the risk you’ll have a node failure and have to restart the query from scratch.
- Impala in its first version is missing some Hive syntax, notably in support for UDFs (User-Defined Functions).
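To illustrate that last gap: the standard Hive idiom for wiring in a UDF is exactly what first-release Impala doesn't yet accept. A sketch, using the same placeholder connector as in the longer Impala post above; the jar path, Java class, and function names are invented:

```python
import hive_dbapi_driver as db  # placeholder for any Hive DB-API connector

cur = db.connect(host="hive-server", port=10000).cursor()

# Ordinary Hive: register a UDF from a jar, then call it in a query.
# Per the above, a first-version Impala daemon would reject these.
cur.execute("ADD JAR /tmp/my_udfs.jar")
cur.execute("CREATE TEMPORARY FUNCTION ip_to_geo AS 'com.example.udf.IpToGeo'")
cur.execute("SELECT ip_to_geo(ip), COUNT(*) FROM weblogs GROUP BY ip_to_geo(ip)")
print(cur.fetchall())
```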
Beyond that: Read more
Notes on Hadoop hardware
I talked with Cloudera yesterday about an unannounced technology, and took the opportunity to ask some non-embargoed questions as well. In particular, I requested an update to what I wrote last year about typical Hadoop hardware.
Cloudera thinks the picture now is:
- 2-socket servers, with 4- or 6-core chips.
- An increasing number of spindles, with twelve 2-TB spindles now common.
- 48 gigs of RAM is most common, with 64-96 fairly frequent.
- A couple of 1GigE networking ports.
Discussion around that included:
- Enterprises had been running out of storage space; hence the increased amount of storage. 🙂
- Even more storage can be stuffed on a node, and at times is. But at a certain point there's so much data on a node that recovering from its failure becomes prohibitively slow (see the back-of-envelope sketch after this list).
- There are some experiments with 10 GigE.
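A back-of-envelope calculation shows why piling on even more spindles gets forbidding. It deliberately ignores that HDFS re-replicates a dead node's blocks from many source nodes in parallel (so real recovery is faster), but the data volumes alone are sobering:

```python
# Assumptions straight from the post: 12 x 2 TB spindles per node,
# 2 x 1GigE ports. Everything else is a deliberately crude simplification.
node_bytes = 12 * 2e12          # ~24 TB of raw storage on one node
nic_bytes_per_sec = 2 * 125e6   # 2 x 1GigE is roughly 250 MB/s

# If re-replication were bottlenecked at one node's network bandwidth,
# recovering a full node's worth of data would take:
hours = node_bytes / nic_bytes_per_sec / 3600
print(f"~{hours:.0f} hours")    # roughly 27 hours
```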
Notes on analytic hardware
I took the opportunity of Teradata’s Aster/Hadoop appliance announcement to catch up with Teradata hardware chief Carson Schmidt. I love talking with Carson, about both general design philosophy and his views on specific hardware component technologies.
From a hardware-requirements standpoint, Carson seems to view Aster and Hadoop as more similar to each other than either is to, say, a Teradata Active Data Warehouse. In particular, for Aster and Hadoop:
- I/O is more sequential.
- The CPU:I/O ratio is higher.
- Uptime is a little less crucial.
The most obvious implications are in the choice of parts, and in their ratios. Also, in the new Aster/Hadoop appliance, Carson is content to skate by with RAID 5 rather than RAID 1.
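The capacity arithmetic behind the RAID choice is simple. With illustrative numbers (a ten-disk group of 2-TB drives):

```python
disks, tb_each = 10, 2

raid1_usable = disks // 2 * tb_each   # mirrored pairs: 10 TB usable
raid5_usable = (disks - 1) * tb_each  # one disk's worth of parity: 18 TB usable
print(raid1_usable, raid5_usable)
```

The trade-off is that RAID 5 rebuilds are slower and riskier than remirroring, which is easier to accept where, per the above, uptime is a little less crucial.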
I think Carson’s views about flash memory can be reasonably summarized as: Read more
IBM Pure jargon
As best I can tell, IBM now has three related families of hardware/software bundles, aka appliances, aka PureSystems, aka something that sounds like “expert system” but in fact has nothing to do with the traditional rules-engine meaning of that term. In particular,
- One of the three families is for the data tier, under the name PureData. That’s what’s new today.
- One of the three families is for the application tier, under the name PureApplication. More information can be found here.
- One of the three families is for “infrastructure”, under the name PureFlex. More information can be found here.
Within the PureData line, there are three sub-families:
- One is based on DB2 pureScale and is said to be “optimized exclusively for transactional data workloads”.
- One is based on Netezza, and is said to be “optimized exclusively for analytic workloads”.
- One is based on DB2 with the shared-nothing option, and is said to be “optimized exclusively for operational analytic data workloads”, notwithstanding that the underlying software has for years been IBM’s flagship general-purpose (non-mainframe) DBMS.
The Netezza part of the story seems to start:
- The Netezza name is being deprecated, except insofar as certain PureData systems are “Powered by Netezza Technology.”
- Netezza didn’t trumpet slipstream hardware enhancements even when it was independent, and IBM sure isn’t reversing that policy now.
- The Netezza software has been enhanced, most notably in a ~20X improvement in concurrency for “tactical” queries.
Perhaps someday I’ll be able to supply interesting details, for example about the concurrency improvement or about the uses (if any) customers are finding for Netezza’s in-database analytics — but as previously noted, analyzing big companies is hard.
Notes on the Oracle OpenWorld Sunday keynote
I’m not at Oracle OpenWorld, but as usual that won’t keep me from commenting. My bottom line on the first night’s announcements is:
- At many large enterprises, Oracle has a lock on much of their IT efforts. (But not necessarily in the internet or investigative analytics areas.) Tonight’s announcements serve to strengthen that.
- Tonight’s announcements do little to help Oracle in other market segments.
In particular:
1. At the highest level, my view of Oracle’s strategy is the same as it’s been for several years:
Clayton Christensen’s The Innovator’s Solution teaches us that Oracle should focus on selling a thick stack of technology to its highest-end customers, and that’s exactly what Oracle does focus on.
2. Tonight’s news is closely in line with what Oracle’s Juan Loaiza told me three years ago, especially:
- Oracle thinks flash memory is the most important hardware technology of the decade, one that could lead to Oracle being “bumped off” if they don’t get it right.
- Juan believes the “bulk” of Oracle’s business will move over to Exadata-like technology over the next 5-10 years. Numbers-wise, this seems to be based more on Exadata being a platform for consolidating an enterprise’s many Oracle databases than it is on Exadata running a few Especially Big Honking Database management tasks.
3. Oracle is confusing people with its comments on multi-tenancy. I suspect:
- What Oracle is talking about when it says “multi-tenancy” is more like consolidation than true multi-tenancy.
- Probably there are a couple of true multi-tenancy features as well.
4. SaaS (Software as a Service) vendors don’t want to use Oracle, because they don’t want to pay for it.* This limits the potential impact of Oracle’s true multi-tenancy features. Even so: Read more
Notes on Hadoop adoption
I successfully resisted telephone consulting while on vacation, but I did do some by email. One exchange was on the oft-recurring subject of Hadoop adoption. I think it's OK to adapt some of that into a post.
Notes on past and current Hadoop adoption include:
- Enterprise Hadoop adoption is for experimental uses or departmental production (as opposed to serious enterprise-level production). Indeed, it’s rather tough to disambiguate those two. If an enterprise uses Hadoop to search for new insights and gets a few, is that an experiment that went well, or is it production?
- One of the core internet-business use cases for Hadoop is a many-step ETL, ELT, and data refinement pipeline, with Hadoop executing some or many of the steps (a sketch of one such step follows this list). But I don't think that's in production at many enterprises yet, except in the usual forward-leaning sectors of financial services and (we're all guessing) national intelligence.
- In terms of industry adoption:
- Financial services on the investment/trading side are all over Hadoop, just as they’re all over any technology. Ditto national intelligence, one thinks.
- Consumer financial services, especially credit card, are giving Hadoop a try too, for marketing and/or anti-fraud.
- I’m sure there’s some telecom usage, but I’m hearing of less than I thought I would. Perhaps this is because telcos have spent so long optimizing their data into short, structured records.
- Whatever consumer financial services firms do, retailers do too, albeit with smaller budgets.
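As a purely hypothetical illustration of one refinement step in such a pipeline, here is a Hadoop Streaming mapper that parses raw web-log lines into clean tab-separated records for downstream steps; the log format and field layout are made up:

```python
#!/usr/bin/env python
import sys

# Reads raw log lines on stdin, writes timestamp/ip/method/path as TSV.
for line in sys.stdin:
    parts = line.split()
    if len(parts) < 7:
        continue  # drop malformed records rather than poison later steps
    ip, _, _, ts, _, method, path = parts[:7]
    print("\t".join([ts.lstrip("["), ip, method.strip('"'), path]))
```

It would be launched with something like the standard streaming invocation (hadoop jar hadoop-streaming.jar -input raw_logs -output clean_logs -mapper refine.py -file refine.py), with later pipeline steps consuming the TSV output.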
Thoughts on how Hadoop adoption will look going forward include: Read more
Integrated internet system design
What are the central challenges in internet system design? We probably all have similar lists, comprising issues such as scale, scale-out, throughput, availability, security, programming ease, UI, and general cost-effectiveness. Screw those up, and you don't have an internet business.
Much new technology addresses those challenges, with considerable success. But the success is usually one silo at a time — a short-request application here, an analytic database there. When it comes to integration, unsolved problems abound.
The top integration and integration-like challenges for me, from a practical standpoint, are:
- Integrating silos — a decades-old problem still with us in a big way.
- Dynamic schemas with joins.
- Low-latency business intelligence.
- Human real-time personalization.
Other concerns that get mentioned include:
- Geographical distribution of data due to privacy laws, which for some users is a hard compliance requirement.
- Logical data warehouse, a term that doesn’t actually mean anything real.
- In-memory data grids, which some day may no longer always be hand-coupled to the application and data stacks they accelerate.
Let’s skip those latter issues for now, focusing instead on the first four.