NoSQL
Discussion of NoSQL concepts, products, and vendors.
Some trends that will continue in 2013
I’m usually annoyed by lists of year-end predictions. Still, a reporter asked me for some, and I found one kind I was comfortable making.
Trends that I think will continue in 2013 include:
Growing attention to machine-generated data. Human-generated data grows at the rate business activity does, plus 0-25%. Machine-generated data grows at the rate of Moore’s Law, also plus 0-25%, which is a much higher total. In particular, the use of remote machine-generated data is becoming increasingly real.
Hadoop adoption. Everybody has the big bit bucket use case, largely because of machine-generated data. Even today’s technology is plenty good enough for that purpose, and hence justifies initial Hadoop adoption. Development of further Hadoop technology, which I post about frequently, is rapid. And so the Hadoop trend is very real.
Application SaaS. The on-premises application software industry has hopeless problems with product complexity and rigidity. Any suite new enough to cut the Gordian Knot is or will be SaaS (Software as a Service).
Newer BI interfaces. Advanced visualization — e.g. Tableau or QlikView — and mobile BI are both hot. So, more speculatively, are “social” BI (Business Intelligence) interfaces.
Price discounts. If you buy software at 50% of list price, you’re probably doing it wrong. Even 25% can be too high.
MySQL alternatives. NoSQL and NewSQL products often are developed as MySQL alternatives. Oracle has actually done a good job on MySQL technology, but now its business practices are scaring companies away from MySQL commitments, and newer short-request SQL DBMS are ready for use.
Couchbase 2.0
My clients at Couchbase checked in.
- After multiple delays, Couchbase 2.0 is well into beta, with general availability being delayed by the holiday season as much as anything else.
- Couchbase (the company) now has >350 subscription customers, almost all for Couchbase (the product) — which is to say for what was known as Membase, which is basically a persistent version of Memcached.
- There also are many users of open source Couchbase, most famously LinkedIn.
- Orbitz is a much-mentioned flagship paying Couchbase customer.
- Couchbase customers mainly seem to be replacing a caching layer, Memcached or otherwise.
- Couchbase headcount is just under 100.
The big changes in Couchbase 2.0 versus the previous (1.8.x) version are:
- JSON storage, including secondary indexes.
- Multi-data-center replication.
- A back-end change from SQLite to a heavily forked version of CouchDB, called Couchstore.
Couchbase 2.0 is upwards-compatible with prior versions of Couchbase (and hence with Memcached), but not with CouchDB.
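Since the memcached wire protocol is preserved, code written for a generic memcached client can in principle point at a Couchbase bucket unchanged. Below is a minimal sketch in Python; the pymemcache library, the default port 11211, and the key and document shown are illustrative assumptions of mine, not details from Couchbase’s announcement.

```python
# Minimal sketch: exercising a memcached-compatible Couchbase bucket.
# Assumptions (not from the post): a bucket reachable on the default
# memcached port 11211, and the pymemcache client library.
import json

from pymemcache.client.base import Client

client = Client(("localhost", 11211))

# Store a JSON document as the value of a key. With Couchbase 2.0's JSON
# awareness, documents stored this way can also be indexed and queried.
doc = {"type": "user", "name": "Alice", "visits": 3}
client.set("user::alice", json.dumps(doc).encode("utf-8"))

# Read it back through the same memcached-style interface.
raw = client.get("user::alice")
print(json.loads(raw.decode("utf-8")))
```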
Technology notes on Couchbase 2.0 include: Read more
Notes and comments — October 31, 2012
Time for another catch-all post. First and saddest — one of the earliest great commenters on this blog, and a beloved figure in the Boston-area database community, was Dan Weinreb, whom I had known since some Symbolics briefings in the early 1980s. He passed away recently, much much much too young. Looking back for a couple of examples — even if you’ve never heard of him before, I see that Dan’s 2009 comment on Tokutek is still interesting today, and so is a post on his own blog disagreeing with some of my choices in terminology.
Otherwise, in no particular order:
1. Chris Bird is learning MongoDB. As is common for Chris, his comments are both amusing and enlightening.
2. When I relayed Cloudera’s comments on Hadoop adoption, I left out a couple of categories. One Cloudera called “mobile”; when I probed, that was about HBase, with an example being messaging apps.
The other was “phone home” — i.e., the ingest of machine-generated data from a lot of different devices. This is something that’s obviously been coming for several years — but I’m increasingly getting the sense that it’s actually arrived.
Integrated internet system design
What are the central challenges in internet system design? We probably all have similar lists, covering issues such as scale, scale-out, throughput, availability, security, programming ease, UI, and general cost-effectiveness. Screw those up, and you don’t have an internet business.
Much new technology addresses those challenges, with considerable success. But the success is usually one silo at a time — a short-request application here, an analytic database there. When it comes to integration, unsolved problems abound.
The top integration and integration-like challenges for me, from a practical standpoint, are:
- Integrating silos — a decades-old problem still with us in a big way.
- Dynamic schemas with joins.
- Low-latency business intelligence.
- Human real-time personalization.
Other concerns that get mentioned include:
- Geographical distribution driven by privacy laws, which for some users is a major compliance requirement.
- Logical data warehouse, a term that doesn’t actually mean anything real.
- In-memory data grids, which some day may no longer always be hand-coupled to the application and data stacks they accelerate.
Let’s skip those latter issues for now, focusing instead on the first four.
Uninterrupted DBMS operation — an almost-achievable goal
I’m hearing more and more stories about uninterrupted DBMS operation. There are no iron-clad assurances of zero downtime; if nothing else, you could crash your whole system yourself via some kind of application bug. Even so, it’s a worthy ideal, and near-zero downtime is a practical goal.
Uninterrupted database operations can have a lot of different aspects. The two most basic are probably:
- High availability/fail-over. If a system goes down, another one in the same data center is operational almost immediately.
- Disaster recovery. Same story, but not in the same data center, and hence not quite as immediate.
These work with single-server or scale-out systems alike. However, scale-out and the replication commonly associated with it raise additional issues in continuous database operation:
- Eventual consistency. Scale-out and replication create multiple potential new points of failure, server and network alike. Eventual consistency ensures that a single such failure doesn’t take any part of the database down.
- The use of replicas to avoid planned downtime. If you do rolling maintenance, then you can keep a set of servers with the full database up at all times.
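As a concrete illustration of the rolling-maintenance idea, here is a toy sketch of the loop an operations script might run. Every function name in it is a hypothetical placeholder rather than any particular DBMS’s API.

```python
# Toy sketch of rolling maintenance: upgrade one node at a time, so that
# replicas of every partition stay online throughout. The cluster/node
# methods are hypothetical placeholders, not a real DBMS's interface.
import time

def rolling_upgrade(nodes, cluster):
    for node in nodes:
        cluster.drain(node)              # stop routing new requests to it
        cluster.wait_until_idle(node)    # let in-flight work finish
        node.apply_upgrade()             # patch the DBMS, OS, or firmware
        cluster.rejoin(node)             # bring it back into the cluster
        while not cluster.replicas_healthy():
            time.sleep(5)                # wait for re-replication to catch up
        # Only then move on; the full database stays readable and writable.
```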
Finally, if you really care about uninterrupted operation, you might also want to examine:
- Administrative tools and utilities. The better your tools, the better your chances of keeping your system up. That applies to anything from administrative dashboards to parallel backup functionality.
- Fencing of in-database analytic processes. If you’re going to do in-database analytics, fenced/out-of-process ones are a lot safer than the alternative.
- Online schema changes. Changing a schema in a relational DBMS doesn’t necessarily entail taking the database offline; it depends on whether the DBMS can apply the change while staying up.
Let’s discuss some of those points below.
Aerospike, the former Citrusleaf
My new clients at Aerospike have a range of minor news to announce:
- A company and product name change (they used to be Citrusleaf).
- Some new people and funding.
- In association with an acqui-hire — of AlchemyDB guy Russ Sullivan — some unspecified future technical plans.
- A community edition (Aerospike, née Citrusleaf, is closed-source).
Mainly, however, they want to call your attention to the fact that they’ve been selling a fast, reliable key-value store, with a number of production references, and to suggest that other organizations should perhaps buy it as well.
Generally, the Aerospike product story is as I described in two posts last year. At the highest level:
- Aerospike has a key-value data model.
- Secondary indexes and so on are still futures.
- Aerospike is clustered, of course.
- Two hardware/storage choices are encouraged:
- Spinning disk, but you keep all your data in RAM.
- Solid-state disk.
Aerospike’s three core marketing claims are performance, consistent performance, and uninterrupted operations.
- Aerospike’s performance claims are supported by a variety of blazing internal benchmarks.
- Aerospike’s consistent performance claims are along the lines of sub-millisecond latency, with 99.9% of responses being within 5 milliseconds, and even a node outage only borking performance for some 10s of milliseconds.
- Uninterrupted operation is a core Aerospike design goal, and the company says that to date, no Aerospike production cluster has ever gone down.
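To give a flavor of what the key-value data model looks like in practice, here is a minimal sketch using Aerospike’s Python client. The host and port, the test namespace, and the record contents are illustrative assumptions on my part.

```python
# Minimal key-value sketch with the Aerospike Python client.
# The host/port, the "test" namespace, and the "users" set are assumptions
# made for illustration; they are not taken from the post.
import aerospike

config = {"hosts": [("127.0.0.1", 3000)]}
client = aerospike.client(config).connect()

# Keys are (namespace, set, user key) tuples; values are dicts of "bins".
key = ("test", "users", "user1")
client.put(key, {"name": "Alice", "visits": 42})

# Reads return the key, some metadata (generation, TTL), and the record.
(_, meta, record) = client.get(key)
print(record)   # {'name': 'Alice', 'visits': 42}

client.close()
```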
Aerospike technical details start with the expected: Read more
How immediate consistency works
This post started as a minor paragraph in another one I’m drafting. But it grew. Please also see the comment thread below.
Increasingly many data management systems store data in a cluster, putting several copies of data — i.e. “replicas” — onto different nodes, for safety and reliable accessibility. (The number of copies is called the “replication factor”.) But how do they know that the different copies of the data really have the same values? It seems there are three main approaches to immediate consistency, which may be called:
- Two-phase commit (2PC)
- Read-your-writes (RYW) consistency
- Prudent optimism 🙂
I shall explain.
Two-phase commit has been around for decades. Its core idea is:
- One node commands other nodes (and perhaps itself) to write data.
- The other nodes all reply “Aye, aye; we are ready and able to do that.”
- The first node broadcasts “Make it so!”
Unless a piece of the system malfunctions at exactly the wrong time, you’ll get your consistent write. And if there indeed is an unfortunate glitch — well, that’s what recovery is for.
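A toy coordinator makes that two-phase structure explicit. This is a sketch of the protocol’s control flow only, not any real DBMS’s implementation, which would also log each step durably so recovery can finish or undo the work after a crash.

```python
# Toy two-phase commit coordinator, sketching the control flow only.
# "participants" are assumed objects with prepare()/commit()/abort() methods.
def two_phase_commit(participants, write):
    # Phase 1: ask every node whether it can apply the write.
    prepared = []
    for node in participants:
        if node.prepare(write):          # "Aye, aye; we are ready and able"
            prepared.append(node)
        else:
            # Any refusal aborts the whole transaction (real systems also
            # handle timeouts here).
            for p in prepared:
                p.abort(write)
            return False
    # Phase 2: everyone said yes, so broadcast "Make it so!"
    for node in participants:
        node.commit(write)
    return True
```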
But 2PC has a flaw: If a node is inaccessible or down, then the write is blocked, even if other parts of the system were able to accept the data safely. So the NoSQL world sometimes chooses RYW consistency, which in essence is a loose form of 2PC: Read more
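RYW consistency is commonly quantified with Dynamo-style quorums: with a replication factor of N, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to overlap on at least one copy whenever R + W > N. Here is a tiny sanity check of that rule; the numbers are purely illustrative.

```python
# Toy check of the quorum rule commonly used for read-your-writes consistency:
# with N replicas, W write acks and R read responses must overlap on at least
# one replica, which is guaranteed whenever R + W > N.
def is_read_your_writes(n_replicas, write_quorum, read_quorum):
    return write_quorum + read_quorum > n_replicas

print(is_read_your_writes(3, 2, 2))   # True:  a common quorum/quorum setup
print(is_read_your_writes(3, 1, 1))   # False: fast, but only eventually consistent
```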
Database diversity revisited
From time to time, I try to step back and build a little taxonomy for the variety in database technology. One effort was 4 1/2 years ago, in a pre-planned exchange with Mike Stonebraker (his side, alas, has since been taken down). A year ago I spelled out eight kinds of analytic database.
The angle I’ll take this time is to say that every sufficiently large enterprise needs to be cognizant of at least 7 kinds of database challenge. General notes on that include:
- I’m using the weasel words “database challenge” to evade questions as to what is or isn’t exactly a DBMS.
- One “challenge” can call for multiple products and technologies even within a single enterprise, let alone at different ones. For example, in this post the “eight kinds of analytic database” are reduced to just a single category.
- Even so, one product or technology may be well-suited to address a couple different kinds of challenges.
The Big Seven database challenges that almost any enterprise faces are: Read more
Is salesforce.com going to stick with Oracle?
Surprisingly often, I’m asked “Is salesforce.com going to stick with Oracle?” So let me refer to and expand upon my previous post about salesforce.com’s database architecture by saying:
- Today, salesforce.com uses Oracle as one of several ways to store data.
- salesforce.com’s use of Oracle isn’t very relational.
- salesforce.com is investing in HBase, after exploring other NoSQL options.
- salesforce.com surely has a very inexpensive Oracle license, reducing pressure to move any time soon. However …
- … salesforce.com’s use of Oracle has flipped from being a marketing advantage to a marketing liability.*
- It will be some years before any NoSQL option is mature enough to handle salesforce.com’s work.
- Especially through Heroku, salesforce.com is getting ever more experience with PostgreSQL.
Some day, Marc Benioff will probably say “We turned off Oracle across most of our applications a while ago, and nobody outside the company even noticed.”
*in that
- The marketing benefit “Oracle — it’s what the trustworthy big boys use” hardly matters any more.
- The marketing annoyance of Larry Ellison citing salesforce.com’s use of Oracle keeps growing.
Note: This blog post is less readable than it would be if I’d found a better workaround to WordPress’ bugs in the area of nested bullet points. I’m sorry.
Notes on HBase 0.92
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production.
- In general, how “enterprise-ready” is Hadoop?
- HBase 0.92 (this post)
As part of my recent round of Hadoop research, I talked with Cloudera’s Todd Lipcon. Naturally, one of the subjects was HBase, and specifically HBase 0.92. I gather that the major themes to HBase 0.92 are:
- Performance, scalability, and so on.
- “Coprocessors”, which are like triggers or stored procedures.
- Security, as the first major application of coprocessors.
HBase coprocessors are Java code that links straight into HBase. As with other DBMS extensions of the “links straight into the DBMS code” kind,* HBase coprocessors seem best suited for very sophisticated users and third parties.** Evidently, coprocessors have already been used to make HBase security more granular — role-based, per-column-family/per-table, etc. Further, Todd thinks coprocessors could serve as a good basis for future HBase enhancements in areas such as aggregation or secondary indexing. Read more