February 15, 2012
Quick notes on MySQL Cluster
According to the MySQL Cluster home page, today’s MySQL Cluster release has — give or take terminology details 🙂 — added transparent sharding (Edit: Actually, please see the first comment below) and a memcached interface. My quick comments on all this to a reporter a couple of days ago were:
- Persistent memcached is a useful thing. Couchbase’s sales illustrate that point: http://www.dbms2.com/2012/02/01/couchbase-update/
- MySQL has always given good performance when used just as a key-value store, e.g. http://www.dbms2.com/2010/08/22/workday-technology-stack/ . So it’s reasonable to hope the memcached interface will have good performance out of the box.
- MySQL’s clustering capabilities have long been weak, providing a window of opportunity for companies and products such as Schooner Information and dbShards. The gold standard for clustering is:
- Efficient transparent sharding: http://www.dbms2.com/2011/02/24/transparent-sharding/
- Synchronous replication at much better than two-phase-commit speeds. http://www.dbms2.com/2011/10/23/schooner-pivots-further/
I don’t really know enough about MySQL Cluster right now to comment in more detail.
Comments
2 Responses to “Quick notes on MySQL Cluster”
Hi Curt,
Good to hear you discussing MySQL Cluster again, though a few clarifications may be needed:
1. ‘MySQL Cluster’ is the brand name of MySQL nodes clustered via the Ndbcluster storage engine. This is not the same thing as a cluster of MySQL servers using non-clustered storage engines, which is sometimes referred to as a ‘MySQL Cluster’, or ‘clustered MySQL’. Many large ‘scale out’ deployments of MySQL are ‘clustered MySQL’, which is not the same thing as ‘MySQL Cluster’. Our fault for a dubious naming decision!
2. ‘MySQL Cluster’ has *always* had transparent sharding. It ‘shards’ rows based on a hash of some part(s) of the primary key (kind of similar to Teradata). Other sharding schemes (range, list, etc.) are supported.
3. ‘MySQL Cluster’ now has support for a Memcached interface / persistent Memcached. In parallel, the InnoDB MySQL storage engine also has a Memcached interface / persistent Memcached facility in development. In both cases, the idea is to give better performance for simpler, more stable access patterns.
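To make point 2 concrete, here is a minimal Python sketch of hash-based distribution on a primary key. It is only an illustration of the general idea, not MySQL Cluster's actual algorithm: the hash function (SHA-256), the shard count, and the names shard_for_key and distribute are all assumptions made for the example.

```python
import hashlib


def shard_for_key(primary_key, num_shards):
    """Map a primary key to a shard number by hashing it.

    SHA-256 is an arbitrary choice for this sketch; it is not the
    hash function MySQL Cluster uses internally.
    """
    digest = hashlib.sha256(str(primary_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(rows, num_shards, key="id"):
    """Place each row in the shard selected by the hash of its key column."""
    shards = {i: [] for i in range(num_shards)}
    for row in rows:
        shards[shard_for_key(row[key], num_shards)].append(row)
    return shards


if __name__ == "__main__":
    rows = [{"id": i, "name": f"row-{i}"} for i in range(20)]
    for shard, members in distribute(rows, 4).items():
        print(shard, [r["id"] for r in members])
```

The application never names a shard; the same key always hashes to the same place, which is what makes the scheme transparent.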
I think your comment regarding weak clustering capabilities certainly applies to the current MySQL Server, but I disagree in the case of MySQL Cluster, as this is what defines it! Capabilities that have existed for a long time include:
– Transparent sharding/distribution
– Parallel query execution
– Synchronous replication via 2PC
– Automatic failure/recovery handling
– Internal HA monitoring, cluster membership etc
– Full distributed transaction support
– Consistent parallel backup
– etc etc.
Perhaps it’s best to clarify which points apply to MySQL Server and which to MySQL Cluster. I am very happy to help clarify any questions you have here or offline.
Frazer Clement
(MySQL Cluster developer)
Curt,
The MySQL Cluster “transparent sharding” capabilities have always been there; it's what is referred to as “Black Box” automated sharding. It's nice because it works automatically, and it is “transparent.” Rows are spread randomly or in a round-robin fashion based on the row’s primary key. This provides a simple mechanism to shard transparently and automatically, but it is only a good idea insofar as tables are normalized to primary keys. In most schemas, the primary key is rarely meaningful, which greatly reduces the efficiency of both reads and writes (due to the “scatter-gather” nature of Black Box sharding).
For example, to reconstruct a list of, say, customer orders, the customer rows will be on one server, each order row could be on another server, and if you go to the order-line level, each of those rows may be on yet another server as well. Any joins have to be reconstructed (which is expensive). Writes have similar issues: this architecture can result in a lot of distributed transactions (the evil in any sharding system).
This “scatter-gather” mechanism is used by lots of “sharding” technologies, but Relational Sharding is far more effective. With Relational Sharding the data is partitioned along application relationships, with the result that related information is stored together (for both read and write operations). The technology to do this is more challenging, but the results are highly effective. With related data on the same shard server, efficiency naturally improves, and the sharding scheme itself preserves the integrity of the application data model.
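The difference between the two approaches can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular product works: the orders table, the four-shard layout, the CRC32 hash, and the helper names shard_of and shards_touched are all hypothetical. It counts how many shards must be consulted to read one customer's orders when rows are placed by their own primary key versus by the customer they belong to.

```python
import zlib

NUM_SHARDS = 4  # assumed cluster size for the illustration


def shard_of(value, num_shards=NUM_SHARDS):
    """Pick a shard from a CRC32 hash of the value (illustrative hash only)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_shards


# Hypothetical order rows: 30 orders spread over 3 customers.
orders = [{"order_id": i, "customer_id": i % 3} for i in range(1, 31)]


def shards_touched(rows, customer_id, shard_key):
    """Return the set of shards holding the given customer's rows
    when rows are placed according to shard_key."""
    return {
        shard_of(row[shard_key])
        for row in rows
        if row["customer_id"] == customer_id
    }


if __name__ == "__main__":
    by_primary_key = shards_touched(orders, 1, "order_id")
    by_customer = shards_touched(orders, 1, "customer_id")
    print("sharded by order_id, shards read:", sorted(by_primary_key))
    print("sharded by customer_id, shards read:", sorted(by_customer))
```

Placing rows by customer_id always yields exactly one shard for a customer's orders, whereas placing them by order_id generally spreads those orders across several shards, which is the scatter-gather cost described above.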