DB2 OLTP scale-out: pureScale
Tim Vincent of IBM talked me through DB2 pureScale Monday. IBM DB2 pureScale is a kind of shared-disk scale-out parallel OLTP DBMS, with some interesting twists. IBM’s scalability claims for pureScale, on a 90% read/10% write workload, include:
- 95% scalability up to 64 machines
- 90% scalability up to 88 machines
- 89% scalability up to 112 machines
- 84% scalability up to 128 machines
More precisely, those are counts of cluster “members,” but the recommended configuration is one member per operating system instance — i.e. one member per machine — for reasons of availability. In an 80% read/20% write workload, scalability is less — perhaps 90% scalability over 16 members.
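To make those percentages concrete, here is a quick back-of-the-envelope calculation. It assumes that “X% scalability on N members” means aggregate throughput of roughly X% of N times a single member’s throughput; that reading is my interpretation, not IBM’s stated definition.

```python
# Back-of-the-envelope reading of the quoted scalability figures, relative to
# a single member's throughput of 1.0. Assumes "X% scalability on N members"
# means aggregate throughput of roughly X% * N * (one member's throughput).
claims = [(64, 0.95), (88, 0.90), (112, 0.89), (128, 0.84)]  # (members, scalability)

for members, efficiency in claims:
    aggregate = members * efficiency  # in multiples of one member's throughput
    print(f"{members:>3} members at {efficiency:.0%} scalability "
          f"~= {aggregate:.0f}x one member")
# e.g. 64 members at 95% ~= 61x, 128 members at 84% ~= 108x
```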
Several elements of IBM’s DB2 pureScale architecture are pretty straightforward:
- There are multiple pureScale members (machines), each with its own instance of DB2.
- There’s an RDMA (Remote Direct Memory Access) interconnect, perhaps InfiniBand. (The point of InfiniBand and other RDMA is that moving data doesn’t require interrupts, and hence doesn’t cost many CPU cycles.)
- The DB2 pureScale members share access to the database on a disk array.
- Each DB2 pureScale member has its own log, also on the disk array.
Something called GPFS (General Parallel File System), which comes bundled with DB2, sits underneath all this. The overall design is based on the mainframe technology IBM Parallel Sysplex.
The weirdest part (to me) of DB2 pureScale is something called the Global Cluster Facility, which runs on its own set of boxes. (Edit: Actually, see Tim Vincent’s comment below.) These might have 20% or so of the cores of the member boxes, with perhaps a somewhat higher percentage of RAM (especially in the case of write-heavy workloads). Specifically:
- The DB2 pureScale Global Cluster Facility maintains a buffer pool (cache) shared by all the DB2 pureScale members.
- Even so, the DB2 pureScale members themselves are in charge of disk access.
So what’s going on here is not an Exadata-like split between database server and storage processing tiers. The Global Cluster Facility also handles lock management, presumably because locking issues only arise when a page gets fetched into the buffer.
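As a purely conceptual toy (a sketch with invented names, not IBM’s implementation), the centralized model can be pictured like this: members ask a single caching/locking facility for pages and row locks instead of negotiating ownership among themselves, while still doing their own disk I/O.

```python
# Toy model of a centralized caching/locking facility shared by all cluster
# members. Names and structure are invented for illustration; this is not
# how IBM's Cluster Caching Facility is actually implemented.

class CachingFacility:
    def __init__(self):
        self.global_buffer = {}  # page_id -> latest committed page image
        self.row_locks = {}      # (page_id, row_id) -> owning member

    def get_page(self, page_id, read_from_disk):
        # Members do their own disk I/O; the facility only caches and coordinates.
        if page_id not in self.global_buffer:
            self.global_buffer[page_id] = read_from_disk(page_id)
        return self.global_buffer[page_id]

    def lock_row(self, member, page_id, row_id):
        # Only row-level locks can block work; page access itself is not blocked.
        owner = self.row_locks.setdefault((page_id, row_id), member)
        return owner == member

    def locks_held_by(self, member):
        # On a member failure, only these rows stay blocked until recovery.
        return [rid for rid, owner in self.row_locks.items() if owner == member]

cf = CachingFacility()
print(cf.get_page(42, read_from_disk=lambda pid: f"page image {pid}"))
print(cf.lock_row("member1", 42, 7))   # True: member1 gets the row lock
print(cf.lock_row("member2", 42, 7))   # False: blocked only at the row level
print(cf.locks_held_by("member1"))     # [(42, 7)]
```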
The other surprise is that every client talks to every member, usually through a connection pool from an app server. Tim Vincent assures me that DB2 connections are so lightweight this isn’t a problem. Clients have load-balancing code on behalf of the members, and route transactions to whichever pureScale member is least busy.
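In spirit, client-side routing of that kind amounts to something like the following sketch. This is a conceptual illustration only; the real DB2 client’s load-balancing code and server feedback mechanism are not shown here, and the names are invented.

```python
# Minimal sketch of "route each transaction to the least busy member",
# as described above. Conceptual only: not the DB2 client's real
# load-balancing algorithm, API, or server feedback mechanism.
import random

class MemberRouter:
    def __init__(self, members):
        self.load = {m: 0 for m in members}   # member -> in-flight work count

    def pick_member(self):
        least = min(self.load.values())
        candidates = [m for m, n in self.load.items() if n == least]
        return random.choice(candidates)      # break ties arbitrarily

    def run_transaction(self, work):
        member = self.pick_member()
        self.load[member] += 1                # stand-in for real load feedback
        try:
            return work(member)               # open a lightweight connection, run, commit
        finally:
            self.load[member] -= 1

router = MemberRouter(["member1", "member2", "member3"])
print(router.run_transaction(lambda m: f"transaction ran on {m}"))
```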
DB2 pureScale is designed to be pretty robust against outages:
- In the case of planned maintenance, a pureScale member can be “quiesced.” I.e., it stops being given new work; it finishes up its existing work; then maintenance happens; then the member starts being given work again.
- In the case of an unplanned outage, the redo log naturally comes into play. The pureScale twist on this is that a second small instance of DB2 is around — or is started up? — just to handle the redos.
Also, IBM believes that the DB2 pureScale locking strategy gives availability and performance advantages vs. the Oracle RAC (Real Application Cluster) approach. The distinction IBM draws is that any member can take over the lock on a buffer page from any other member, just by attempting to change the page — and the attempt will succeed; only row-level locks can ever block work. Thus, if a node fails, I/O can merrily proceed on other nodes, without waiting for any recovery effort. IBM’s target is <20 seconds for full row availability to be restored.
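A rough sketch of how those last two points fit together, with invented structure rather than DB2’s actual log format or recovery code: surviving members keep working while a lightweight recovery instance replays only the failed member’s redo log.

```python
# Toy sketch of per-member redo logs plus a lightweight replay after a member
# failure. The structure is invented for illustration; it is not DB2's log
# format or recovery code.
from collections import defaultdict

logs = defaultdict(list)   # member -> list of (page_id, value) redo records
db = {}                    # stand-in for the shared database on disk

def logged_update(member, page_id, value):
    logs[member].append((page_id, value))   # write redo before applying
    db[page_id] = value

def light_restart(failed_member):
    # Replay only the failed member's redo log; other members never stop.
    for page_id, value in logs[failed_member]:
        db[page_id] = value                  # idempotent redo apply

logged_update("member1", 1, "a")
logged_update("member2", 2, "b")
del db[1]                   # simulate member1's change being lost in the crash
light_restart("member1")    # restores page 1 without touching member2's work
print(db)                   # {2: 'b', 1: 'a'}
```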
Obviously, it’s crucial that the Global Cluster Facility machines be fully mirrored, with no double failure — but so what? Modern computing systems have double-points-of-failure all over the place.
Comments
Always nice to see real scale-out solutions that are not optimized for the shared-almost-nothing TPC-C test (with multiple warehouses). It would be interesting to see some absolute numbers. As much as I admire RDMA over InfiniBand, it is still a microsecond of latency. As always with non-partitioned database scale-out, in order for it to be impressive, it should produce impressive absolute numbers. It would be nice to see some TPC-E benchmarks or a single-warehouse TPC-C benchmark.
Hi,
With regard to this…
“The pureScale twist on this is that a second small instance of DB2 is around — or is started up? — just to handle the redos.”
This is called a restart light and can be run on the member where the outage occurred (if it is still running) or another member. There are a number of idle db2 processes (I believe 3) running on each member for this purpose. They use very little resource in their idle state. They will grab some extra resource as required when called into action to perform the restart light.
Thanks, Ciaran! That sort of vendor clarification is very helpful.
I deleted your dupe comment, as I imagine you would have wanted. Thus emboldened, I also fixed a typo in your quote of me. 😉
The one point prefaced with “IBM believes …” isn’t really different. Just compare “… any member can take over the lock on a buffer page from any other member, just by attempting to change the page — and the attempt will succeed” with the Oracle RAC documentation: “When a consistent block is needed or a changed block is required on another instance, Cache Fusion transfers the block image directly between the affected instances.”
So it seems that IBM now owns technology which Oracle has been maturing for 10 years.
Good afternoon Curt
Was good to catch up with you again. Let me make a few minor comments to your post and the comments.
1) The actual name of the Global Cluster Facility is Cluster Caching Facility. CCF handles global buffer coherency for dirty pages and lock requests. It is not a requirement to run it on separate machines; you can run it in a separate virtual image (LPAR), just as you can a member.
2) Ciaran’s comment is correct. This very lightweight process only comes into play during a light restart. The restart replays the transactions that were in flight at the time of the failure on one of the members.
3) The pureScale technology leverages the architecture and design of the DB2-z sysplex which was first introduced in the early 90s. So IBM has owned and matured this technology for almost 20 years.
One other comment in reply to Gennadiy. There are several benefits in the pureScale model around caching coherency; I have quickly listed a few:
1) If multiple updates on different members are accessing the same page, pureScale allows scanners to proceed without actually taking an X lock on the page.
2) RDMA allows very fast invalidation of a page in one member when that page is updated (committed) by another member. It directly changes a bit on the page without CPU cycles or interrupts (context switching) on that member; the member only pulls a new version of the page if it is accessed again. Again, a very low-latency, CPU-efficient mechanism (see the sketch after this list).
3) When one member fails, there is no impact on transactions at other members unless they are reading data that was locked (in flight) on the failed member.
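As a toy illustration of that invalidation-bit idea (a sketch with invented names; the real mechanism flips the bit remotely over RDMA, which plain Python obviously cannot show):

```python
# Toy model of the invalidation bit from point 2 above. In pureScale the bit
# is flipped remotely over RDMA with no interrupt on the target member; here
# an ordinary attribute write stands in for that remote write.

class LocalBufferPage:
    def __init__(self, page_id, data):
        self.page_id = page_id
        self.data = data
        self.valid = True            # the bit a remote commit would clear

def remote_commit_invalidates(page):
    page.valid = False               # stand-in for the RDMA write

def read_page(page, fetch_latest):
    # The member refetches only if its cached copy has been marked invalid.
    if not page.valid:
        page.data = fetch_latest(page.page_id)
        page.valid = True
    return page.data

page = LocalBufferPage(7, "old image")
remote_commit_invalidates(page)                   # another member commits
print(read_page(page, lambda pid: "new image"))   # -> new image
```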
Hi Tim, I should admit my limited knowledge of IBM technologies, because my world is Oracle-centric :).
But nevertheless, reading the article above I don’t see really different technologies; just different definitions.
For 1) on IBM: Oracle’s exclusive lock on a buffer block (“buffer page” in IBM terms) isn’t really blocking. The lock is held only during the block update operation itself (which is very fast) – until a copy of the block is possible.
For 2) on IBM: sending traffic without CPU cycles is an InfiniBand (or OFED-capable 10 Gbit Ethernet) property, which is supported by Oracle as well.
For 3) on IBM: it’s just the same with Oracle: “… the failure of one or more nodes does not affect other nodes in the cluster”.
With the new pureScale monitoring features in Speedgain for DB2 LUW it is possible to monitor
– the complete cluster,
– all Cluster Management Resources, and
– individual members.
Because of this, we can see trends in the scalability of DB2 pureScale similar to the ones mentioned above.
“(The point of InfiniBand and other RDMA is that moving data doesn’t require interrupts, and hence doesn’t cost many CPU cycles.)”
Curt,
A correction if you’ll allow. Infiniband RDMA transfers still cost a CPU interrupt. The benefit of something like RDS over Infiniband is the zero-copy nature of the transfer. In fact, HCA interrupt distribution is one of the main improvements in OFED 1.5.1. Prior to this release, products like Exadata with RDS suffered from non-SMP interrupt handling. It was for that reason that, prior to OFED 1.5.1, Exadata V2 and X2 were both limited to no more than 2.5 GB/s over a single HCA. Once the data flow neared that 2.5 GB/s on Nehalem/Westmere EP (and Nehalem EX in the X2-8 based systems), the single processor core handling the interrupts would go critical. So the aggregate throughput between any number of cells and the entirety of a full-rack V2 or X2 Exadata Database Machine was 20 GB/s. With OFED 1.5.1, however, the interrupt distribution for the IB HCAs has been fixed, and thus each Oracle host in the Database Machine can move 3.2 GB/s, for an aggregate of 25.6 GB/s of data-flow bandwidth between the hosts and the cells.
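For readers keeping score, those aggregates follow from the per-host limits multiplied by the eight database hosts of a full rack; the host count is inferred from the quoted numbers rather than stated explicitly above.

```python
# Quick arithmetic behind the aggregate figures quoted above, assuming the
# eight database hosts of a full-rack Database Machine (the host count is
# inferred from the quoted numbers, not stated in the comment).
hosts = 8
per_host_pre_ofed_151 = 2.5    # GB/s, one core handling all HCA interrupts
per_host_post_ofed_151 = 3.2   # GB/s, with interrupt handling distributed

print(hosts * per_host_pre_ofed_151)    # 20.0 (GB/s aggregate)
print(hosts * per_host_post_ofed_151)   # 25.6 (GB/s aggregate)
```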
I explained these important Exadata data flow dynamics in an IOUG webcast that can be found here:
http://kevinclosson.wordpress.com/written-works-and-presentations/
Kevin,
Thanks for the contribution! But would you mind unpacking a few of those acronyms? 🙂
And congrats on the EMC/Greenplum gig! I guess you’re just going to keep going to companies that don’t want to talk with me very much. 🙂
There is some misunderstanding in some of the comments here about the key differences between the pureScale / Z-Sysplex architecture and RAC.
First Gennadiy on your points:
On 2: There is a large difference between how pureScale / Z-Sysplex use RDMA and how Oracle currently does. RDMA used via RDS – which is how it is done in RAC – is VERY different from RDMA done via uDAPL in pureScale. Both are zero-copy, but one does not generate interrupts, as it is a polling mechanism. pureScale does request/response to/from member/CF with interrupt-free RDMA via uDAPL – not via RDS as RAC does. RDS is a convenient programming model if the software is already using sockets-like interfaces for inter-node communication – which is, I would guess, why Oracle went this route.
On 3: Not quite true. Since the lock space is owned only by the CF, a failure of a DB2 member does not require a redistribution of lock or page ownership to the surviving members, as it does in the case of RAC. On RAC, because it is a distributed locking model, the loss of a node requires rediscovery of what locks were granted from the failed node’s lock space and a redistribution / reconstruction of the lock space. This requires I/O and time, and transactions across the cluster cannot be performed until it is complete – ON ANY NODE. That is not the case in pureScale / Z-Sysplex, which uses a centralized model with the CF as the lock master and the locks duplexed in the secondary.
Aamer,
I was specifically addressing Curt’s point about Infiniband being interrupt free–full stop.
I knew that pureScale used uDAPL but, while I’m more than willing to accept the idea that the uDAPL Direct Access Transport (DAT) services have been implemented on zOS free of interrupts, I’d like to point out that a) uDAPL is not even an Infiniband-specific technology, b) there is nothing about uDAPL that specifies the DAT provider underpinnings (HW/SW/FW) have to offer interrupt-free data transfer, and c) there are a lot of Infiniband protocol/HCA/driver combinations that work together with a significant interrupt payload, as is the case with RDS. But let’s not forget that I pointed out this stuff was fixed in OFED 1.5.1, so while there are still interrupts, at least they are now about as SMP-efficient as every single device driver Sequent Computer Systems ever produced dating back to the late 80s. Fast forward to the past, as it were.
So, since Curt’s post is about pureScale, and pureScale is available on AIX, System x and zOS, there is a big question whether or not pureScale, as implemented on these widely varying platforms, is indeed interrupt-free. If it is, that’s cool – I’d be surprised. However, at the risk of boring Curt’s readers to death, I’ll reiterate that any statement about Infiniband-related interrupt handling should be tailored to the protocol and the implementation of said protocol.
If you want to see IB HCAs smoking CPUs with interrupts, you can use the RDS libskgxp for the Oracle Real Application Clusters interconnect or the Exadata iDB protocol (which is actually the same library used for the inter-node RAC communications).
P.S., Curt, sorry for the alphabet soup and, yes, I’m very excited about what we are doing over here at EMC.
Kevin – pureScale is indeed implemented to be interrupt free for the most performance sensitive / frequent operations in the database (e.g. getting an uncontested lock) across all the platforms.
We simply use uDAPL because it is a convenient interface that does not tie us to the transport.
In fact, on Linux, you can run pureScale over 10GbE provided you use Mellanox ConnectX-2 adapters; they also do native interrupt-free RDMA the same way we do it with IB.
The design was basically, as I call it, “shamelessly stolen” from the zOS design on the Sysplex.
Without a doubt, as you point out, in a programming model which requires interrupts, having them spread across all the processors of the SMP is going to be very worthwhile when they are being received. They continue to be painful, as they cause context switches, cache and TLB invalidations, etc., which we don’t have to deal with when programming with a non-interrupt, zOS-like model.
>The design was basically, as I call it, “shamelessly stolen” from the zOS design on the Sysplex.
Could you please provide more details about the design? Does it use a “polling loop” as Sysplex does?
However, unlike z/OS, which is interrupt-driven and therefore gives up a shared engine when it has no work to process, Coupling Facility Control Code (CFCC) runs in a polling loop, constantly looking to see if new work has arrived.
http://www.redbooks.ibm.com/abstracts/TIPS0237.html
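To illustrate the distinction in the abstract (a toy sketch, not CFCC or pureScale code): an interrupt-driven receiver blocks and is woken by the operating system, while a polling receiver dedicates a core to spinning on a location that the remote side writes, e.g. via RDMA.

```python
# Toy contrast between interrupt-style (blocking) and polling-style receive
# loops, as discussed above. Purely illustrative: real CFCC / pureScale code
# polls memory written directly by RDMA hardware, not a Python queue.
import queue
import threading

requests = queue.Queue()

def interrupt_style_receive():
    # Blocks until woken, analogous to sleeping until an interrupt arrives,
    # at the cost of a wake-up and context switch per request.
    item = requests.get()
    return f"handled {item} after being woken"

def polling_style_receive():
    # Spins checking for work: no wake-up latency or context switch,
    # but the polling core is never given up.
    while True:
        try:
            item = requests.get_nowait()
            return f"handled {item} by polling"
        except queue.Empty:
            pass   # a real poller would spin on a memory location

threading.Timer(0.01, requests.put, args=["lock request"]).start()
print(polling_style_receive())
```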