Vertica’s innovative architecture for flash, plus more about temp space than you perhaps wanted to know
Vertica is announcing:
- Technology it has already released,* but has not yet published any reference architectures for.
- A Barney partnership.**
In other words, Vertica has succumbed to the common delusion that it’s a good idea to put out half-baked press releases the week of TDWI conferences. But if we look past that kind of all-too-common nonsense, Vertica is highlighting an interesting technical story, about how the analytic DBMS industry can exploit solid-state memory technology.
*Upgrades to Vertica FlexStore to handle flash memory, actually released as part of Vertica 4.0
** With Fusion I/O
To set the context, let’s recall a few points I’ve noted in the past:
- Solid-state memory’s price/throughput tradeoffs obviously make it the future of database storage.
- The flash future is coming soon, in part because flash’s propensity to wear out is overstated. This is especially true in the case of modern analytic DBMS, which tend to write to blocks all at once, and most particularly the case for append-only systems such as Vertica.
- Being able to intelligently split databases among various cost tiers of storage – e.g. flash and disk – makes a whole lot of sense.
Taken together, those points tell us:
For optimal price/performance, analytic DBMS should support databases that run part on flash, part on disk.
While all this is still in the future for some other analytic DBMS vendors, Vertica is shipping it today.* What’s more, three aspects of Vertica’s architecture make it particularly well-suited for hybrid flash/disk storage, in each case for a similar reason – you can get most of the performance benefit of all-flash for a relatively low actual investment in flash chips:
- Vertica lets you split tables by column, and Vertica FlexStore is versatile enough to let you put only the most-used columns in flash. (Vertica cites a figure that 85% of usage touches only 15% of columns, but I don’t know how rigorously grounded those numbers are; a back-of-the-envelope sizing sketch follows this list.)
- To the extent that Vertica compresses data more tightly than many of its competitors do (which it probably does, debates over the magnitude of Vertica’s advantage notwithstanding), the storage-hardware cost of putting any given data in flash is lower with Vertica than with other systems.
- Vertica has relatively less need for temp space than some other systems. (Vertica uses figures of <20% of total storage, vs. 30%+ for some other systems.) If you want to use flash for temp space, so as to accelerate your toughest queries, that can save you some cash …
- … and by the way, temp space is an especially good use of flash, because temp space is accessed in a less sequential manner than data storage is.
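To make that arithmetic concrete, here’s a minimal back-of-the-envelope sketch in Python. The 15% hot-column figure and the <20% temp-space figure are the ones cited above; the 10 TB database size is purely illustrative, and treating “15% of columns” as roughly 15% of bytes is my assumption, not Vertica’s.

```python
# Back-of-the-envelope sizing for a hybrid flash/disk layout.
# The 0.15 hot-column ratio and 0.20 temp-space ratio are the figures
# cited above; the 10 TB database size is purely illustrative, and
# treating "15% of columns" as roughly 15% of bytes is my assumption.

def flash_footprint(total_compressed_tb, hot_column_fraction=0.15, temp_fraction=0.20):
    """Estimate how much flash covers the hot columns plus temp space."""
    hot_tb = total_compressed_tb * hot_column_fraction
    temp_tb = total_compressed_tb * temp_fraction
    return {
        "hot_columns_tb": hot_tb,
        "temp_space_tb": temp_tb,
        "flash_needed_tb": hot_tb + temp_tb,
        "share_of_all_flash_config": (hot_tb + temp_tb)
                                     / (total_compressed_tb * (1 + temp_fraction)),
    }

if __name__ == "__main__":
    for key, value in flash_footprint(10.0).items():
        print(f"{key}: {value:.2f}")
    # Roughly 3.5 TB of flash covers the most-used columns plus temp space,
    # i.e. about 29% of what an all-flash configuration would need.
```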
The least obvious of those points are about temp space; I only understood the particulars when Vertica development chief Shilpa Lawande explained them to me Thursday.
* At least in theory; customer adoption may be a different matter.
But before drilling down on temp space, let me first note that there’s one offsetting factor to all those “We need somewhat less flash than the other guys” Vertica advantages. Like any serious DBMS, Vertica keeps two or more copies of all data, so that storage has no single point of failure. In a flexible system like Vertica, you can put one copy on flash and one on disk. But if you do that in Vertica, you forgo fully exploiting one possible benefit of Vertica’s architecture – the ability to store different copies of a column in different sort orders, each beneficial for accelerating a different group of queries.*
*More precisely, you don’t get the full benefits of flash acceleration for every query touching those columns.
OK. Back to temp space. There are four kinds of things you can put in storage if you’re running a database management system:
- The software itself.
- Persistent data. (I.e., tables, if the DBMS you’re running is relational.)
- Metadata, especially the kind that lets you find data — indexes, zone maps, catalogs, etc.
- Temporary data constructs built as part of, say, a sort-merge join. These, by definition, are what populate temp space.
Just to be clear, those constructs are NOT temporary tables of the sort created by, say, MicroStrategy; such tables are handled like any other data. Rather, they are ephemeral creations and, so far as I can tell, not tables at all.
Vertica offered two theories as to why its DBMS requires less temp space than competitors do:
- To the extent a DBMS decompresses data before operating on it in memory, that decompression inflates temp space too. Vertica prides itself on keeping data compressed all the way through query execution, and smaller temp space allocations seem to be one resulting benefit. (A toy sketch of the effect appears below.)
- Since Vertica can store columns in expedient sort orders, it does less sorting overall, and sorting is a big use of temp space.
Obviously, the amount of temp space you need is workload-dependent, no matter which DBMS you use. Even so, Vertica’s claim to something of an advantage seems legit.
Truth be told, I’m not convinced the savings involved are great enough to matter a whole lot – but it’s a fun subject to think through. 🙂
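In that spirit, here’s a toy illustration of the compression point (my example, not Vertica’s code): if a sorted, low-cardinality column stays run-length encoded when it spills, the temp space it consumes is a tiny fraction of the decompressed size.

```python
# Toy illustration (mine, not Vertica's code): a sorted, low-cardinality
# column spilled in run-length-encoded form takes far less temp space
# than the same column spilled as raw decompressed values.
from itertools import groupby

def rle_encode(sorted_values):
    """Collapse a sorted column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(sorted_values)]

def spilled_bytes_raw(values, width=8):
    return len(values) * width            # one fixed-width cell per row

def spilled_bytes_rle(runs, width=8):
    return len(runs) * (width + 8)        # value plus a 64-bit run length

if __name__ == "__main__":
    # 10 million rows of a 50-value "region code" column, already sorted.
    column = [i // 200_000 for i in range(10_000_000)]
    runs = rle_encode(column)
    print("raw spill:", spilled_bytes_raw(column), "bytes")   # 80,000,000
    print("RLE spill:", spilled_bytes_rle(runs), "bytes")     # 800
```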
And finally: One of my biggest surprises since starting to look at analytic-DBMS-on-flash has been the centrality of temp space. Talking to Vertica Thursday, I finally uncovered a key reason why: temp space tends to be accessed via multiple streams of data at once. I’m still struggling with WHY that is true; two suggested reasons are:
- Temp space can be accessed by multiple operations at once. (But isn’t that also true of the rest of storage?)
- Merge sorts, a common use of temp space, read multiple streams of data. (Couldn’t you tweak your software to make that not be true?)
But if we grant that temp space naturally is accessed in multiple places at once – well, that’s a lot like random I/O, and if you’re doing a lot of random reads, you’d love to use something other than spinning disk.
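For anyone who wants the merge-sort point spelled out, here’s a minimal external sort sketch in Python (generic textbook stuff, not Vertica’s implementation). The thing to notice is that the merge phase holds one open stream per run file, which is exactly the multi-stream access pattern at issue.

```python
# Minimal external merge sort sketch (a generic textbook version, not
# Vertica's implementation). The merge phase reads every run file at once,
# which is the "multiple streams at once" pattern that makes temp-space
# I/O look more random than a plain table scan.
import heapq
import os
import tempfile

def external_sort(values, run_size=100_000):
    """Sort `values` via temp-space run files plus a k-way merge."""
    tmpdir = tempfile.mkdtemp(prefix="tempspace_")
    run_paths = []

    # Phase 1: write sorted runs to temp space, one sequential stream at a time.
    for start in range(0, len(values), run_size):
        run = sorted(values[start:start + run_size])
        path = os.path.join(tmpdir, f"run_{len(run_paths)}.txt")
        with open(path, "w") as f:
            f.writelines(f"{v}\n" for v in run)
        run_paths.append(path)

    # Phase 2: k-way merge -- every run file is read back concurrently.
    streams = [open(p) for p in run_paths]
    try:
        yield from heapq.merge(*((int(line) for line in s) for s in streams))
    finally:
        for s in streams:
            s.close()
        for p in run_paths:
            os.remove(p)
        os.rmdir(tmpdir)

if __name__ == "__main__":
    import random
    data = [random.randrange(10**9) for _ in range(500_000)]
    assert list(external_sort(data)) == sorted(data)
```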
Comments
Temp space can get a lot of sequential IO from external hash join and external sort. There will be multiple streams (one per hash partition, one per sort run) but the access within each stream is sequential.
There can be concurrent IO requests. Fast implementations of external hash join and external sort use async IO for write behind and read ahead. The goal is to overlap CPU and IO assuming parallel query or concurrent queries are not otherwise using those resources.
As Vertica allows tables to be stored in many interesting orders, they are probably better at avoiding the need to do large sorts for ORDER BY and might handle many aggregation requests by doing an index scan. Someone who knows something about Vertica might be able to answer whether it has features to avoid the need to do large joins via external hash join.
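To illustrate the read-ahead idea described in the comment above, here’s a generic sketch (mine, not any vendor’s implementation) of a background thread prefetching the next chunks of a run file while the main thread works on the current one, overlapping CPU and I/O.

```python
# Sketch of read-ahead for one sequential stream (a generic illustration,
# not any vendor's implementation): a background thread prefetches the
# next chunks while the main thread processes the current one.
import queue
import tempfile
import threading

CHUNK_BYTES = 1 << 20   # 1 MiB read units

def prefetching_reader(path, depth=2):
    """Yield a file's chunks, with up to `depth` chunks read ahead."""
    chunks = queue.Queue(maxsize=depth)

    def reader():
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_BYTES)
                chunks.put(chunk)       # blocks once `depth` chunks are queued
                if not chunk:           # empty bytes object signals EOF
                    return

    threading.Thread(target=reader, daemon=True).start()
    while True:
        chunk = chunks.get()
        if not chunk:
            break
        yield chunk                     # CPU-side work happens here while the
                                        # reader thread refills the queue

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * (8 * CHUNK_BYTES))          # 8 MiB demo "run file"
    total = sum(len(c) for c in prefetching_reader(f.name))
    print(total, "bytes processed with read-ahead")
```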
Mark,
That first part is what I didn’t fully understand. If there are concurrent streams that are each sequential, why does that feel more random to Vertica than if there are multiple concurrent streams reading tables?
But I was talking with the right people, and they seemed pretty sure about it.
Best,
CAM
There are several points I’d like to clarify:
1) Why does Vertica need less temporary space compared to other databases? Because: a) as Mark mentions in his comment, Vertica has a number of optimizations that take advantage of the sortedness of data to avoid expensive sorts at runtime; b) we keep user data compressed and encoded through query execution, and hence operations that might otherwise need an externalizing join or sort can be done in-memory; c) we encode and compress intermediate data produced by externalizing operators.
2) Why does flash improve performance of operations like sorts that access temporary space? A couple of reasons: a) Fusion-io cards have much higher bandwidth and lower latency than regular magnetic disks, so there is less of a disk bottleneck for I/O-intensive operations to begin with; b) such operations involve writing to temporary storage and reading it back in multiple streams concurrently. You want to have as many simultaneous merge streams as memory allows, to cut down the number of passes over the data. This I/O pattern looks more like random I/O to the disk than, say, a sequential table scan does, and hence benefits from the use of flash.
3) As concurrency increases, benefits of using flash become even more pronounced.
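One way to picture point 2b above: each merge stream is sequential on its own, but interleaving several of them produces a request sequence that jumps around the device. A toy illustration (mine, not from the thread):

```python
# Toy illustration (mine, not from the thread above): each merge stream is
# perfectly sequential on its own, but interleaving eight of them produces
# a block-address sequence that jumps all over the device, which is why it
# behaves more like random I/O than a single table scan does.
STREAMS = 8
BLOCKS_PER_STREAM = 4
STREAM_START_GAP = 1000     # assumed distance between stream start addresses

streams = [[s * STREAM_START_GAP + b for b in range(BLOCKS_PER_STREAM)]
           for s in range(STREAMS)]

# Round-robin over the streams, the way a k-way merge consumes its inputs.
interleaved = [streams[s][b] for b in range(BLOCKS_PER_STREAM)
                             for s in range(STREAMS)]

print(interleaved[:12])
# [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 1, 1001, 2001, 3001]
# Consecutive requests land ~1000 blocks apart, so a spinning disk seeks
# constantly between streams, while flash barely notices.
```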
I’m not clear whether the Vertica announcements about flash relate directly to Fusion-IO or whether it’s “here’s a product that will make use of this feature”.
That said, there is a *huge* gotcha with Fusion-IO drive performance – you only get maximum throughput when the drive is nearly empty (<20% utilised). This is based on a posting on the (very reliable) MySQL Performance blog. http://www.mysqlperformanceblog.com/2010/07/17/ssd-free-space-and-write-performance/
Here's the money quote: "It is clear the maximal throughput strongly depends on available free space. With 100GiB utilization we have 933.60 MiB/sec, with 150GiB (half of capacity) 613.48 MiB/sec and with 200GiB it drops to 354.37 MiB/sec, which is 2.6x times less comparing with 100GiB."
I would be very interested to know: a) Have Vertica tested these cards in this way, and do they have any comment? b) Is this Fusion-IO specific, or will we see a similar pattern for other flash memory implementations (eg Exadata)?
I'm sure the irony of a "next generation" technology reverting back to the "add more spindles" paradigm will not be lost on DBMS2 readers. Given the cost of these cards I (personally) would be tempted to throw either more cheap SSDs or more RAM at the problem.
Much has been made of how flash devices (FusionIO not being an exception) suffer a performance hit as they get full. The basic problem is that the write unit of disk, 512 byte sectors, or 4K file system clusters, or 64K database blocks, or whatever, is dramatically smaller than the smallest flash block erase size. This means that tiny writes may cause a whole flash block to be reorganized, so you pay for a much larger write than was issued. But you probably knew that already.
The issue isn’t helped by the industry practice of disabling any grooming algorithms until the device gets full the first time. This delivers a great first impression, but then the algorithm has to play “catch up”, causing a big cliff:
http://www.ssdperformanceblog.com/2010/07/on-benchmarks-on-ssd/
If we believe the cooler heads will prevail, we will look beyond this, and focus on the steady state performance of the device.
As a digression, I feel like mentioning that variable performance isn’t new. Go measure the inner and outer tracks of your spinning disk, and you may find that the outer tracks deliver data faster, as more bits fly past the head at the same RPM, and furthermore there are fewer track seeks because the tracks are longer. See what happens after file system fragmentation sets in. Etc.
But the key point we at Vertica are making here is that certain I/O patterns are easier for a flash device to accommodate than others. Small, random writes on a full disk tend to suffer the worst degradation, as odds are high that there will be write amplification. But workloads that write large blocks, don’t do updates, and delete files (taking advantage of “trim” commands) are very flash-friendly. To illustrate, I wrote a program that does concurrent streams of 1MB writes, then reads the streams back, then deletes the files. This is pretty much the I/O pattern you’d expect from a database temp file: write once with substantial buffering, read once, then delete. This is run against a single 320GB card (320GB nominal, 280GB available), even though the machine has 2 of them. The card was broken in, so that this is representative of “steady state” performance. I am reporting the range of throughput seen over 3 runs to give a sense of deviation:
1. With an empty disk, 368-412MB/s
2. With 100GB on the card, 373-400MB/s
3. With 200GB on the card, 396-418MB/s (had to use a smaller file for this test, so this may not be 100% comparable)
So we see that the worst case degradation pattern is not encountered in this workload. We reach the same conclusion when FusionIO cards are presented with the Vertica workload, too, which consists of temp files, and column files (which are never updated in place).
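For readers who want to try something similar on their own hardware, here’s a rough sketch in the spirit of the test described above (my approximation, not Vertica’s actual program); the target directory and the sizes are assumptions you would adjust for your own device.

```python
# Rough sketch in the spirit of the test described above (my own
# approximation, not Vertica's program): several concurrent streams each
# write 1 MiB blocks, read them back, then delete the files. FLASH_DIR is
# a hypothetical environment variable; point it at the device under test.
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

TARGET_DIR = os.environ.get("FLASH_DIR", tempfile.gettempdir())
STREAMS = 8
BLOCKS_PER_STREAM = 256                # 256 x 1 MiB = 256 MiB per stream
BLOCK = b"\0" * (1 << 20)              # 1 MiB write unit

def one_stream(i):
    path = os.path.join(TARGET_DIR, f"stream_{i}.tmp")
    with open(path, "wb") as f:        # write once, in large blocks
        for _ in range(BLOCKS_PER_STREAM):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:        # read once
        while f.read(1 << 20):
            pass
    os.remove(path)                    # delete; with a discard-enabled file
                                       # system this lets the device trim
    return BLOCKS_PER_STREAM * len(BLOCK) * 2   # bytes written plus read

if __name__ == "__main__":
    start = time.time()
    with ThreadPoolExecutor(max_workers=STREAMS) as pool:
        total = sum(pool.map(one_stream, range(STREAMS)))
    print(f"{total / (time.time() - start) / 2**20:.0f} MiB/s aggregate read+write")
```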
How much does it cost to buy spinning disks that can match a 320 GB MLC flash card from Fusion IO? The Dell list price for that card is $7000 and the peak throughput is 500 MB/sec. How many disks are required to match that when doing a small number of large reads and writes, for example 10 streams each doing 1MB or larger reads and writes?
The short answer is that using 15K RPM SFF SAS disks, our 2U servers don’t hold quite enough to get that throughput number on local storage.
Testing 7 SFF SAS drives (the box holds 8, but 1 held the OS), with 1MB reads, RAID 0 with 1MB blocks, I got a bit less than the 400MB/s sustained read+write from the flash card on the same test workload.
Box will hold at least 2 of the flash cards, but drive bays are maxed out.
That said, Dell list price of $7K doesn’t seem to offer much better price/performance (of the storage subsystem considered by itself) on this workload. 8x 15K SFF SAS aren’t exactly cheap to buy, and if we need more rack space+power+cooling then that costs something too, so the price/performance is probably similar. But partners probably don’t pay Dell list price, and if you look at total system price/performance things look a good bit better.
$7000 is a relevant amount of money, but comparing it to the raw cost of the number of disks providing the same performance for this specific usage pattern may be a bit misleading: first, it’s necessary to add in the cost of the storage enclosure, the floor space, and the power and cooling needed to keep the disks spinning in order to calculate the TCO; second, the target prospects are likely not using the device with the pattern in the example, hence requiring even more disks.
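Pulling the thread’s numbers together, the back-of-the-envelope version of Mark’s question looks something like this; only the $7,000 list price and ~500 MB/sec peak come from the comments above, while the per-disk throughput and per-disk price are my assumptions.

```python
# Back-of-the-envelope version of the question above. Only the $7,000 Dell
# list price and ~500 MB/sec peak come from the comments; the per-disk
# throughput and per-disk price are my assumptions for illustration.
CARD_PRICE = 7000      # USD, Dell list price quoted above
CARD_MBPS = 500        # peak MB/sec quoted above

DISK_MBPS = 75         # assumed sustained MB/sec per 15K SFF SAS disk on
                       # large multi-stream reads/writes
DISK_PRICE = 300       # assumed price per disk, USD

disks_needed = -(-CARD_MBPS // DISK_MBPS)            # ceiling division
disk_cost = disks_needed * DISK_PRICE
print(f"disks to match the card: {disks_needed}")    # 7
print(f"raw disk cost: ${disk_cost}")                # $2,100
print(f"flash premium vs bare disks: {CARD_PRICE / disk_cost:.1f}x")   # ~3.3x
```

Which is roughly where the thread lands: the raw storage gap is real but not dramatic once enclosures, power, cooling, and total system price are counted.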