Disk, flash, and RAM
Three months ago, I pointed out that it is hard to generalize about memory-centric database management, because there are so many different kinds. That said, there are some basic points that I’d like to record as background for any future discussion of the subject, focusing on differences between disk and RAM. And while I’m at it, I’ll throw in a few comments about flash memory as well.
This post would probably be better if I had actual numbers for the speeds of various kinds of silicon operations, but I’ll do what I can without them.
For most purposes, database speed is a function of a few kinds of numbers:
- CPU cycles consumed.
- I/O throughput.
- I/O wait time.
- Network throughput.
- Network wait time.
The amount of storage used is also important, both directly — storage hardware costs money — and because if you save storage via compression, you may get corresponding benefits in I/O. Power consumption and similar costs are usually tied to hardware efficiency; the less gear you use, the less floor space and cooling you may be able to get away with.
When databases move to RAM from spinning disk, major consequences include:
- I/O wait time is reduced by many orders of magnitude.
- I/O throughput is much greater too, but not to the same extent.
- Storage equipment is much more expensive per byte (RAM vs. disk).
because:
- There’s a minimum average wait time before you can read data from a specific place on a disk. At 15,000 RPM or less, average rotational latency alone is at least 2 milliseconds, even if the heads could move along the radius at infinite speed. In practice, the best figures are usually in the high single-digit milliseconds.
- Sequential disk access is much faster than random. Disks can stream back over 100 megabytes/second. But as the previous point implies, they max out on the order of 100 random reads/second. So it’s hard to approach maximum throughput unless the average read brings back a megabyte or more (the sketch after this list works through the arithmetic).
- These facts apply to writes as well as to reads.
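To put rough numbers on those points, here's a back-of-envelope sketch; the seek time and transfer rate are assumed round figures, not measurements of any particular drive:

```python
# Rough model of a 15,000 RPM drive; seek time and streaming rate are
# assumed round numbers, not specs for any particular product.
RPM = 15_000
rotational_latency_ms = 0.5 * 60_000 / RPM    # average half rotation = 2.0 ms
seek_ms = 6.0                                 # assumed average seek
random_read_overhead_ms = seek_ms + rotational_latency_ms   # ~8 ms
sequential_mb_per_sec = 100                   # assumed streaming rate

def effective_mb_per_sec(read_kb):
    """Throughput when every read pays the random-access overhead."""
    transfer_ms = (read_kb / 1024) / sequential_mb_per_sec * 1000
    return (read_kb / 1024) / ((random_read_overhead_ms + transfer_ms) / 1000)

print(f"{1000 / random_read_overhead_ms:.0f} random reads/second")   # ~125
for kb in (8, 1024, 10 * 1024):
    print(f"{kb:>6} KB/read -> {effective_mb_per_sec(kb):6.1f} MB/sec")
# 8 KB reads deliver under 1 MB/sec; only megabyte-plus reads get anywhere
# near the drive's ~100 MB/sec streaming rate.
```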
Consequently:
- The advantage of sequential over random I/O is vastly reduced in RAM (it’s never quite eliminated, but it’s a much smaller consideration).
- Things get interesting for data compression as you move to RAM from disk:
- One classic benefit, saving I/O, matters much less than it does with disk.
- Another classic benefit, saving storage costs, matters much more (a toy illustration follows this list).
- Compression benefits in the area of network traffic aren’t much affected.
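As a toy illustration of the storage-cost point, here's a sketch using Python's zlib on an invented, highly repetitive column. Real in-memory databases typically favor lighter-weight encodings (dictionary, run-length, etc.) precisely because decompression CPU time matters more once I/O waits are gone:

```python
import zlib

# Invented "column" of a million repetitive status codes.
column = b",".join(str(i % 10).encode() for i in range(1_000_000))
compressed = zlib.compress(column, level=6)

ratio = len(column) / len(compressed)
print(f"raw {len(column)/1e6:.1f} MB -> compressed {len(compressed)/1e6:.2f} MB "
      f"({ratio:.0f}x): the same RAM now holds {ratio:.0f}x as much data")
```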
But notwithstanding everything else, you still need a persistent-storage story. Typically, that’s just your update/transaction log. Hence in-memory write performance is actually gated by the speed at which you can stream your update log to persistent storage — unless, of course, you’re running some kind of event processing/data reduction system and truly are willing to discard most of the data that passes through.
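One standard way to keep the log from gating throughput is group commit: batch many transactions' log records into each synchronous write. A minimal sketch, with invented names and record format:

```python
import os
import queue
import threading

class LogWriter:
    """Minimal group-commit sketch: many writer threads enqueue records;
    one background thread appends and fsyncs a whole batch at a time,
    amortizing each disk sync across every transaction that was waiting."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def append(self, record: bytes):
        done = threading.Event()
        self.q.put((record, done))
        done.wait()                      # block until the record is durable

    def _run(self):
        while True:
            batch = [self.q.get()]       # wait for the first record...
            while not self.q.empty():    # ...then grab everything else waiting
                batch.append(self.q.get())
            os.write(self.fd, b"".join(rec + b"\n" for rec, _ in batch))
            os.fsync(self.fd)            # one sync covers the whole batch
            for _, done in batch:
                done.set()
```

An fsync costs roughly the same whether it covers one record or fifty, so batching can turn a log-gated workload into a CPU-gated one.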
When you have to go to spinning disk, your data access methods are commonly indexes and scans, because those are the approaches that minimize the number of disk reads. But when data lives in RAM, pointer-chasing is a reasonable choice, and directly calculated addresses seem to be used more in memory than they are on disk (a generic sketch contrasting the two styles follows the examples). For example:
- QlikView and Neo4j both rely on direct addressing.
- Neo4j also has a lot of pointer-chasing.
- solidDB relies on walking tries (specifically, Patricia tries).
- Workday chases references among a whole lot of different objects.
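The sketch below contrasts the two in-RAM styles. The products named above each have their own storage formats, so this is only a generic illustration:

```python
# Style 1 -- direct addressing: the key is (or trivially computes) the offset.
ages = [None] * 1_000_000      # slot i holds the record for entity id i
ages[42] = 37
print(ages[42])                # one calculated access; no comparisons, no hops

# Style 2 -- pointer-chasing, as in a graph store:
class Node:
    def __init__(self, value):
        self.value = value
        self.neighbors = []    # direct references to other Node objects

a, b, c = Node("a"), Node("b"), Node("c")
a.neighbors.append(b)
b.neighbors.append(c)

node = a
while node.neighbors:          # each hop is a cheap in-RAM dereference;
    node = node.neighbors[0]   # on spinning disk it could be a ~8 ms read
print(node.value)              # -> "c"
```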
Flash, of course, is another kind of silicon memory — persistent, and slower than RAM. Beyond that:
- You generally attach a lot more flash to one server than you would RAM. This can create bandwidth bottlenecks between the flash and the CPU. If you use PCIe, you could have issues with attaching as much flash as you want. If you use disk controllers instead, as Teradata does, you could have issues with throughput.
- Sequential writes to flash are slow, perhaps even slower than sequential writes to spinning disk.
- Random writes to flash require erasing and rewriting an entire block (a write-amplification sketch follows this list).
- Flash had a bad reputation for the number of times a cell can be written before it wears out. But controller software has done a good job of mitigating the problem, e.g. via wear leveling and error-correcting codes.
- In connection with that, the cheaper but less reliable form of flash — MLC vs. SLC (Multi-/Single-Level Cell) — is becoming more acceptable for enterprise use. For example, Clustrix appliances use MLC.
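A back-of-envelope look at the whole-block-write point; the block and page sizes are assumed typical values, not any particular device's spec:

```python
# Assumed sizes; real devices vary.
erase_block_kb = 256        # flash is erased in large blocks...
page_kb = 4                 # ...but a random update may touch only one page

# Naive in-place update: read the block, erase it, rewrite all of it.
amplification = erase_block_kb / page_kb
print(f"{amplification:.0f}x write amplification for a {page_kb} KB update")

# This is why flash translation layers remap pages, and why flash-friendly
# storage engines turn random updates into large sequential writes.
```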
In theory, all the comments about random vs. sequential, pointers vs. indexes, and so on carry over pretty well from RAM to flash. In practice, however, data access methods used on flash seem to be pretty similar to those on spinning disk. I’m not totally sure why.
Comments
“Sequential writes to flash are slow, perhaps even slower than sequential writes to spinning disk.” – that hasn’t been the case for the last several years.
The best hard drives can write at ~200 MB/sec, while most modern SSDs (SATA/SAS; PCIe models are much faster) write at over 400 MB/sec, and the best ones at over 500 MB/sec.
With RAM, compression may not save you actual dollars in real life because people don’t tend to dynamically grow the amount of RAM on their machines. A realistic setting is that you try to keep the “hot” data in RAM, and you have a backend of archives that are expensive to recover. If you can keep more data in RAM through compression, you get better overall performance.
Good blog, as is the norm around here. It’s important to keep this kind of stuff up to date in one’s head when considering differently optimized database solutions.
This isn’t a disagreement with you at all (more just a bit more detail), but I’d like to point out that random access to RAM still seems to be about an order of magnitude slower than sequential access. So while the sequential/random difference in RAM is much smaller as a percentage of throughput than it is for disk, it is actually a *larger* difference in terms of net throughput! So, is it a smaller or larger difference? It depends on how you count and what you want to do. But optimizing data structures for sequential access certainly remains an important consideration even in RAM-based systems.
My touchstone on this topic is the admittedly somewhat old article “The Pathologies of Big Data” on ACM Queue – http://queue.acm.org/detail.cfm?id=1563874
Cheers,
Ethan
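A quick way to see the effect Ethan describes is the hypothetical microbenchmark below; it requires NumPy, and the size of the gap varies a lot by machine:

```python
import time
import numpy as np

n = 50_000_000                   # ~400 MB of float64, far bigger than cache
a = np.random.rand(n)
sequential = np.arange(n)
shuffled = np.random.permutation(n)

for label, idx in (("sequential", sequential), ("random", shuffled)):
    t0 = time.perf_counter()
    a[idx].sum()                 # gather in the given order, then reduce
    print(f"{label:>10}: {time.perf_counter() - t0:.2f} s")
# The random gather is typically several times slower: pure cache-miss cost,
# since both runs touch exactly the same bytes.
```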
Yes, Igor is right. One place where flash data structures often differ from disk- and memory-based data structures is log-structured storage (e.g., log-structured merge trees instead of B-trees), because of the very large difference between random writes and sequential writes on flash.
Also, as far as main-memory DBs needing to stream the update log to storage: I’d like to point out that one big advantage of deterministic database systems (like the Calvin system we’re building at Yale) is that you only have to log the transaction input rather than every action of the transaction (as the ARIES protocol requires). Because the input deterministically generates the final state, logging the input is all you need. This can result in a 10x decrease in log output.
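Two sketches to illustrate the points above, both with invented names and record formats; neither is any product's actual design. First, a toy log-structured store: updates buffer in RAM and would reach storage only as flushes of whole sorted runs, i.e., as large sequential writes:

```python
from bisect import bisect_left

class TinyLSM:
    """Toy LSM-style store: writes hit an in-RAM memtable, which is flushed
    as an immutable sorted run; nothing is ever updated in place."""
    def __init__(self, memtable_limit=4):
        self.memtable, self.runs, self.limit = {}, [], memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # In a real system this run would be one large sequential write.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):      # newest data shadows oldest
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Second, the logging contrast: a deterministic engine can log just each transaction's input and regenerate state by replay, instead of logging every individual change:

```python
def transfer(db, args):        # must be deterministic for replay to work
    db[args["from"]] -= args["amount"]
    db[args["to"]] += args["amount"]

# ARIES-style physical logging would record each change (two records here);
# command logging records only the input (one record):
command_log = [("transfer", {"from": 17, "to": 99, "amount": 10})]

# Recovery is simply replaying the inputs in order:
db = {17: 100, 99: 50}
for _name, args in command_log:
    transfer(db, args)
print(db)                      # {17: 90, 99: 60}
```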
Igor, Ethan — thanks for the catches!
Dan — just one great example of a general point — when the log becomes the bottleneck, it becomes more important to optimize performance of the log. 🙂