Data warehouse storage options — cheap, expensive, or solid-state disk drives
This is a long post, so I’m going to recap the highlights up front. In the opinion of somebody I have high regard for, namely Carson Schmidt of Teradata:
- There’s currently a huge — one order of magnitude — performance difference between cheap and expensive disks for data warehousing workloads.
- New disk generations coming soon will have best-of-both-worlds aspects, combining high-end performance with lower-end cost and power consumption.
- Solid-state drives will likely add one or two orders of magnitude to performance a few years down the road. Echoing the most famous logjam in VC history — namely the 60+ hard disk companies that got venture funding in the 1980s — 20+ companies are vying to cash in.
In other news, Carson likes 10 Gigabit Ethernet, dislikes Infiniband, and is “ecstatic” about Intel’s Nehalem, which will be the basis for Teradata’s next generation of servers.
Here’s the longer version.
Oliver Ratzesberger of eBay made the interesting comment to me that 15K RPM disk drives could have 10X or more the performance of 7200 RPM ones, a difference that clearly is not explained just by rotational speed. He said this was due to the large number of retries required by the cheaper drives, which eBay had tested as being in the 5-8X range on its particular equipment, for an overall 10X+ difference in effective scan rates. When I continued to probe, Oliver suggested that the guy I really should talk with is Carson Schmidt of Teradata, advice I took eagerly based on past experience.
Yesterday, Carson — who was unsurprised at Oliver’s figures* — patiently explained to me his views of the current differences between cheap and expensive disk drives. (Carson uses the terms “near-line” and “enterprise-class”.) Besides price, cheap drives optimize for power consumption, while expensive drives optimize for performance and reliability. Currently, for Teradata, “cheap” equates to SATA, “expensive” equates to Fibre Channel, and SAS 1.0 isn’t used. But SAS 2.0, coming soon, will supersede both of those interfaces, as discussed below.
*Carson did note that the performance differential varied significantly by the kind of workload. The more mixed and oriented to random reads the workload is, the bigger the difference. If you’re just doing sequential scans, it’s smaller. Oliver’s order-of-magnitude figure seemed to be based on scan-heavy tasks.
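To make the workload dependence concrete, here is a rough, illustrative model of my own. All of the drive figures in it are assumptions for the sake of the sketch, not eBay’s or Teradata’s measurements, and real-world effects like retries under vibration and on-drive command queuing would widen the random-read gap further.

```python
# Illustrative sketch only: assumed drive figures, not measured data.
# A drive is characterized by its sequential scan rate and its random-read
# throughput (random small reads are dominated by seek + rotational latency).

def blended_rate(seq_mb_s, rand_mb_s, random_fraction):
    """Effective MB/s when random_fraction of the bytes are read randomly.

    Time to move 1 MB = random_fraction / rand_mb_s + (1 - random_fraction) / seq_mb_s,
    so the blended rate is a harmonic mix of the two component rates.
    """
    return 1.0 / (random_fraction / rand_mb_s + (1.0 - random_fraction) / seq_mb_s)

# Hypothetical drives: a 7200 RPM near-line SATA drive vs. a 15K RPM enterprise drive.
cheap = dict(seq_mb_s=90.0, rand_mb_s=0.6)    # ~75 random 8 KB reads/second
pricey = dict(seq_mb_s=140.0, rand_mb_s=2.4)  # ~300 random 8 KB reads/second

for f in (0.0, 0.25, 0.5, 0.75, 1.0):
    ratio = blended_rate(**pricey, random_fraction=f) / blended_rate(**cheap, random_fraction=f)
    print(f"{int(f * 100):3d}% random reads -> expensive drive is {ratio:.1f}x faster")
```

The exact ratio depends entirely on the numbers you plug in; the point is only the shape of the curve, i.e., the more random the reads, the more the expensive drive’s seek and latency advantages dominate.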
As I understand Carson’s view, mechanical features sported only by expensive drives include:
- Smaller media, more platters, and more disk heads
- Faster rotational speeds
- Enclosures that do a better job of damping vibration from disk rotation or fans.
Electronic features of expensive storage include:
- More CPU (at least 2X)
- More RAM (also at least 2X), which is useful for caching.
- Dual ports for networking. Teradata doesn’t just use dual storage ports for reliability; it load balances across them and sometimes gets significantly enhanced performance.
Finally, there is firmware, in which expensive disk drives seem to have two major kinds of advantages:
- Command scheduling/queuing, which Carson believes provides a benefit at least comparable to the 2X derived from different rotational speeds. (A toy illustration of the idea follows below.)
- Better data integrity checking, in line with the T10 DIF standard. Not only does this seem to give much higher reliability, but it can be done closer to the platter, yielding a performance advantage.
Apparently, this isn’t even possible for SATA and SAS 1.0 disk drives, but is common for drives that use the Fibre Channel interface, and will also be possible in the forthcoming SAS 2.0 standard. (As you may have guessed, I’m a little fuzzy on the details of this firmware stuff.)
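To give a feel for what command queuing buys, here is a toy simulation of my own (not Teradata’s setup or any vendor’s actual firmware) that compares servicing a queue of random requests in arrival order against a simplified shortest-seek-first reordering:

```python
# Minimal sketch: why on-drive command queuing helps under concurrency.
# We compare servicing a queue of random track requests in arrival (FIFO)
# order against greedily picking the nearest outstanding request each time,
# a simplified shortest-seek-time-first policy. Purely illustrative.

import random

random.seed(42)
NUM_TRACKS = 100_000
requests = [random.randrange(NUM_TRACKS) for _ in range(32)]  # queue depth of 32

def total_head_travel(queue, reorder):
    head = 0                  # current head position, as a track number
    pending = list(queue)
    travelled = 0
    while pending:
        if reorder:
            nxt = min(pending, key=lambda t: abs(t - head))   # nearest request first
        else:
            nxt = pending[0]                                  # strict arrival order
        travelled += abs(nxt - head)
        head = nxt
        pending.remove(nxt)
    return travelled

fifo = total_head_travel(requests, reorder=False)
queued = total_head_travel(requests, reorder=True)
print(f"FIFO order     : {fifo:>9,} tracks of head travel")
print(f"Reordered queue: {queued:>9,} tracks of head travel ({fifo / queued:.1f}x less movement)")
```

Real drive firmware is considerably smarter than this, e.g., it also accounts for rotational position, but the basic effect is the same: with a deep queue of outstanding requests, reordering sharply cuts head movement.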
In Carson’s view, the disk drive industry has consolidated to the point that there are two credible vendors of expensive/enterprise-class disk drives: Seagate and Hitachi. What Teradata actually uses in its own systems right now is:
- In Teradata’s high-end 5550 line — Seagate Fibre Channel 3 1/2″ drives
- In Teradata’s mid-range 2550 line — SAS drives from Seagate and perhaps also Hitachi. I get the impression these have some of the electromechanical features of expensive drives, but not the firmware.
- In Teradata’s low-end 1550 line — Hitachi 1-TB cheap drives.
All this is of course subject to change. In the short term that mainly means the possible use of alternate suppliers. As the Teradata product line is repeatedly refreshed, however, greater changes will occur. Some of the biggest are:
- A new SAS 2.0 standard will allow enterprise-class firmware for cheaper disks.
- The form factor for high-end disk drives will shrink from 3 1/2″ to 2 1/2″, at half the volume.
- The rotation speed sweet spot may actually decrease, to 10K RPM, with offsetting improvements to seek time and latency so as not to cut performance. Power consumption benefits will ensue. (A rough sketch of the arithmetic follows below.)
- There probably will be multi-TB SAS drives — “fat SAS.” SATA may be enhanced to compete with those. And by the way, SAS and SATA are electrically compatible, and hence could be combined in the same system.
I got the impression that at least the first three of these developments are expected soon, perhaps within a year.
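Here is the rough arithmetic behind that rotational-speed trade-off. The seek figures are illustrative assumptions of mine, not numbers Carson quoted:

```python
# Back-of-envelope: average random access time = average seek + half a rotation.
# Figures are illustrative assumptions, not vendor specs.

def avg_access_ms(avg_seek_ms, rpm):
    half_rotation_ms = (60_000.0 / rpm) / 2.0   # average rotational latency
    return avg_seek_ms + half_rotation_ms

today_15k = avg_access_ms(avg_seek_ms=3.5, rpm=15_000)   # 3.5 + 2.0 = 5.5 ms
future_10k = avg_access_ms(avg_seek_ms=2.5, rpm=10_000)  # 2.5 + 3.0 = 5.5 ms

print(f"15K RPM drive: {today_15k:.1f} ms per random access")
print(f"10K RPM drive: {future_10k:.1f} ms per random access (with a faster actuator)")
```

In other words, if seek times improve by about as much as rotational latency worsens, a drive can spin slower, and draw less power, without giving up random-access performance.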
And in a few years all of this will be pretty moot, because solid-state drives (SSDs) will be taking over. Carson thinks SSDs will have a 100X performance benefit versus disk drives, a figure that took me aback. However, he’s not yet sure how quickly SSDs will mature. Also complicating things is a possible transition some years down the road from SLC (Single-Level Cell) to MLC (Multi-Level Cell) SSDs. MLC SSDs, which store multiple bits of information per cell, are surely denser than SLC SSDs. I don’t know whether they’re more power-efficient as well.
The main weirdnesses Carson sees in SSDs are those I’ve highlighted in the following quote from Wikipedia:
One limitation of flash memory is that although it can be read or programmed a byte or a word at a time in a random access fashion, it must be erased a “block” at a time. …
Another limitation is that flash memory has a finite number of erase-write cycles. … This effect is partially offset in some chip firmware or file system drivers by counting the writes and dynamically remapping blocks in order to spread write operations between sectors; this technique is called wear leveling. Another approach is to perform write verification and remapping to spare sectors in case of write failure, a technique called bad block management (BBM).
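To make the wear-leveling idea concrete, here is a minimal toy sketch of the remapping technique the quote describes; it is my own illustration, not any vendor’s actual flash translation layer:

```python
# Toy flash translation layer illustrating wear leveling: logical blocks are
# dynamically remapped so that repeated writes spread across physical blocks
# instead of hammering the same cells. Illustrative only.

NUM_PHYSICAL_BLOCKS = 8

erase_counts = [0] * NUM_PHYSICAL_BLOCKS       # wear accumulated per physical block
logical_to_physical = {}                       # current logical -> physical mapping
free_blocks = set(range(NUM_PHYSICAL_BLOCKS))  # physical blocks available for new writes

def write(logical_block):
    """(Re)write a logical block by mapping it to the least-worn free physical block."""
    old = logical_to_physical.get(logical_block)
    if old is not None:
        erase_counts[old] += 1     # the stale copy will eventually be erased
        free_blocks.add(old)
    target = min(free_blocks, key=lambda b: erase_counts[b])
    free_blocks.discard(target)
    logical_to_physical[logical_block] = target

# A pathological workload: rewrite the same logical block a thousand times.
for _ in range(1000):
    write(logical_block=0)

# Without remapping, one physical block would absorb ~1000 erases;
# with wear leveling the erases come out roughly even across all eight.
print("erase counts per physical block:", erase_counts)
```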
And finally, I unearthed a couple of non-storage tidbits, since I was talking with Carson anyway:
- Carson has become a 10 GigE “bigot”, and Teradata will soon certify 10 Gigabit Ethernet cards for connectivity to external systems. Carson’s interest in Infiniband, never high, went entirely away after Cisco decommitted to it. Obviously, this stands in contrast to the endorsements of Infiniband for data warehousing by Oracle and Microsoft.
- Intel’s Nehalem will be the basis for Teradata’s next server product. Carson is “ecstatic” with Intel at the moment, which is different from his stance at other times.
Comments
Thanks for the details. Have any independent studies been published to validate the claims about possible 10x performance differences?
Mark,
I don’t have serious data beyond what I posted. I’m hoping other folks with information will jump into the discussion.
I am sure they are right as long as their claim includes ‘could have’. I get plenty of mail and email telling me I could be a lottery winner.
I get 100+ IOPS and 50 MB/second from consumer-grade 7200 RPM SATA disks at home, so a 15K disk would need to do 1,000 IOPS and 500 MB/second to be 10X better, or my cheap disk would need to do retries on almost every request. I don’t think that is typical. Maybe they had a bad batch of disks or very old disks.
There is SMART monitoring on disks that counts retries and other stats, and there have been a few large-scale studies based on this data. So the data is there, but it isn’t easy to accumulate at large scale.
This paper is a good start and has references to other good papers.
http://labs.google.com/papers/disk_failures.pdf
Mark,
I don’t understand why you’re extrapolating from your home system to eBay’s data warehouse. Are the workloads similar?
In particular, Oliver tells me the problem usually doesn’t arise when there’s only one query running, especially if the query can be satisfied by quasi-sequential scans. How many simultaneous queries did you run your test with?
Thanks,
CAM
All: the 10x numbers have to be colored by the application behavior. As an application Teradata is a hash distributed architecture, meaning that all IO (ALL) is random. Teradata also happens to exploit massive amounts of IO – essentially highly optimized software specifically designed to exploit best in class brute force hardware.
When high session concurrency is factored into how the database operates, this results in a very large number of different IO paths in addition to the random placement of the data. There are certainly efforts by the entire tech stack to geographically colocate like data, but it is this random IO with concurrency environment which causes much, much greater head movement.
When you compute seek time and rotational time together in a 100% random block-read environment, 10x is simple to see. At 7200 RPM, it takes 2x longer to read a track of data on a SATA drive, which is also likely to have several times more data per track, most of which is not needed for random IO.
With SATA seek latency at ~4x that of the FC drives, the two combine for something larger than 10x. This doesn’t even count the compute and algorithmic issues, or where in the tech stack the computation occurs. A highly simplistic and not entirely realistic example, but it should illustrate the point.
SATA disk systems compete very well in low concurrency large sequential block environments, which is entirely opposite the Teradata environment. So, the 10x number being quoted here is not a surprise.
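For anyone who wants the per-request arithmetic spelled out, here is a back-of-envelope sketch with illustrative drive figures of my own (the ~4x seek ratio follows the comment above; everything else is an assumption):

```python
# Per-request service time for a purely random block read:
#   average seek + average rotational latency (half a rotation) + transfer time.
# Drive figures below are illustrative assumptions, not vendor specs.

def random_read_ms(avg_seek_ms, rpm, block_kb, transfer_mb_s):
    rotational_ms = (60_000.0 / rpm) / 2.0          # half a rotation, on average
    transfer_ms = block_kb / 1024.0 / transfer_mb_s * 1000.0
    return avg_seek_ms + rotational_ms + transfer_ms

# SATA seek assumed at ~4x the FC drive's, per the comment above.
sata_7200 = random_read_ms(avg_seek_ms=14.0, rpm=7200, block_kb=128, transfer_mb_s=90)
fc_15k = random_read_ms(avg_seek_ms=3.5, rpm=15_000, block_kb=128, transfer_mb_s=140)

print(f"near-line SATA: {sata_7200:.1f} ms/read -> {1000 / sata_7200:4.0f} random reads/second")
print(f"enterprise FC : {fc_15k:.1f} ms/read -> {1000 / fc_15k:4.0f} random reads/second")
print(f"gap from seek + rotation + transfer alone: {sata_7200 / fc_15k:.1f}x")
```

With these particular figures, seek plus rotation alone accounts for roughly a 3x per-request gap; the concurrency-driven head movement, command queuing, track-density, and retry effects discussed above and in the post are the sort of things that would widen it further.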
@Michael – you are the first person to ever claim that a 15k enterprise-grade SAS disk can do 10x more IOPs than a 7200 RPM consumer-grade SATA disk. Congratulations.
@Curt – workload has nothing to do with it. Oliver has made a controversial claim with no substantiation. That is marketing and nothing else.
Mark,
Maybe the eBay guys diagnosed their situation correctly and maybe they didn’t, but I can’t begin to fathom your basis for saying that workload has nothing to do with it.
CAM
Curt,
I agree with you that workload has something to do with performance. Ignore the poor wording. I mean that you won’t get 10X more MB/s or IOPS from 15K SAS versus 7200 RPM SATA. Teradata has done clever things with track-aligned reads to optimize disk performance. I would much rather read about that.
Check AnandTech for reviews of SSDs. The latest is from 20 March 2009 and deals explicitly with some of the issues here. An earlier review dealt with the “block” write versus read issue.
The value of SSD is not going to be in highly redundant, flat-file (call it whatever you want) style datastores; the price will be too high. The value will be in high-NF relational databases. Now, in my opinion (which you can read, and I am not alone), SSD will be the motivator that merges OLTP back together with its various replicants. SSD (and the flash versions, both MLC and SLC, are only the latest low-end implementations; check Texas Memory Systems for one example of industrial-strength SSD) removes the join penalty from 3/4/5NF databases.
The bottleneck will be in finding folks with enough smarts to embrace (again) Dr. Codd’s vision. The XML folk are not that kind of folk. My candidate is Larry Ellison. The reason is that the Oracle architecture, MVCC, is superior for OLTP (IBM finally just capitulated with EnterpriseDB). With SSD, he can use the Oracle database, appropriately normalized, to support both without stars and snowflakes. A true one-stop solution.
Robert,
I understand the appeal of saying something like “The reason we need to be aware of physical design is largely complex query performance. Complex query performance is an issue mainly because of I/O. If we have better storage technology, that problem goes away, and we can start ignoring physical design the way the theorists have always wanted us to.”
But I think we’re a long way from reaching that ideal, at best. Data warehouses are BIG, and getting bigger. They’ll push the limits of hardware technology for a long time to come.
Or someone can just go write a sensible DBMS that doesn’t force you to link the logical format with the physical. There’s no reason that several normalized relations can’t be stored as a single denormalized table on disk, if that happens to be best for the query load. Column-oriented systems are an example of a different storage method under a relational front-end, though they suffer just as badly from not being able to store things in a row-oriented manner when that makes more sense.
Dear all,
Does anybody know the cost of adding 1 TB of storage to an existing warehouse? My client company has a Sybase IQ data warehouse, and I’m just curious what the incremental cost of 1 TB would be, because they might add up to 3.
Regards,
Sai.
I think you’d do best to check with Sybase on that. Prices change too often for me to have that memorized.
On the plus side, they often have fairly clear web pages with their list pricing.