Estimating user data vs. spinning disk
There’s a lot of confusion about how to measure data warehouse database size. Major complicating factors include:
- Indexes and temporary working space. That’s what I emphasized a couple of years ago in my post about Expansion Ratios.
- Compression. I write about database compression a lot.
- Disk redundancy. I usually gloss over that one, but I’ll try to make amends in this post.
- Replication other than that which is primarily designed for redundancy. I usually gloss over that one too, and I think it’s safe to continue doing so. That’s because data warehouse replication – at least in most of the system architectures I know of – generally divides into three categories:
  - a lot like redundancy
  - a lot like an index
  - only a minor issue (e.g., when small fact tables are replicated across each node of an MPP cluster)
Greenplum’s CTO Luke Lonergan recently walked me through the general disk usage arithmetic for Greenplum’s most common configuration (Sun Thors*, configured to RAID 10). I found it pretty interesting, and a good guide to factors that also affect other systems, from other vendors.
*I presume that “Thor” is the successor to “Thumper” because the names alliterate, and Thor thumped things.
Here goes.
- A Thor box has 48 disks.
- 2 disks are used for the operating system (i.e., software) rather than data.
- 2 more disks are reserved as hot spares.
- Thus, there are 44 disks that can store data.
- RAID 10 knocks off a factor of 2.
- General wastage from formatting knocks off a factor of 10% or so.
- Temp space is 0.33X the user data. Internode mirroring adds another 1X. Taken together (1 + 0.33 + 1), those two considerations knock off another factor of 2.33.
Thus, before factoring in compression, the amount of user data one can put on a 48-disk box is 8.5X the rated capacity of one disk (which may be ¼, ½, or 1 terabyte). However, just to be on the safe side, what Greenplum actually quotes is a factor of 6X, for what one might call a “loss factor” of 8:1 (48 disks ÷ 6).
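For readers who want to check the arithmetic, here’s a minimal Python sketch of the calculation as I understand it. The constants come straight from the list above; the variable names are mine, not Greenplum’s.

```python
# Sketch of the Thor/RAID 10 disk-usage arithmetic described above.

TOTAL_DISKS = 48
OS_DISKS = 2       # used for the operating system (i.e., software)
HOT_SPARES = 2     # reserved as hot spares

data_disks = TOTAL_DISKS - OS_DISKS - HOT_SPARES  # 44 disks can store data

RAID10_FACTOR = 2        # RAID 10 mirrors everything, halving usable space
FORMATTING_LOSS = 0.10   # ~10% general wastage from formatting
TEMP_SPACE = 0.33        # temp/working space, as a fraction of user data
INTERNODE_MIRROR = 1.0   # internode mirroring adds another 1X

# Each unit of user data really occupies (1 + 0.33 + 1) = 2.33 units of space.
overhead = 1 + TEMP_SPACE + INTERNODE_MIRROR

user_data = data_disks / RAID10_FACTOR * (1 - FORMATTING_LOSS) / overhead
print(f"User data: {user_data:.1f}X the rated capacity of one disk")  # -> 8.5X
print(f"Quoted loss factor: {TOTAL_DISKS / 6:.0f}:1")  # 6X quote -> 8:1
```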
And just to confuse things — compression can get most or all of that back. For example, at a multi-petabyte customer that is loading up its Greenplum/Thor machines now, early indications suggest a compression factor of 7.5X. (I didn’t actually ask, but I assume that’s including indexes, as is common when discussing overall compression figures. As I understand the application(s), there probably aren’t a lot of indexes anyway.)
And so, after all those calculations, the amount of user data winds up to be almost exactly equal to the amount of spinning disk. (But for a vendor or database with a different compression ratio, that rough equality would of course not hold.)
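To spell out that rough equality (my arithmetic, using the conservative 6X quote rather than the 8.5X ceiling): 6X of pre-compression user data times 7.5X compression yields 45 disk-capacities of user data on a box with 48 disks of spinning disk.

```python
# Folding compression into the sketch above. The 7.5X figure is the early
# indication from the multi-petabyte customer mentioned in the post.

QUOTED_USER_DATA = 6    # Greenplum's conservative pre-compression quote
COMPRESSION = 7.5       # observed compression factor

effective = QUOTED_USER_DATA * COMPRESSION  # disk-capacities of user data
print(f"{effective:.0f} disk-capacities of user data vs. 48 spinning disks")
# -> 45 vs. 48, i.e., almost exactly 1:1
```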
That assumes I got it all right, of course. There was quite a lot of telephone discussion and back-and-forth e-mail to get even this far, so one or more errors could have easily slipped through.
Comments
Yep – it’s right enough, but confusing!
One key thing to note – only the RAID 10 consumes spindles; the rest of the overhead consumes space on the devices. Spindles are expensive, space is dirt cheap.
In other words, with RAID 10, you need physically distinct devices to enable fault tolerance.
The inter-host mirroring we provide consumes disk *space* on mirror hosts, allocated from the same group of RAIDed spindles that the primary space is on.