More on temp space, compression, and “random” I/O
My PhD was in a probability-related area of mathematics (game theory), so I tend to squirm when something is described as “random” that clearly is not. That said, a comment by Shilpa Lawande on our recent flash/temp space discussion suggests the following way of framing a key point:
- You really, really want to have multiple data streams coming out of temp space, as close to simultaneously as possible.
- The storage performance characteristics of such a workload are more reminiscent of “random” than “sequential” I/O.
If everybody else is cool with it too, I can live with that. 🙂
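To make that framing a bit more concrete, here is a minimal sketch of why several simultaneous streams out of temp space look non-sequential to the storage device. It assumes a toy model that is not taken from any particular DBMS: sorted runs are laid out back to back in one temp area, a k-way merge reads one record at a time, and the names `merge_access_pattern` and `seek_fraction` are made up for illustration. Real systems read in larger blocks, which softens the effect without removing it.

```python
import heapq
import random


def merge_access_pattern(num_runs, run_len, seed=0):
    """Merge sorted runs and record the temp-space offset of every read.

    Assumed layout (hypothetical): run i occupies the contiguous offsets
    [i * run_len, (i + 1) * run_len) in a single temp area.
    """
    rng = random.Random(seed)
    runs = [sorted(rng.random() for _ in range(run_len)) for _ in range(num_runs)]

    # Heap entries are (value, run index, position within the run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs)]
    heapq.heapify(heap)

    merged, offsets = [], []
    while heap:
        value, i, pos = heapq.heappop(heap)
        merged.append(value)
        offsets.append(i * run_len + pos)
        if pos + 1 < run_len:
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return merged, offsets


def seek_fraction(offsets):
    """Fraction of reads that do not directly follow the previous offset."""
    jumps = sum(1 for a, b in zip(offsets, offsets[1:]) if b != a + 1)
    return jumps / (len(offsets) - 1)


if __name__ == "__main__":
    for k in (1, 2, 8):
        _, offsets = merge_access_pattern(num_runs=k, run_len=1000)
        print(k, "runs -> fraction of non-sequential reads:", seek_fraction(offsets))
```

With a single run every read follows the previous one, so the fraction is zero. With several runs, most reads jump to a different run, even though each individual stream is perfectly sequential.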
Meanwhile, I talked again with Tim Vincent of IBM this afternoon. Tim endorsed the temp space/Flash fit, but with a different emphasis, which upon review I find I don’t really understand. The idea is:
- Analytic DBMS processing generally stresses reads over writes.
- Temp space is an exception — read and write use of temp space is pretty balanced. (You spool data out once, you read it back in once, and that’s the end of that; next time it will be overwritten.)
My problem with that is: Flash typically has lower write IOPS than read IOPS (I/O operations per second), so being (relatively) write-intensive would, to a first approximation, seem if anything to make a workload a worse fit for flash.
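The first-approximation arithmetic behind that objection can be sketched as follows. This assumes a deliberately simple model in which every operation takes the reciprocal of the device's rated IOPS for its type and operations do not overlap. The device figures are hypothetical, chosen only for illustration, and `blended_iops` is a made-up name.

```python
def blended_iops(read_iops, write_iops, write_fraction):
    """Effective IOPS for a read/write mix under a simple serial model.

    Average time per operation is the mix-weighted sum of 1/read_iops and
    1/write_iops; effective IOPS is the reciprocal of that average.
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    avg_time = (1.0 - write_fraction) / read_iops + write_fraction / write_iops
    return 1.0 / avg_time


if __name__ == "__main__":
    # Hypothetical device: reads four times faster than writes.
    READ_IOPS, WRITE_IOPS = 100_000, 25_000
    for label, wf in (("read-heavy (10% writes)", 0.1), ("balanced (50% writes)", 0.5)):
        print(label, round(blended_iops(READ_IOPS, WRITE_IOPS, wf)))
```

Under these assumptions, the balanced mix gets noticeably fewer effective operations per second than the read-heavy one. Whether that outweighs flash's advantage over rotating disk for the same workload is a separate question that this sketch does not address.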
On the plus side, I was reminded of something I should have noted when I wrote about DB2 compression before:
Much like Vertica, DB2 operates on compressed data all the way through, including in temp space.
Comments
6 Responses to “More on temp space, compression, and “random” I/O”
[…] By way of contrast, Tim is cautious about the common approach of just lowering a query’s priority. His concern is that a long-running query could linger even longer, creating a long-lasting bottleneck in, for example, temp space. […]
Well, for multiple data streams, maybe “concurrency” is the key word. If data needs to be fetched from several places on the storage device concurrently, that places a rather heavy load on the read heads of rotating disks. And temp space is often used by several sessions concurrently.
It is also used for sort segments and hashes, which have random access patterns; for example, this presentation shows that SSDs are a good fit for that:
http://www.cs.arizona.edu/~bkmoon/papers/sigmod08ssd-slides.pdf , slides 17-20.
Slide 19 is really interesting. Thanks!
Related to IBM (Tim) comment
In the early days of DB2, temp storage was mostly used for sorting the output of queries such as ORDER BY. In most of these queries (though not all, e.g. GROUP BY), the number of writes approaches the number of reads.
Talking about concurrency: I haven’t noticed any degradation, but rather a 400% performance improvement.
http://code.google.com/p/mist01/wiki/Vertica_demystified
Is this because I’ve used relatively small datasets for this test?
[…] you can choose to put just your most bottlenecking data on Kaminario K2 – the hot stuff, your temp space, your logs, […]