A partial overview of Netezza database software technology
Netezza is having its user conference Enzee Universe in Boston Monday–Wednesday, June 21-23, and naturally will be announcing new products there, and otherwise providing hooks and inducements to get itself written about. (The preliminary count is seven press releases in all.) To get a head start, I stopped by Netezza Thursday for meetings that included a 3 ½ hour session with 10 or so senior engineers, and have exchanged some clarifying emails since.
It might be best to start with some Netezza product introduction and naming housekeeping:
- Netezza isn’t changing the hardware on any of its existing systems at this time. Rather, Netezza’s product upgrades are contained in a software-only release …
- … except that it isn’t a “release,” but rather a “wave.” There are three points to that terminological distinction:
  - The advanced analytics part doesn’t depend on the new database platform software.
  - Individual functions in the advanced analytics part don’t necessarily depend on advances in the analytics platform.
  - It plays on the surfboard-centric naming of Netezza’s appliances. 🙂
- Netezza has wisely scrapped its prior plan to make its advanced-analytics capabilities a chargeable add-on to its core appliance products. Rather, Netezza is going to offer advanced analytics as part of its core product. Part of the reason is that the interest in these capabilities is broader than Netezza first anticipated. The name for this is something like i-Class advanced analytics capabilities.
- There is a “release” in all this too, namely NPS 6.0 (Netezza Performance Software). That’s the core DBMS technology.
- It’s all to be shipped in Q3.
Highlights of our NPS 6.0 conversation include:
- As promised, Netezza has improved its compression significantly. Because this was anticipated, this upgrade was planned for in the design of the systems Netezza started introducing last summer. Consequently, the reduction in I/O produced by compression translates almost directly into better performance – the silicon is now more fully loaded than it was before, but few if any actual silicon bottlenecks have been introduced by the I/O improvement.
- Netezza’s other big performance enhancement is the introduction of clustered base tables, which it says can reduce I/O by an order of magnitude or better.
- Netezza says that there are individual queries in which the enhancements take query performance up 30-40X. (Presumably, those would be ones for which clustered base tables are a big win.)
- More interestingly, Netezza says that overall performance is improved by >2X. That’s queries, load, backup, and everything else all blended together.
- Underpinning all this, Netezza went from 125 MHz to a blend of 125 and 250 MHz in its FPGA clock speeds. Also, the width of the FPGA onboard data path went from 16 to 32 bits. Netezza suggests that the naive calculation which says this could increase FPGA throughput 4X isn’t entirely misleading.
- Netezza is pretty content with its workload management capabilities for queries, but nonetheless keeps adding features. Workload management has not yet been extended to cover all the non-query parts of the analytic functionality.
- Netezza continues to enhance its cost-based optimizer and query planner.
- Netezza has long used an internal networking approach that’s rather different from TCP/IP. Netezza views TCP/IP’s strength as recovering gracefully if there’s congestion. However, Netezza would rather do whatever it takes to preclude congestion in the first place, except perhaps in rare edge cases. I’m not aware of what enhancements, if any, have been made to Netezza’s internal networking specifically in NPS 6.0.
The basic idea of clustered base tables (“base tables” are ones that are not, for example, materialized views) is to range partition in multiple dimensions at once. Then you rule out (as in don’t retrieve) all those blocks that fail a match in any one of the cluster dimensions. Netezza says its customers were doing a lot of work to simulate this benefit by multiple sorts; Netezza’s implementation will now handle that much more automatically. Netezza says that talking to customers revealed that 4-5 cluster dimensions was almost always the most somebody would need; they will ship support for 4. That makes sense. In most cases, you’d want to cluster on the answers to “W” questions – Who, What, Where, When (but probably not Why), in one dimension each. However, Netezza does call out as an ideal use case geospatial, precisely because 2 (or more rarely 3) dimensions each have “equal weight.”
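To make the "rule out blocks" idea concrete, here is a minimal sketch in Python. It assumes each block carries a min/max summary per cluster dimension (roughly what a zone map provides); the names `Block`, `can_skip`, and `blocks_to_read` and the sample columns are hypothetical illustrations, not anything from Netezza's software.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A storage block with a (min, max) summary per cluster dimension."""
    block_id: int
    ranges: dict  # column name -> (min, max); hypothetical layout


def can_skip(block, predicates):
    """A block is ruled out if it fails to overlap in ANY one dimension."""
    for col, (lo, hi) in predicates.items():
        bmin, bmax = block.ranges[col]
        if hi < bmin or lo > bmax:
            return True
    return False


def blocks_to_read(blocks, predicates):
    """Return the ids of the blocks that still have to be retrieved."""
    return [b.block_id for b in blocks if not can_skip(b, predicates)]


# Four blocks clustered on two dimensions ("who" and "when").
blocks = [
    Block(0, {"customer_id": (1, 100), "day": (1, 10)}),
    Block(1, {"customer_id": (1, 100), "day": (11, 20)}),
    Block(2, {"customer_id": (101, 200), "day": (1, 10)}),
    Block(3, {"customer_id": (101, 200), "day": (11, 20)}),
]
query = {"customer_id": (150, 160), "day": (12, 14)}
print(blocks_to_read(blocks, query))  # only block 3 overlaps in both dimensions
```

The more dimensions a query restricts, the more blocks drop out, which is where the claimed I/O reduction would come from.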
I don’t know how other vendors implement clustered base tables, but in Netezza’s case it’s via a space-filling curve. (Actually, they called it a “Hilbert space-filling curve,” but I oppose that phrasing, as it’s apt to lead to extremely incorrect use of the term “Hilbert space.”) I.e., data is mapped to 4-tuples (say) in line with the dimensions, which are then sorted in a linear order in line with a space-filling curve. Happily, Netezza hasn’t experienced problems clustering columns that have particularly challenging cardinality or skew.
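As an illustration of sorting tuples along a space-filling curve, the sketch below uses the textbook 2-dimensional Hilbert index on an n-by-n grid. It is an assumption-laden toy: 2 dimensions rather than 4, small integer coordinates, and a function name (`hilbert_index`) of my own choosing. I don't know the details of Netezza's actual mapping.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two; this is the standard iterative formulation.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Rows keyed by two cluster dimensions, already bucketed onto a 4 x 4 grid.
rows = [(3, 0), (0, 0), (1, 1), (0, 2), (2, 2), (1, 0)]
clustered = sorted(rows, key=lambda r: hilbert_index(4, r[0], r[1]))
print(clustered)
```

The point of the curve is that rows that are close in the linear order are also close in every dimension, so a block of consecutive rows covers a small range in each dimension and per-block min/max checks stay selective.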
If I understood correctly, you can only zone map (and presumably cluster) on integers and dates right now, but that will change soon. (Edit: In blog comments and email, Tim Greenwood of Netezza explained to me that the NPS 6.0 workarounds to that were much more robust than I realized.)
Netezza put a lot of work for NPS 6 into something it calls “table grooming,” which amounts to recopying tables in a more beneficial form. Uses for table grooming – which is a manually initiated process – include but probably aren’t limited to:
- Clustering tables and, as needed, reclustering them.
- Getting rid of data that was deleted. (Netezza has Postgres-style multiversion concurrency control – MVCC – but no time-travel, so keeping around deleted data is a waste of space.)
- Recompressing data from Compress Engine 1 to Compress Engine 2.
- Alter Table
The core ideas of table grooming include:
- The Netezza NPS software copies rows from one place to another.
- Netezza NPS then updates the appropriate metadata.
- Metadata updates are transactional, even though the actual data movement is not.
This can be done one part of a table at a time. Reads and loads are unaffected by the process, or at least not blocked. Delete commits are indeed blocked during a reorg; Netezza guesses that the blocks last a few minutes during the grooming of a clustered base table, 10-15 seconds if space is being reclaimed, and something similar for an Alter Table.
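The copy-then-swap pattern behind those core ideas can be sketched as follows. This is a toy model under loud assumptions: a table is a list of extents, a lock stands in for a transactional metadata update, and every name (`ToyTable`, `groom_extent`, the `(key, deleted)` row shape) is hypothetical rather than taken from NPS.

```python
import threading


class ToyTable:
    """Toy table: metadata is the list of extents that readers may see."""

    def __init__(self, extents):
        self.extents = [list(e) for e in extents]
        self._meta_lock = threading.Lock()  # stand-in for a transactional update

    def scan(self):
        # Readers snapshot the metadata, then read without further locking.
        with self._meta_lock:
            snapshot = list(self.extents)
        return [key for ext in snapshot for key, deleted in ext if not deleted]

    def groom_extent(self, i):
        old = self.extents[i]
        # Step 1 (not transactional): copy live rows to a new, sorted extent.
        new = sorted((key, False) for key, deleted in old if not deleted)
        # Step 2 (atomic): repoint the metadata at the new extent.
        with self._meta_lock:
            self.extents[i] = new


table = ToyTable([[(5, False), (2, True), (9, False)], [(1, False), (7, True)]])
before = sorted(table.scan())
for i in range(len(table.extents)):  # groom one part of the table at a time
    table.groom_extent(i)
after = sorted(table.scan())
print(before == after, table.extents)
```

Because the old extent is never modified in place, a reader holding an earlier snapshot keeps seeing consistent data, which is one plausible reason reads would not be blocked.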
And finally, here are some notes on Netezza’s query optimization and planning.
- Netezza has a traditional cost-based optimizer, in which all operations have estimated costs, measured in microseconds, irrespective of which parts of the system (CPU, I/O, network, whatever) they most stress. (I have trouble imagining how a cost-based optimizer could work differently from that without incurring huge computational costs.)
- Netezza’s bottleneck is almost always disk I/O.
- Netezza’s optimizer is not/no longer based on the PostgreSQL optimizer.
- Netezza does a lot of query transformation. Key points include:
  - Netezza joins are usually very cheap.
  - Filtered scans are cheap too.
  - More expensive in Netezza are data redistribution (duh), sorts, and unfiltered scans.
  - Most expensive of all are intermediate result sets that don’t fit into memory.
- Specific examples of Netezza query transformation include:
  - Pushing predicates out to nodes.
  - Flattening query trees and eliminating subqueries.
  - Rewriting windowed aggregates to be joins + grouped aggregates.
  - (New in 6.0) Transforming outer joins into other kinds.
- Netezza does real-time sampling to help with query planning. (But this is only worth doing for queries that are estimated to be expensive.) Zone maps (and clustering too?) are invoked as part of deciding where to sample. Sampling was for scans only prior to NPS 6.0, and will now be done for joins as well.
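To illustrate one of the transformations listed above, rewriting a windowed aggregate as a join plus a grouped aggregate, here is a small demonstration using Python's built-in `sqlite3` module (window functions need SQLite 3.25 or later). The `sales` table and its data are made up, and this shows only the logical equivalence for a simple partition-wide `SUM`; it says nothing about how Netezza's planner actually performs the rewrite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 10), ('east', 30), ('west', 5), ('west', 15), ('west', 20);
""")

# Original form: a windowed aggregate over each region.
windowed = """
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, amount
"""

# Rewritten form: grouped aggregate joined back to the detail rows.
rewritten = """
    SELECT s.region, s.amount, g.region_total
    FROM sales AS s
    JOIN (SELECT region, SUM(amount) AS region_total
          FROM sales
          GROUP BY region) AS g
      ON g.region = s.region
    ORDER BY s.region, s.amount
"""

windowed_rows = conn.execute(windowed).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()
print(windowed_rows == rewritten_rows)
```

The rewrite is only this simple when the window has no ordering or frame clause and the partition column has no NULLs; anything fancier needs a more careful transformation.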
Related links
- Notes on this week’s spate of Netezza-related blog posts
- How Netezza (and IBM) do database compression
- Netezza’s silicon balance
Comments
15 Responses to “A partial overview of Netezza database software technology”
You wrote “If I understood correctly, you can only zone map (and presumably cluster) on integers and dates right now, but that will change soon.”
Integers and dates are zone mapped by default, but you can zone map, and cluster on all data types except for numeric of size > 18.
@Tim,
I thought that functionality didn’t make it out of QA for NPS 6.0. Did I misunderstand?
In NPS 6.0 you can cluster (organize on) all datatypes except for numeric of size > 18. These types are zone mapped by including them in the ORDER BY clause of CREATE MATERIALIZED VIEW.
@Tim,
Thanks!
Next question — why do zone maps have anything to do with materialized views?
CAM
There are many “space filling curves” of which “Hilbert space filling curves” are a particular instance.
How do “clustered tables” differ from DB2’s multidimensional clustering?
Daniel,
I have just begun to look into DB2 in detail. I haven’t gotten to the clustered tables part.