Netezza
Analysis of Netezza and its data warehouse appliances.
Generally available Kudu
I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:
- Security is an ever bigger deal.
- There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for such systems respond well to the actual term “data warehouse”, at least when it’s preceded by some modifier suggesting that it’s modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliance (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share), rather than from general purpose DBMS such as Oracle or SQL Server.
- Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.
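To make the vectorization point concrete, here’s a toy Python/numpy illustration of my own — numpy is just a stand-in, and neither Cloudera nor Kudu was described in these terms. The idea is simply that operating on whole columns at once is the kind of work that maps naturally onto SIMD-style CPU instructions.

```python
# Toy illustration (mine, not Cloudera's): columnar data laid out as contiguous
# arrays lets one operation be applied across many values at once, which
# libraries and compilers can map onto SIMD instructions.
import numpy as np

prices = np.random.rand(1_000_000)
quantities = np.random.rand(1_000_000)

# Row-at-a-time: one multiply per loop iteration.
revenue_slow = [p * q for p, q in zip(prices, quantities)]

# Vectorized: one call over whole columns; numpy dispatches to SIMD-friendly code.
revenue_fast = prices * quantities
```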
Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:
- A data storage system introduced by Cloudera (and subsequently open-sourced).
- Columnar.
- Updatable in human real-time.
- Meant to serve as the data storage tier for Impala and Spark.
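To illustrate what “columnar but updatable” means in principle, here’s a deliberately tiny sketch of my own — it is not Kudu’s actual design, just per-column storage plus a primary-key index so individual rows can be updated in place.

```python
# A toy "columnar but updatable" store (my illustration, not Kudu's design):
# values are stored column by column, with a primary-key index so a row can
# be updated in human real-time rather than rewritten in bulk.
class ToyColumnarTable:
    def __init__(self, columns):
        self.columns = {name: [] for name in columns}   # one list per column
        self.pk_index = {}                               # primary key -> row position

    def upsert(self, pk, row):
        if pk in self.pk_index:                          # update an existing row in place
            pos = self.pk_index[pk]
            for name, value in row.items():
                self.columns[name][pos] = value
        else:                                            # append a new row
            self.pk_index[pk] = len(next(iter(self.columns.values())))
            for name in self.columns:
                self.columns[name].append(row[name])

    def scan(self, column):
        return self.columns[column]                      # a columnar scan touches one column

t = ToyColumnarTable(["id", "metric"])
t.upsert(1, {"id": 1, "metric": 10})
t.upsert(1, {"id": 1, "metric": 42})                     # the update is visible immediately
print(t.scan("metric"))                                  # [42]
```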
Kudu’s adoption and roll-out story starts: Read more
Are analytic RDBMS and data warehouse appliances obsolete?
I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:
- Many of the independent vendors were snapped up in acquisitions.
- None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
- Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
- Impala and other Hadoop-based alternatives are technology options.
- Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big. 🙂
Simply reciting all that, however, raises the question of whether one should still care about analytic RDBMS at all.
My answer, in a nutshell, is:
Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.
Which analytic technology problems are important to solve for whom?
I hear much discussion of shortfalls in analytic technology, especially from companies that want to fill in the gaps. But how much do these gaps actually matter? In many cases, that depends on what the analytic technology is being used for. So let’s think about some different kinds of analytic task, and where they each might most stress today’s available technology.
In separating out the task areas, I’ll focus first on the spectrum “To what extent is this supposed to produce novel insights?” and second on the dimension “To what extent is this supposed to be integrated into a production/operational system?” Issues of latency, algorithmic novelty, etc. can follow after those. In particular, let’s consider the tasks: Read more
Categories: Business intelligence, Data warehousing, Databricks, Spark and BDAS, Hadoop, Netezza, NoSQL, Predictive modeling and advanced analytics, Tableau Software | 1 Comment |
Is analytic data management finally headed for the cloud?
It seems reasonable to wonder whether analytic data management is headed for the cloud. In no particular order:
- Amazon Redshift appears to be prospering.
- So are some SaaS (Software as a Service) business intelligence vendors.
- Amazon Elastic MapReduce is still around.
- Snowflake Computing launched with a cloud strategy.
- Cazena, with vague intentions for cloud data warehousing, destealthed.*
- Cloudera made various cloud-related announcements.
- Data is increasingly machine-generated, and machine-generated data commonly originates off-premises.
- The general argument for cloud-or-at-least-colocation has compelling aspects.
- Analytic workloads can be “bursty”, and so could benefit from true cloud elasticity.
Categories: Amazon and its cloud, Cloud computing, Data warehousing, Netezza | 3 Comments |
Thoughts on SaaS
Generalizing about SaaS (Software as a Service) is hard. To prune some of the confusion, let’s start by noting:
- SaaS has been around for over half a century, and at times has been the dominant mode of application delivery.
- The term multi-tenancy is being used in several different ways.
- Multi-tenancy, in the purest sense, is inessential to SaaS. It’s simply an implementation choice that has certain benefits for the SaaS provider. And by the way, …
- … salesforce.com, the chief proponent of the theory that true multi-tenancy is the hallmark of true SaaS, abandoned that position this week.
- Internet-based services are commonly, if you squint a little, SaaS. Examples include but are hardly limited to Google, Twitter, Dropbox, Intuit, Amazon Web Services, and the company that hosts this blog (KnownHost).
- Some of the core arguments for SaaS’ rise, namely the various efficiencies of data center outsourcing and scale, apply equally to the public cloud, to SaaS, and to AEaaS (Anything Else as a Service).
- These benefits are particularly strong for inherently networked use cases. For example, you really don’t want to be hosting your website yourself. And salesforce.com got its start supporting salespeople who worked out of remote offices.
- In theory and occasionally in practice, certain SaaS benefits, namely the outsourcing of software maintenance and updates, could be enjoyed on-premises as well. Whether I think that could be a bigger deal going forward will be explored in future posts.
For smaller enterprises, the core outsourcing argument is compelling. How small? Well:
- What’s the minimum level of IT operations headcount needed for mission-critical systems? Let’s just say “several”.
- What does that cost? Fully burdened, somewhere in the six figures.
- What fraction of the IT budget should such headcount be? As low a double-digit percentage as possible.
- What fraction of revenues should be spent on IT? Some single-digit percentage.
So except for special cases, an enterprise with less than $100 million or so in revenue may have trouble affording on-site data processing, at least at a mission-critical level of robustness. It may well be better to use NetSuite or something like that, assuming needed features are available in SaaS form.*
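For what it’s worth, here’s a back-of-the-envelope version of that arithmetic in Python; all the specific figures are illustrative assumptions of mine, not numbers anybody gave me.

```python
# Back-of-the-envelope version of the argument above; every figure is an
# illustrative assumption, not a number from the post.
ops_headcount = 4                      # "several" people for mission-critical operations
cost_per_head = 150_000                # fully burdened, i.e. "somewhere in the six figures"
headcount_cost = ops_headcount * cost_per_head           # ~$600K

it_budget = headcount_cost / 0.15      # if that's ~15% (a low double-digit share) of IT spend
revenue_floor = it_budget / 0.04       # if IT is ~4% (a single-digit share) of revenue

print(f"Implied minimum revenue: ${revenue_floor:,.0f}")  # roughly $100 million
```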
How Revolution Analytics parallelizes R
I talked tonight with Lee Edlefsen, Chief Scientist of Revolution Analytics, and now think I understand Revolution’s parallel R much better than I did before.
There are four primary ways that people try to parallelize predictive modeling:
- They can run the same algorithm on different parts of a dataset on different nodes, then return all the results, and claim they’ve parallelized. This is trivial and not really a solution. It is also the last-ditch fallback position for those who parallelize more seriously.
- They can generate intermediate results from different parts of a dataset on different nodes, then generate and return a single final result. This is what Revolution does.
- They can parallelize the linear algebra that underlies so many algorithms. Netezza and Greenplum tried this, but I don’t think it worked out very well in either case. Lee cited a saying in statistical computing: “If you’re using matrices, you’re doing it wrong.” He thinks shortcuts and workarounds are almost always the better way to go.
- They can jack up the speed of inter-node communication, perhaps via MPI (Message Passing Interface), so that full parallelization isn’t needed. That’s SAS’ main approach.
One confusing aspect of this discussion is that it could reference several heavily-overlapping but not identical categories of algorithms, including:
- External memory algorithms, which operate on datasets too big to fit in main memory, by — for starters — reading in and working on a part of the data at a time. Lee observes that these are almost always parallelizable.
- What Revolution markets as External Memory Algorithms, which are those external memory algorithms it has gotten around to implementing so far. These are all parallelized. They are also all in the category of …
- … algorithms that can be parallelized by:
- Operating on data in parts.
- Getting intermediate results.
- Combining them in some way for a final result.
- Algorithms of the previous category where the combining step specifically takes the form of summation, such as those discussed in the famous paper Map-Reduce for Machine Learning on Multicore. Not all of Revolution’s current parallel algorithms fall into this group.
To be clear, all Revolution’s parallel algorithms are in Category #2 by definition and Category #3 in practice. However, they aren’t all in Category #4.
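Here’s a minimal sketch — in Python rather than R, and of my own devising rather than Revolution’s code — of the intermediate-results-then-combine pattern (Categories #3 and #4): each chunk yields small sufficient statistics, which are combined by summation into an exact final answer.

```python
# A minimal sketch (mine, not Revolution's) of Categories #3/#4: process the
# data a chunk at a time, keep only small intermediate results, and combine
# them by summation -- here, a mean and variance computed without ever holding
# the full dataset in memory.
def chunk_stats(chunk):
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))   # (n, sum, sum of squares)

def combine(stats):
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    variance = total_sq / n - mean ** 2                          # E[x^2] - (E[x])^2
    return mean, variance

chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]     # stand-ins for per-node data partitions
print(combine([chunk_stats(c) for c in chunks]))  # same answer as computing on all data at once
```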
Categories: Greenplum, Hadoop, MapReduce, Netezza, Parallelization, Predictive modeling and advanced analytics, Revolution Analytics, Teradata | Leave a Comment |
RDBMS and their bundle-mates
Relational DBMS used to be fairly straightforward product suites, which boiled down to:
- A big SQL interpreter.
- A bunch of administrative and operational tools.
- Some very optional add-ons, often including an application development tool.
Now, however, most RDBMS are sold as part of something bigger.
- Oracle has hugely thickened its stack, as part of an Innovator’s Solution strategy — hardware, middleware, applications, business intelligence, and more.
- IBM has moved aggressively to a bundled “appliance” strategy. Even before that, IBM DB2 long sold much better to committed IBM accounts than as a software-only offering.
- Microsoft SQL Server is part of a stack, starting with the Windows operating system.
- Sybase was an exception to this rule, with thin(ner) stacks for both Adaptive Server Enterprise and Sybase IQ. But Sybase is now owned by SAP, and increasingly integrated as a business with …
- … SAP HANA, which is closely associated with SAP’s applications.
- Teradata has always been a hardware/software vendor. The most successful of its analytic DBMS rivals, in some order, are:
- Netezza, a pure appliance vendor, now part of IBM.
- Greenplum, an appliance-mainly vendor for most (not all) of its existence, and in particular now as a part of EMC Pivotal.
- Vertica, more of a software-only vendor than the others, but now owned by and increasingly mainstreamed into hardware vendor HP.
- MySQL’s glory years were as part of the “LAMP” stack.
- Various thin-stack RDBMS that once were or could have been important market players … aren’t. Examples include Progress OpenEdge, IBM Informix, and the various strays adopted by Actian.
“Disruption” in the software industry
I lampoon the word “disruptive” for being badly overused. On the other hand, I often refer to the concept myself. Perhaps I should clarify. 🙂
You probably know that the modern concept of disruption comes from Clayton Christensen, specifically in The Innovator’s Dilemma and its sequel, The Innovator’s Solution. The basic ideas are:
- Market leaders serve high-end customers with complex, high-end products and services, often distributed through a costly sales channel.
- Upstarts serve a different market segment, often cheaply and/or simply, perhaps with a different business model (e.g. a different sales channel).
- Upstarts expand their offerings, and eventually attack the leaders in their core markets.
In response (this is the Innovator’s Solution part):
- Leaders expand their product lines, increasing the value of their offerings in their core markets.
- In particular, leaders expand into adjacent market segments, capturing margins and value even if their historical core businesses are commoditized.
- Leaders may also diversify into direct competition with the upstarts, but that generally works only if it’s via a separate division, perhaps acquired, that has permission to compete hard with the main business.
But not all cleverness is “disruption”.
- Routine product advancement by leaders — even when it’s admirably clever — is “sustaining” innovation, as opposed to the disruptive stuff.
- Innovative new technology from small companies is not, in itself, disruption either.
Here are some of the examples that make me think of the whole subject. Read more
Data skipping
Way back in 2006, I wrote about a cool Netezza feature called the zone map, which in essence allows you to do partition elimination even in the absence of strict range partitioning.
Netezza’s substitute for range partitioning is very simple. Netezza features “zone maps,” which note the minimum and maximum value of each column (where such concepts are meaningful) in each extent. This can amount to effective range partitioning over dates; if data is added over time, there’s a good chance that the data in any particular date range is clustered, and a zone map lets you pick out which data falls in the desired date range.
I further wrote:
… that seems to be the primary scenario in which zone maps confer a large benefit.
But I now think that part was too pessimistic. For example, in bulk load scenarios, it’s easy to imagine ways in which data can be clustered or skewed. And in such cases, zone maps can let you skip a large fraction of potential I/O.
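Here’s a minimal sketch of the zone map idea — my own illustration, not Netezza’s or anybody else’s implementation: keep a min/max per extent, and skip any extent whose range cannot contain qualifying rows.

```python
# A minimal sketch of zone maps (my illustration, not Netezza's or BLU's code):
# keep min/max per storage extent, and skip any extent whose range cannot
# contain rows matching the predicate.
extents = [
    {"min": "2006-01-01", "max": "2006-03-31", "rows": ["...Q1 rows..."]},
    {"min": "2006-04-01", "max": "2006-06-30", "rows": ["...Q2 rows..."]},
    {"min": "2006-07-01", "max": "2006-09-30", "rows": ["...Q3 rows..."]},
]

def scan(extents, lo, hi):
    for ext in extents:
        if ext["max"] < lo or ext["min"] > hi:   # zone map says: no qualifying rows here
            continue                             # skip the I/O entirely
        yield ext["rows"]                        # otherwise read and filter the extent

# A query on May dates touches only the Q2 extent.
print(list(scan(extents, "2006-05-01", "2006-05-31")))
```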
Over the years I’ve said that other things were reminiscent of Netezza zone maps, e.g. features of Infobright, SenSage, InfiniDB and even Microsoft SQL Server. But truth be told, when I actually use the phrase “zone map”, people usually give me a blank look.
In a recent briefing about BLU, IBM introduced me to a better term — data skipping. I like it and, unless somebody comes up with a good reason not to, I plan to start using it myself. 🙂
Categories: Data warehousing, IBM and DB2, Netezza, Theory and architecture | 12 Comments |
DBMS development and other subjects
The cardinal rules of DBMS development
Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1.
In particular:
- Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
- Mixed workload management is harder than you’re assuming it is.
- Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
DBMS with Hadoop underpinnings …
… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact. Read more