Netezza
Analysis of Netezza and its data warehouse appliances. Related subjects include:
Our clients, and where they are located
From time to time, I disclose our vendor client lists. Another iteration is below, the first since a little over a year ago. To be clear:
- This is a list of Monash Advantage members.
- All our vendor clients are Monash Advantage members, unless …
- … we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
- We do not usually disclose our user clients.
- We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
- Excluded from this round of disclosure is one vendor I have never written about.
- Included in this round of disclosure is one client paying for services partly in stock. All our other clients are cash-only.
For reasons explained below, I’ll group the clients geographically. Obviously, companies often have multiple locations, but this is approximately how it works from the standpoint of their interactions with me. Read more
Juggling analytic databases
I’d like to survey a few related ideas:
- Enterprises should each have a variety of different analytic data stores.
- Vendors — especially but not only IBM and Teradata — are acknowledging and marketing around the point that enterprises should each have a number of different analytic data stores.
- In addition to having multiple analytic data management technology stacks, it is also desirable to have an agile way to spin out multiple virtual or physical relational data marts using a single RDBMS. Vendors are addressing that need.
- Some observers think that the real essence of analytic data management will be in data integration, not the actual data management.
Here goes. Read more
Hope for a new PostgreSQL era?
In a comedy of briefing errors, I’m not too clear on the details of my client salesforce.com’s new PostgreSQL-as-a-service offering, nor exactly on what my clients at VMware are bringing to the PostgreSQL virtualization/cloud party. That said:
- PostgreSQL is good technology.
- MySQL is narrowing the gap, but PostgreSQL is still ahead of MySQL in some ways. (Database extensibility if nothing else.)
- PostgreSQL has a lot of users. (Many of them in academia and/or Russia.)
- Neither EnterpriseDB (which now calls itself “The enterprise PostgreSQL company”) nor the PostgreSQL community leadership have covered themselves with stewardship glory.
- A significant number of interesting DBMS products can be regarded as PostgreSQL forks (e.g. Greenplum, Aster Data nCluster, Netezza if you squint, and Vertica if you stand on your head*).
- PostgreSQL advancement is not dead. For example, Hadapt beta users are running actual PostgreSQL on many nodes each.
- There’s no assurance that Oracle will be a benevolent MySQL steward forever. (Specifically, Oracle’s “Play nicely with others” antitrust commitments expire in 2014.)
So I think it would be cool if one or the other big company put significant wood behind the PostgreSQL arrow.
*While Vertica was originally released using little or no PostgreSQL code — reports varied — it featured high degrees of PostgreSQL compatibility.
Categories: Aster Data, EnterpriseDB and Postgres Plus, Greenplum, MySQL, Netezza, Open source, salesforce.com, Vertica Systems | 8 Comments |
Some big-vendor execution questions, and why they matter
When I drafted a list of key analytics-sector issues in honor of look-ahead season, the first item was “execution of various big vendors’ ambitious initiatives”. By “execute” I mean mainly:
- “Deliver products that really meet customers’ desires and needs.”
- “Successfully convince them that you’re doing so …”
- “… at an attractive overall cost.”
Vendors mentioned here are Oracle, SAP, HP, and IBM. Anybody smaller got left out due to the length of this post. Among the bigger omissions were:
- salesforce.com (multiple subjects).
- SAS HPA.
- The evolution of Hadoop.
Remote machine-generated data
I refer often to machine-generated data, which is commonly generated inexpensively and in log-like formats, and is often best aggregated in a big bit bucket before you try to do much analysis on it. The term has caught on, to the point that perhaps it’s time to distinguish more carefully among different kinds of machine-generated data. In particular, I think it may be useful to distinguish between:
- Log-stream machine-generated data, when what you’re looking at — at least initially — is the entire output of verbose logging systems.
- Remote machine-generated data.
Here’s what I’m thinking of for the second category. I rather frequently hear of cases in which data is generated by large numbers of remote machines, which occasionally send messages home. For example: Read more
Categories: Analytic technologies, Cloud computing, Log analysis, MySQL, Netezza, Splunk, Truviso | 2 Comments |
Eight kinds of analytic database (Part 2)
In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, with a use case/product short list match that is even less clear. Read more
Eight kinds of analytic database (Part 1)
Analytic data management technology has blossomed, leading to many questions along the lines of “So which products should I use for which category of problem?” The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for “big data” is little help.
Let’s try eight categories instead. While no categorization is ever perfect, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need — and in most cases you’ll need several — is a great early step in your analytic technology planning. Read more
It’s official — the grand central EDW will never happen
I pointed out last year that the grand central enterprise data warehouse couldn’t happen; the post started:
An enterprise data warehouse should:
- Manage data to high standards of accuracy, consistency, cleanliness, clarity, and security.
- Manage all the data in your organization.
Pick ONE.
IBM’s main theme at the Enzee Universe conference has been to say the same thing.
Merv Adrian’s talk at the same conference made it clear that Gartner feels the same way, as does he personally. Indeed, like me, he’s racked up multiple decades of industry experience without ever finding a single theoretically ideal grand central EDW.
Forrester Research has been a little less clear on the point, but generally seems to be on the correct side of the issue as well.
If somebody is still saying that one central enterprise data warehouse can hold all the information or data you need on which to base your business decisions, they’re probably not somebody you should be listening to very hard.
Is that clear, or should I hammer home the point even harder? 😀
Categories: Data warehousing, IBM and DB2, Netezza | 8 Comments |
Investigative analytics and derived data: Enzee Universe 2011 talk
I’ll be speaking Monday, June 20 at IBM Netezza’s Enzee Universe conference. Thus, as is my custom:
- I’m posting draft slides.
- I’m encouraging comment (especially in the short time window before I have to actually give the talk).
- I’m offering links below to more detail on various subjects covered in the talk.
The talk concept started out as “advanced analytics” (as opposed to fast query, a subject amply covered in the rest of any Netezza event), as a lunch break in what is otherwise a detailed “best practices” session. So I suggested we constrain the subject by focusing on a specific application area — customer acquisition and retention, something of importance to almost any enterprise, and which exploits most areas of analytic technology. Then I actually prepared the slides — and guess what? The mix of subjects will be skewed somewhat more toward generalities than I first intended, specifically in the areas of investigative analytics and derived data. And, as always when I speak, I’ll try to raise consciousness about the issues of liberty and privacy, our options as a society for addressing them, and the crucial role we play as an industry in helping policymakers deal with these technologically-intense subjects.
Slide 3 refers back to a post I made last December, saying there are six useful things you can do with analytic technology:
- Operational BI/Analytically-infused operational apps: You can make an immediate decision.
- Planning and budgeting: You can plan in support of future decisions.
- Investigative analytics (multiple disciplines): You can research, investigate, and analyze in support of future decisions.
- Business intelligence: You can monitor what’s going on, to see when it necessary to decide, plan, or investigate.
- More BI: You can communicate, to help other people and organizations do these same things.
- DBMS, ETL, and other “platform” technologies: You can provide support, in technology or data gathering, for one of the other functions.
Slide 4 observes that investigative analytics:
- Is the most rapidly advancing of the six areas …
- … because it most directly exploits performance & scalability.
Slide 5 gives my simplest overview of investigative analytics technology to date: Read more
Alternatives for Hadoop/MapReduce data storage and management
There’s been a flurry of announcements recently in the Hadoop world. Much of it has been concentrated on Hadoop data storage and management. This is understandable, since HDFS (Hadoop Distributed File System) is quite a young (i.e. immature) system, with much strengthening and Bottleneck Whack-A-Mole remaining in its future.
Known HDFS and Hadoop data storage and management issues include but are not limited to:
- Hadoop is run by a master node, and specifically a namenode, that’s a single point of failure.
- HDFS compression could be better.
- HDFS likes to store three copies of everything, whereas many DBMS and file systems are satisfied with two.
- Hive (the canonical way to do SQL joins and so on in Hadoop) is slow.
Different entities have different ideas about how such deficiencies should be addressed. Read more
Categories: Aster Data, Cassandra, Cloudera, Data warehouse appliances, DataStax, EMC, Greenplum, Hadapt, Hadoop, IBM and DB2, MapReduce, MongoDB, Netezza, Parallelization | 22 Comments |