Introduction to Aster Data and nCluster
I’ve been writing a lot about Greenplum since a recent visit. But on the same trip I met with Aster Data, and have talked with them further since. Let me now redress the balance and outline some highlights of the Aster Data story.
The basics include:
- Aster Data is a row-based MPP shared-nothing analytic DBMS vendor.
- Aster Data’s main product name is Aster nCluster.
- There’s one Aster nCluster customer with >100 terabytes of user data: MySpace, which has been in production for almost a year. There are a few other customers; all have around a “couple” of terabytes.
- MySpace has 360 terabytes of spinning disk, which suggests the implementation there might have one level less of redundancy than some other systems do.
- The main use of nCluster to date is clickstream analysis, notably for user segmentation, and related applications. Aster attributes this in part to the fact that its sales efforts have just been local – i.e., in the Bay Area.
- Aster Data’s best concurrency proof point is that Aggregate Knowledge, the other named customer, runs about 100 simultaneous queries hourly, in connection with the loading of data.
- Aster Data was one of two MPP analytic DBMS vendors to announce MapReduce support last week.
- nCluster is based on PostgreSQL. (So says other reporting. I actually forgot to ask about that point, and will circle back to it later.)
- nCluster’s tool support includes MicroStrategy, Business Objects (working on certification), Pentaho (being installed), and SAS (a project “going on”).
- nCluster has no compression yet, but Aster is working on fixing that.
- The Aster Data guys built their prototype and got angel funding in 2005. They got venture funding from Sequoia (sole VC) about a year ago.
Also interesting is Aster’s approach to parallel query, which is very focused on reducing the amount of data that needs to be moved around.
- nCluster tries to place data where it will be consumed. I.e., it aggressively creates disk caches of data that is also on other disks. These are true caches, in that the data persists across queries and is held consistent. The system automagically determines what goes into caches, without specific DBA intervention. This caching goes at least a little beyond the standard practice in other systems of replicating small dimension tables from node to node.
- nCluster tries to do Group Bys and associated aggregations before joins, which is sometimes possible (e.g., in calculating how much money one customer has spent). When successful, this strategy can reduce the amount of data that needs to be shipped around.
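To make the Group-By-before-join idea concrete, here is a minimal sketch in Python. It is not Aster's code; the node names, tables, and function names are all made up for illustration. It just counts how many rows would have to leave the worker nodes under each strategy for a "total spend per customer" query.

```python
from collections import defaultdict

# Hypothetical data: each worker node holds a slice of an orders table
# as (customer_id, amount) rows. Names and values are illustrative only.
orders_by_node = {
    "node1": [(1, 10.0), (1, 7.5), (2, 5.0), (2, 5.0)],
    "node2": [(1, 2.5), (3, 20.0), (3, 4.0), (1, 0.5)],
}
customers = {1: "Ann", 2: "Bob", 3: "Cy"}


def join_then_aggregate(orders_by_node, customers):
    """Ship every order row off its node, join to customers, then sum."""
    shipped = [row for rows in orders_by_node.values() for row in rows]
    totals = defaultdict(float)
    for cust_id, amount in shipped:
        totals[customers[cust_id]] += amount
    return dict(totals), len(shipped)


def aggregate_then_join(orders_by_node, customers):
    """Sum per customer on each node first, ship only the partial sums, then join."""
    shipped = []
    for rows in orders_by_node.values():
        partial = defaultdict(float)
        for cust_id, amount in rows:
            partial[cust_id] += amount
        shipped.extend(partial.items())  # one row per customer per node
    totals = defaultdict(float)
    for cust_id, subtotal in shipped:
        totals[customers[cust_id]] += subtotal
    return dict(totals), len(shipped)


totals_a, rows_a = join_then_aggregate(orders_by_node, customers)
totals_b, rows_b = aggregate_then_join(orders_by_node, customers)
print(totals_a == totals_b, rows_a, rows_b)
```

Both strategies return the same totals, but the second ships one row per customer per node rather than one row per order. The rewrite is only valid when the aggregate can be split into partial results this way (sums and counts can), which is presumably part of why it is only "sometimes possible."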
And to answer a question that really should be asked of all MPP DBMS vendors – when the query executor breaks queries into multiple parts, some of the parts can be primitives rather than just more SQL.
Aster Data also has a pretty interesting story about MPP manageability, based on what seems to be a fair amount of autonomic computing. In particular, you can plug in bare metal – without even an operating system – and the system will install and incorporate it. All this happens in 30 minutes. Even if a node goes down, failover is handled so automagically that queries don’t fail. (Of course, there’s a performance blip.) Backup and bulk data transfer/loading are both parallel and incremental. The system does not use any empty hot standbys. (That said, if Aster’s evolution parallels other vendors’, hot spare disks may eventually show up in the architecture.)
There are more parts of the Aster Data story I want to write about, namely node heterogeneity and MapReduce syntax, but for now I’ll stop here and post this. I’d also like to point you at the Aster Data blog, which is remarkable in its level of architectural detail.
Comments
4 Responses to “Introduction to Aster Data and nCluster”
Hi Curt,
Thanks for the post. Just a couple points for clarification:
[1] At MySpace, every piece of data has 2 copies on distinct nodes. More specifically, MySpace, like our other customers, uses RAID 0 on the Aster Worker nodes and RAID 10 on the Aster Queen nodes. [more on our 3-tiered architecture here: (http://www.asterdata.com/product/architecture.html)] Our recommendation is to always use RAID 0 on the workers, because it gives you better performance when a disk fails: with RAID 10, if a disk fails, the node stays available, but the performance of that node drops by 50% (and, thus, so does the performance of the cluster). Because we have full replication and transparent failover, in Aster nCluster if a disk fails, the entire node goes down, but nCluster’s performance only drops by 1/n (where n is the number of nodes).
[2] re: “parallel query” – The local GROUP BY is just one example; our query optimization algorithms cover the relational algebra and not just that 1 case.
What is their pricing model? Per terabyte? And how much does it cost?
[…] Data’s largest disclosed database, by almost two orders of magnitude, is at […]
personalization server old code and architecture review…
(form SVN) (see part list at…