Parallelization

Analysis of issues in parallel computing, especially parallelized database management. Related subjects include:

October 11, 2010

NoSQL overview

My NoSQL article is finally posted; I hope it lives up to all the foreshadowing. It is being run online at Intelligent Enterprise/Information Week, as per the link above, where Doug Henschen edited it with an admirably light touch.

Below please find three excerpts* that convey the essence of my thinking on NoSQL. For much more detail, please see the article itself.

*Notwithstanding my admiration for Doug’s editing, the excerpts are taken from my final pre-editing submission, not from the published article itself.

My quasi-definition of “NoSQL” wound up being: Read more

Categories: Database diversity, NoSQL, Parallelization

18 Comments

October 10, 2010

Quick introduction to Schooner Information Technology appliances

Back in August I talked with John Busch of Schooner Information Technology, which has a non-obvious URL. Schooner Information Technology sells Flash-based appliances that are mainly intended to run MySQL with blazing write performance.

This is one of those cases in which I warned that due to my September wave of family health issues I would cut a few blogging corners, so:

I’m only going to write about the MySQL aspect, even though Schooner has a memcached product and claims to be able to run other NoSQL stuff as well.
I’m not going to dig for company information beyond recalling:
- Schooner said that it has invested $20 million in R&D.
- Schooner’s appliances are resold by IBM.
- Schooner also has a direct sales force.
- One flagship customer had 30 TB of data on 17 Schooner nodes.

If Schooner wants to add some of what I’ve left out into the comments to this post, that would be great.

Schooner appliances are meant to be clustered, Read more

Categories: memcached, MySQL, OLTP, Parallelization, Schooner Information Technology, Solid-state memory

4 Comments

October 10, 2010

A few notes from XLDB 4

As much as I believe in the XLDB conferences, I only found time to go to (a big) part of one day of XLDB 4 myself. In general: Read more

Categories: Analytic technologies, Health care, Michael Stonebraker, MySQL, Open source, Parallelization, Petabyte-scale data management, Scientific research, Surveillance and privacy

2 Comments

October 10, 2010

Partnering with Cloudera

After I criticized the marketing of the Aster/Cloudera partnership, my clients at Aster Data and Cloudera ganged up on me and tried to persuade me I was wrong. Be that as it may, that conversation and others were helpful to me in understanding the core thesis: Read more

Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Database diversity, Hadoop, MapReduce, Parallelization, Petabyte-scale data management

11 Comments

October 10, 2010

EMC/Greenplum notes

I dropped by the former Greenplum for my quarterly consulting visit (scheduled for the first week of Q4 for a couple of reasons, one of them XLDB4). Much of what we discussed was purely advisory and/or confidential — duh! — but there were real, nonconfidential takeaways in two areas.

First, feelings about the EMC acquisition are still very positive.

Hiring has been rapid, on track to roughly quadruple Greenplum’s size over a 1 1/2 year period. These don’t seem to be EMC imports, but rather outside hires, although EMC folks are surely helping in the recruiting.
The former Greenplum is clearly going to pursue more product possibilities than it would have on its own. This augurs well for Greenplum customers.
Griping about big-company bureaucracy is minimal.
I didn’t hear one word about any unwelcome product/business strategy constraints. On the other hand …
… the next Greenplum product announcement you’ll hear about will be one designed to be appealing to the EMC customer base — i.e., to enterprises that EMC is generally successful in selling to.

Categories: Data warehousing, EMC, Greenplum, MapReduce, Parallelization, Predictive modeling and advanced analytics

4 Comments

August 26, 2010

More on NoSQL and HVSP (or OLRP)

Since posting last Wednesday morning that I’m looking into NoSQL and HVSP, I’ve had a lot of conversations, including with (among others):

Dwight Merriman of 10gen (MongoDB)
Damien Katz of Couchio (CouchDB)
Matt Pfeil of Riptano (Cassandra)
Todd Lipcon of Cloudera (HBase committer)
Tony Falco of Basho (Riak)
John Busch of Schooner
Ori Herrnstadt of Akiban

Categories: Akiban, Basho and Riak, Cache, Cassandra, Cloudera, Clustrix, CouchDB, DataStax, Facebook, Hadoop, HBase, memcached, MySQL, NewSQL, NoSQL, Object, OLTP, Open source, Parallelization, Schooner Information Technology, Theory and architecture, Tokutek and TokuDB

3 Comments

August 21, 2010

The substance of Pentaho’s Hadoop strategy

Pentaho has been talking about a Hadoop-related strategy. Unfortunately, in support of its Hadoop efforts, Pentaho has been — quite insistently — saying things that don’t make a lot of sense to people who know anything about Hadoop.

That said, I think I found four sensible points in Pentaho’s Hadoop strategy, namely:

If you use an ETL tool like Pentaho’s to move things in and out of HDFS, you may be able to orchestrate two more steps in the ETL process than if you used Hadoop’s native orchestration tools.
A lot of what you want to do in MapReduce is things that can be graphically specified in an ETL tool like Pentaho’s. (That would include tokenization or regex.)
If you have some really lightweight BI requirements (ad hoc, reporting, or whatever) against HDFS data, you might be content to do it straight against HDFS, rather than moving the data into a real DBMS. If so, BI tools like Pentaho’s might be useful.
Somebody might want to use a screwy version of MapReduce, where by “screwy” I mean anything that isn’t Cloudera Enterprise, Aster Data SQL/MapReduce, or some other implementation/distribution with a lot of supporting tools. In that case, they might need all the tools they can get.

The first of those points is, in the grand scheme of things, pretty trivial.

The third one makes sense. While Hadoop’s Hive client means you could roll your own integration with your own favorite BI tool in any case, having somebody certify it for you themselves could be nice. So if Pentaho ships something that works before other vendors do, good on them. (Target date seems to be October.)

The fourth one is kind of sad.

But if there’s any shovel-meet-pony aspect to all this — or indeed a reason for writing this blog post — it would be the second point. If one understands data management, but is in the “Oh no! Hadoop wants me to PROGRAM!” crowd, then being able to specify one’s MapReduce might be a really nice alternative versus having to actually code it.

Categories: Analytic technologies, Business intelligence, Hadoop, MapReduce, Parallelization, Pentaho

10 Comments

August 18, 2010

I’m collecting data points on NoSQL and HVSP adoption

I was asked to do a magazine article on NoSQL, where by “NoSQL” is meant “whatever they talk about at NoSQL conferences.” By now the number of publications planning to run the article is up to 2, the deadline is next week and, crucially, it has been agreed that I may talk about HVSP in general, NoSQL and SQL alike.

It also is understood that, realistically, I can’t be expected to know and mention the very latest news for all the many products in the categories. Even so, I think this would be fine time to check just where NoSQL and HVSP adoption stand. Here is most of what I know, or links to same; it would be great if you guys would contribute additional data in the comment thread.

In the NoSQL area: Read more

Categories: Akiban, Cassandra, Clustering, Clustrix, Couchbase, dbShards and CodeFutures, Facebook, Groovy Corporation, NewSQL, NoSQL, OLTP, Parallelization, ScaleDB, Specific users, VoltDB and H-Store, Zynga

18 Comments

August 18, 2010

Finally confirmed: Membase has a reasonable product roadmap

On my recent trip to California, neither I nor my clients at Northscale covered ourselves in meeting-arranging glory. Still, from the rushed 30 minute meeting we did wind up having, I finally came away feeling good about Membase’s product direction.

To review, Membase is a reasonably elastic persistent data store, sporting the memcached API, making memcached/Membase an attractive alternative to memcached/sharded MySQL. As of now, Membase is a pure key-value store.

Northscale defends pure key-value stores by arguing, in effect: Read more

Categories: Couchbase, memcached, NoSQL, Parallelization

5 Comments

August 11, 2010