MapReduce

Analysis of implementations of and issues associated with the parallel programming framework MapReduce. Related subjects include:

October 10, 2010

Partnering with Cloudera

After I criticized the marketing of the Aster/Cloudera partnership, my clients at Aster Data and Cloudera ganged up on me and tried to persuade me I was wrong. Be that as it may, that conversation and others were helpful to me in understanding the core thesis: Read more

Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Database diversity, Hadoop, MapReduce, Parallelization, Petabyte-scale data management

11 Comments

October 10, 2010

EMC/Greenplum notes

I dropped by the former Greenplum for my quarterly consulting visit (scheduled for the first week of Q4 for a couple of reasons, one of them XLDB4). Much of what we discussed was purely advisory and/or confidential — duh! — but there were real, nonconfidential takeaways in two areas.

First, feelings about the EMC acquisition are still very positive.

Hiring has been rapid, on track to roughly quadruple Greenplum’s size over a 1 1/2 year period. These don’t seem to be EMC imports, but rather outside hires, although EMC folks are surely helping in the recruiting.
The former Greenplum is clearly going to pursue more product possibilities than it would have on its own. This augurs well for Greenplum customers.
Griping about big-company bureaucracy is minimal.
I didn’t hear one word about any unwelcome product/business strategy constraints. On the other hand …
… the next Greenplum product announcement you’ll hear about will be one designed to be appealing to the EMC customer base — i.e., to enterprises that EMC is generally successful in selling to.

Categories: Data warehousing, EMC, Greenplum, MapReduce, Parallelization, Predictive modeling and advanced analytics

4 Comments

August 21, 2010

The substance of Pentaho’s Hadoop strategy

Pentaho has been talking about a Hadoop-related strategy. Unfortunately, in support of its Hadoop efforts, Pentaho has been — quite insistently — saying things that don’t make a lot of sense to people who know anything about Hadoop.

That said, I think I found four sensible points in Pentaho’s Hadoop strategy, namely:

If you use an ETL tool like Pentaho’s to move things in and out of HDFS, you may be able to orchestrate two more steps in the ETL process than if you used Hadoop’s native orchestration tools.
A lot of what you want to do in MapReduce can be graphically specified in an ETL tool like Pentaho’s. (That would include tokenization or regex.)
If you have some really lightweight BI requirements (ad hoc, reporting, or whatever) against HDFS data, you might be content to do it straight against HDFS, rather than moving the data into a real DBMS. If so, BI tools like Pentaho’s might be useful.
Somebody might want to use a screwy version of MapReduce, where by “screwy” I mean anything that isn’t Cloudera Enterprise, Aster Data SQL/MapReduce, or some other implementation/distribution with a lot of supporting tools. In that case, they might need all the tools they can get.

The first of those points is, in the grand scheme of things, pretty trivial.

The third one makes sense. While Hadoop’s Hive client means you could roll your own integration with your own favorite BI tool in any case, having a vendor certify the integration for you could be nice. So if Pentaho ships something that works before other vendors do, good on them. (Target date seems to be October.)

The fourth one is kind of sad.

But if there’s any shovel-meet-pony aspect to all this, or indeed a reason for writing this blog post, it would be the second point. If one understands data management, but is in the “Oh no! Hadoop wants me to PROGRAM!” crowd, then being able to specify one’s MapReduce graphically might be a really nice alternative to having to actually code it.
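
For a sense of what "actually coding it" involves, here is a minimal sketch of a tokenize-and-count job written as explicit map and reduce steps. It is plain Python run in a single process, not Hadoop's or Pentaho's API; the names `map_tokens`, `shuffle`, `reduce_counts` and `run_job` are hypothetical, and the regex is just one plausible tokenization rule.

```python
import re
from collections import defaultdict

# Assumed tokenization rule: runs of lowercase letters, digits, or apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def map_tokens(line):
    """Map step: emit a (token, 1) pair for every regex match in one line."""
    for token in TOKEN_RE.findall(line.lower()):
        yield token, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse all the 1s emitted for a token into a count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_tokens(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["Hadoop wants me to PROGRAM", "Oh no! Hadoop"]
    print(run_job(sample))
```

Even this toy version needs three functions and some plumbing for what a graphical tool would likely present as a "tokenize" box wired to a "count" box, which is roughly the appeal of the second point.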

Categories: Analytic technologies, Business intelligence, Hadoop, MapReduce, Parallelization, Pentaho

10 Comments

August 11, 2010

Big Data is Watching You!

There’s a boom in large-scale analytics. The subjects of this analysis may be categorized as:

People
Financial trades
Electronic networks
Everything else

The most varied, interesting, and valuable of those four categories is the first one.

Categories: Aster Data, Data warehousing, Investment research and trading, Log analysis, MapReduce, Predictive modeling and advanced analytics, RDF and graphs, Specific users, Surveillance and privacy, Telecommunications, Web analytics

6 Comments

July 23, 2010

Some interesting links

In no particular order: Read more

Categories: Business intelligence, EnterpriseDB and Postgres Plus, Fun stuff, Hadoop, Humor, In-memory DBMS, MapReduce, Memory-centric data management, Open source, Oracle, SAP AG

2 Comments

June 30, 2010

Cloudera Enterprise and Hadoop evolution

I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I’d say: Read more

Categories: Cloudera, Data integration and middleware, EAI, EII, ETL, ELT, ETLT, eBay, Hadoop, Investment research and trading, MapReduce, Market share and customer counts, Petabyte-scale data management, Pricing, Specific users, Web analytics

7 Comments

May 7, 2010

Clarifying the state of MPP in-database SAS

I routinely am briefed way in advance of products’ introductions. For that reason and others, it can be hard for me to keep straight what’s been officially announced, introduced for test, introduced for general availability, vaguely planned for the indefinite future, and so on. Perhaps nothing has confused me more in that regard than the SAS Institute’s multi-year effort to get SAS integrated into various MPP DBMS, specifically Teradata, Netezza Twinfin(i), and Aster Data nCluster.

However, I chatted briefly Thursday with Michelle Wilkie, who is the SAS product manager overseeing all this (and also some other stuff, like SAS running on grids without being integrated into a DBMS). As best I understood, the story is: Read more

Categories: Aster Data, Data warehouse appliances, MapReduce, Netezza, Parallelization, Predictive modeling and advanced analytics, SAS Institute, Specific users, Teradata

11 Comments

April 18, 2010

Aster Data’s mapreduce.org site

Aster Data has started a site mapreduce.org, which purports to compile “the best information about MapReduce.” At the moment, mapreduce.org highlights include:

A feed of MapReduce-related posts from several blogs, including this one.
A calendar of MapReduce-related events, not necessarily Aster-specific, integrated with a feed combining …
- … Aster MapReduce-related press releases and also …
- … not necessarily Aster-specific MapReduce-related press articles.
Links to a lot of Aster Data MapReduce-related collateral. Some of that stuff is quite good.*
A sycophantic introduction from Colin White praising the value of the mapreduce.org “independent forum.”

*I did a couple of MapReduce-related webinars for Aster late last year. 🙂 But seriously — Aster does a good job of writing clear and informative collateral.

Categories: Analytic technologies, Aster Data, MapReduce

3 Comments

April 16, 2010

Introduction to Datameer

Elder care issues have flared up with a vengeance, so I’m not going to be blogging much for a while, and surely not at any length. That said, my first post about Datameer was never going to be very long, so let’s get right to it:

Datameer offers a business intelligence and analytics stack that runs on any distribution of Hadoop.
Datameer is still building a lot of features that it talks about, for target release in (I think) the fall.
Datameer’s pride and joy is its user interface. Very laudably for a software start-up, Datameer claims to have spent considerable time with professional user interface designers.
Datameer’s core user interface metaphor is formula definition via a spreadsheet.
Datameer includes 124 functions one can use in these formulae, ranging from math stuff to text tokenization.
Datameer does some straight BI, with 4 kinds of “visualization” headed for 20 kinds later. But if you want to do hard-core BI, use Datameer to dump data into an RDBMS and then use the BI tool of your choice. (Datameer’s messaging does tend to obscure or even contradict that point.)
Rather, Datameer seems to be designed for the classic MapReduce use cases of ETL and heavy data crunching.
Datameer’s messaging includes a bit about “Datameer is real-time, even though Hadoop is generally thought of as batch.” So far as I can tell, what that boils down to is …
… Datameer will let you examine sample and/or partial query results before a full Hadoop run is over. Apparently, there are three different ways Datameer lets you do this:
- You can truly query against a sample of the data set.
- You can query against intermediate results, when only some stages of the Hadoop process have already been run.
- You can drill down into a “distributed index,” whatever the heck that means when Datameer says it.
Datameer will let you import data from 15 or so different kinds of sources, SQL, NoSQL, and file system alike.