April 22, 2009

Clearing some of my buffer

I have a large number of posts still in backlog.  For starters, there are ones based on recent visits with Aster, Greenplum, Sybase, Vertica, and a Very Large User.  I suspect I’ll write more soon on Oracle as well.  Plus there’s my whole future-of-online-media area.  And quite a bit more will grow out of planned research.

So there are a whole lot of other worthy subjects I doubt I’ll be getting to any time soon.  In some cases, of course, other people are doing great jobs of writing about same. Here are a few links that I am glad to recommend:

April 15, 2009

Cloudera presents the MapReduce bull case

Monday was fire-drill day regarding MapReduce vs. MPP relational DBMS. The upshot was that I was quoted in Computerworld and paraphrased in GigaOm as being a little more negative on MapReduce than I really am, in line with my comment:

Frankly, my views on MapReduce are more balanced than [my] weary negativity would seem to imply.

Tuesday afternoon the dial turned a couple of notches more positive still, when I talked with Michael Olson and Jeff Hammerbacher of Cloudera. Cloudera is a new company, built around the open source MapReduce implementation Hadoop. So far Cloudera gives away its Hadoop distribution, without charging for any sort of maintenance or subscription, and just gets revenue from professional services. Presumably, Cloudera plans for this business model to change down the road.

Much of our discussion revolved around Facebook, where Jeff directed a huge and diverse Hadoop effort. Apparently, Hadoop played much of the role of an enterprise data warehouse at Facebook — at least for clickstream/network data — including:

Some Facebook data, however, was put into an Oracle RAC cluster for business intelligence. And Jeff does concede that query execution is slower in Hadoop than in a relational DBMS. Hadoop was also used to build the index for Facebook’s custom text search engine.
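
For concreteness, here is a minimal sketch of the kind of clickstream aggregation job involved, written as a Hadoop Streaming mapper and reducer in Python. The input format (tab-separated user id, page URL, timestamp) is a hypothetical example, not Facebook's actual schema.

    # mapper.py -- emits one count per page view; Hadoop Streaming
    # feeds raw input lines on stdin and sorts our output by key
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:                  # skip malformed records
            print(fields[1] + "\t1")          # key: page URL, value: 1

    # reducer.py -- receives lines grouped and sorted by key,
    # so a simple running total per key suffices
    import sys

    current, count = None, 0
    for line in sys.stdin:
        page, n = line.rstrip("\n").split("\t")
        if page != current and current is not None:
            print(current + "\t" + str(count))
            count = 0
        current = page
        count += int(n)
    if current is not None:
        print(current + "\t" + str(count))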

Jeff’s reasons for liking Hadoop over relational DBMS at Facebook included: Read more

March 20, 2009

Greenplum claims very fast load speeds, and Fox still throws away most of its MySpace data

Data warehouse load speeds are a contentious issue.  Vertica contrived a benchmark with a 5½ terabyte/hour load rate.  Oracle has gotten dinged for very low load speeds, which are then hotly debated.  I was told recently of a Greenplum partner’s salesman steering a prospect who needed rapid load speeds away from Greenplum, which seemed odd to me.

Now Greenplum has come out swinging, claiming “consistent” load speeds of 4 terabytes/hour at its Fox Interactive Media account, and armed with a customer quote saying just that.  Note however that load speeds tend to be proportional to the number of disks, and there are a LOT of disks at that installation.

One way to think about load speeds is — how long would it take to load the entire database? It seems as if the Fox database could be loaded, perhaps not in one week, but certainly in less than two. Flipping that around, the Fox site only has enough capacity to hold less than two weeks of detailed data. (This is not uncommon in network event kinds of databases.) And a corollary of that is — worldwide storage sales are still constrained by cost, not by absolute limits on the amounts of data enterprises would like to store.
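
To make that arithmetic concrete, here is a back-of-envelope check. The 4 terabytes/hour rate is Greenplum's claim; the database size is an assumed round number for illustration, not a disclosed Fox figure.

    # Back-of-envelope check of the "less than two weeks" claim
    load_rate_tb_per_hour = 4          # Greenplum's claimed rate
    db_size_tb = 1000                  # assumed: a petabyte-class site

    hours = db_size_tb / load_rate_tb_per_hour    # 250 hours
    days = hours / 24                             # about 10.4 days
    print("Full reload: %.1f days" % days)        # one to two weeks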

March 17, 2009

Pervasive DataRush today

In my first post-fire briefing, I had a long-scheduled dinner with the Pervasive DataRush folks.  Much of DataRush’s positioning, feature evolution, and so on remain To Be Determined.  Most existing customers and applications remain To Be Disclosed.  What’s more, DataRush is a technology to accelerate applications that

  1. Need to be parallelized
  2. Should run on SMP rather than shared-nothing hardware

and Pervasive hasn’t done a great job of explaining where #2 applies.

That said, there’s at least one use case for which DataRush should clearly be considered today.  Suppose you have a messy ETL/data transformation task that requires custom code.  Then I see three main choices:

In some cases, DataRush may be the best possibility.
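
To give a flavor of the SMP-parallel approach, here is a minimal sketch of custom transformation code spread across the cores of a single box. Plain Python multiprocessing stands in for DataRush's dataflow engine, and clean_record() is a made-up placeholder for the messy custom logic.

    # Sketch of SMP-parallel custom ETL, the niche DataRush targets.
    # multiprocessing is a stand-in for DataRush's dataflow engine.
    from multiprocessing import Pool

    def clean_record(line):
        # hypothetical transform: normalize delimiters, drop blanks
        fields = [f.strip() for f in line.replace("|", ",").split(",")]
        return ",".join(fields) if any(fields) else None

    if __name__ == "__main__":
        with open("raw_input.txt") as f:          # assumed input file
            lines = f.readlines()
        with Pool() as pool:                      # one worker per core
            cleaned = pool.map(clean_record, lines, chunksize=10000)
        with open("load_ready.txt", "w") as out:
            out.writelines(r + "\n" for r in cleaned if r)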

February 25, 2009

Partial overview of Ab Initio Software

Ab Initio is an absurdly secretive company, as per a couple of prior posts and the comment threads on same. But yesterday at TDWI I actually found civil people staffing an Ab Initio trade show booth. Based on that conversation and other tidbits, I think it’s fairly safe to say: Read more

February 25, 2009

Introduction to Expressor Software

I’ve chatted a few times with marketing chief Michael Waclawiczek and others at data integration startup Expressor Software. Highlights of the Expressor story include:

Expressor’s real goals, I gather, have little to do with the performance + price positioning. Rather, John Russell had a vision of the ideal data integration tool, with a nice logical flow from step to step, suitably integrated metadata management, easy role-based UIs, and so on. But based on what I saw during an October visit, most of that is a ways away from fruition.

February 25, 2009

Talend update

I chatted yesterday at TDWI with Yves de Montcheuil of Talend, as a follow-up to some chats at Teradata Partners in October. This time around I got more metrics, including:

It seems that Talend’s revenue was somewhat shy of $10 million in 2008.

Specific large paying customers Yves mentioned include: Read more

January 27, 2009

Introduction to Pentaho

I finally caught up with Pentaho, which along with Jaspersoft is one of the two most visible open source business intelligence companies, Actuate perhaps excepted. Highlights included:

Read more

January 4, 2009

Expressor pre-announces a data loading benchmark leapfrog

Expressor Software plans to blow the Vertica/Syncsort “benchmark” out of the water, to wit:

What I know already is that our numbers will [be] between 7 and 8 min to load one TB of data and will set another world record for the tpc-h benchmark.
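
Taken at face value, that works out to roughly 8 terabytes/hour, versus the 5½ terabytes/hour of the Vertica/Syncsort benchmark. A quick sanity check:

    # Implied load rate from Expressor's claimed 7-8 minutes per TB
    minutes_per_tb = 7.5                 # midpoint of the claimed range
    tb_per_hour = 60 / minutes_per_tb
    print("%.1f TB/hour" % tb_per_hour)  # 8.0, vs. 5.5 for Vertica/Syncsort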

The whole blog post has a delightful air of skepticism, e.g.:

Sometimes the mention of a join and lookup are documented but why? If the files are load ready what is there to join or lookup?

… If the files are load ready and the bulk load interface is used, what exactly is done with the DI product?

My guess… nothing.

…  But what I can’t figure out is what is so complex about this test in the first place?

December 2, 2008

Data warehouse load speeds in the spotlight

Syncsort and Vertica combined to devise and run a benchmark in which a data warehouse got loaded at 5½ terabytes per hour, which is several times faster than the figures used in any other vendors’ similar press releases in the past. Takeaways include:

The latter is unsurprising. Back in February, I wrote at length about how Vertica makes rapid columnar updates. I don’t have a lot of subsequent new detail, but it made sense then and now. Read more
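
As a reminder of the general idea, here is a toy sketch of the hybrid write-optimized/read-optimized approach that makes fast columnar loads possible. This illustrates the general technique, not Vertica's actual implementation.

    # Toy sketch: inserts land in an in-memory write-optimized buffer,
    # which is periodically merged in batch into sorted read-optimized
    # storage. Queries consult both. Not Vertica's actual code.
    import bisect

    class HybridColumnStore:
        def __init__(self, flush_threshold=100000):
            self.wos = []                     # write-optimized: unsorted
            self.ros = []                     # read-optimized: sorted
            self.flush_threshold = flush_threshold

        def insert(self, key):
            self.wos.append(key)              # O(1) append keeps loads fast
            if len(self.wos) >= self.flush_threshold:
                self.moveout()

        def moveout(self):
            # one batch merge amortizes the cost of keeping ros sorted
            self.ros = sorted(self.ros + self.wos)
            self.wos = []

        def contains(self, key):
            i = bisect.bisect_left(self.ros, key)
            in_ros = i < len(self.ros) and self.ros[i] == key
            return in_ros or key in self.wos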
