Clearing some of my buffer
I have a large number of posts still in my backlog. For starters, there are ones based on recent visits with Aster, Greenplum, Sybase, Vertica, and a Very Large User. I suspect I’ll soon write more about Oracle as well. Plus there’s my whole future-of-online-media area. And quite a bit more will grow out of planned research.
So there are a whole lot of other worthy subjects I doubt I’ll be getting to any time soon. In some cases, of course, other people are doing a great job of writing about them. Here are pointers to a few links that I am glad to recommend:
- I wrote recently that I’ve discovered a number of different in-memory OLAP engines. Cindi Howson far outdid that, writing at length for Intelligent Enterprise on in-memory analytics, in an article that seems itself to be a teaser for a longer, free white paper on the subject.
- The CouchDB folks posted an eye-catching, risqué slide presentation promoting CouchDB and, more generally, key-value stores, at least for internet applications. And yes, they’ve integrated MapReduce.
- Merv Adrian posted favorably about Birst, with special reference to its OEM efforts. As previously noted, I was highly unimpressed with Birst’s end-user BI story at the time of its September roll-out, and Jerome Pineau’s recent examination did nothing to reassure me. But perhaps OEM is a different matter.
- Merv also offers an interesting post about data integration upstart Expressor, and a highly favorable one about “visualization” vendor Tableau.
- Ann All interviewed Nigel Pendse, who grumped that BI features are overrated, and what end users really want is great query performance. I’m not so sure about the features side of that, but I’m hugely in agreement about the performance. That’s a big part of why the analytic DBMS industry is so vibrant. It’s also why in-memory OLAP is suddenly so hot.
Cloudera presents the MapReduce bull case
Monday was fire-drill day regarding MapReduce vs. MPP relational DBMS. The upshot was that I was quoted in Computerworld and paraphrased in GigaOm as being a little more negative on MapReduce than I really am, in line with my comment:
Frankly, my views on MapReduce are more balanced than [my] weary negativity would seem to imply.
Tuesday afternoon the dial turned yet a couple of notches more positive, when I talked with Michael Olson and Jeff Hammerbacher of Cloudera. Cloudera is a new company, built around the open source MapReduce implementation Hadoop. So far Cloudera gives away its Hadoop distribution, without charging for any sort of maintenance or subscription, and just gets revenue from professional services. Presumably, Cloudera plans for this business model to change down the road.
Much of our discussion revolved around Facebook, where Jeff directed a huge and diverse Hadoop effort. Apparently, Hadoop played much of the role of an enterprise data warehouse at Facebook — at least for clickstream/network data — including:
- 2 1/2 petabytes of data managed via Hadoop
- 10 terabytes/day of data ingested via Hadoop (Edit: Some of these metrics have been updated in a subsequent post about Facebook.)
- Ad targeting queries run every 15 minutes in Hadoop
- Dashboard roll-up queries run every hour in Hadoop
- Ad-hoc research/analytic Hadoop queries run whenever
- Anti-fraud analysis done in Hadoop
- Text mining (e.g., of things written on people’s “walls”) done in Hadoop
- 100s or 1000s of simultaneous Hadoop queries
- JSON-based social network analysis in Hadoop
Some Facebook data, however, was put into an Oracle RAC cluster for business intelligence. And Jeff does concede that query execution is slower in Hadoop than in a relational DBMS. Hadoop was also used to build the index for Facebook’s custom text search engine.
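To make the programming model concrete, below is a toy map/reduce pair in the style of Hadoop Streaming, which lets scripts like these run as the map and reduce phases over HDFS data. It counts clicks per page. The tab-delimited log format is invented for the example, and this is of course not Facebook’s actual code (Facebook built higher-level tooling, such as Hive, on top of Hadoop).

```python
#!/usr/bin/env python
# mapper.py -- emit one (page, 1) pair per clickstream record.
# Assumed (made-up) input: timestamp <tab> page_url <tab> user_id ...
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        print("%s\t1" % fields[1])
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop delivers input sorted by key, so counts for
# each page can be summed in a single pass.
import sys

current_page, count = None, 0
for line in sys.stdin:
    page, _, value = line.rstrip("\n").partition("\t")
    if page != current_page:
        if current_page is not None:
            print("%s\t%d" % (current_page, count))
        current_page, count = page, 0
    count += int(value)
if current_page is not None:
    print("%s\t%d" % (current_page, count))
```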
Jeff’s reasons for liking Hadoop over relational DBMS at Facebook included: Read more
Greenplum claims very fast load speeds, and Fox still throws away most of its MySpace data
Data warehouse load speeds are a contentious issue. Vertica contrived a benchmark with a 5 1/2 terabyte/hour load rate. Oracle has gotten dinged for very low load speeds, which are then hotly debated. I was told recently of a Greenplum partner’s salesman steering a prospect who needed rapid load speeds away from Greenplum, which seemed odd to me.
Now Greenplum has come out swinging, claiming “consistent” load speeds of 4 terabytes/hour at its Fox Interactive Media account, and armed with a customer quote saying just that. Note, however, that load speeds tend to be proportional to the number of disks, and there are a LOT of disks at that installation.
One way to think about load speeds is — how long would it take to load the entire database? It seems as if the Fox database could be loaded, perhaps not in one week, but certainly in less than two. Flipping that around, the Fox site only has enough capacity to hold less than 2 weeks of detailed data. (This is not uncommon in network event kinds of databases.) And a corollary of that is — worldwide storage sales are still constrained by cost, not by absolute limits on the amounts of data enterprises would like to store.
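To put rough numbers on that reasoning, here is a back-of-envelope sketch. The 4 terabyte/hour rate is Greenplum’s figure; the 1-petabyte database size is an assumed round number, not a disclosed Fox metric.

```python
# Back-of-envelope arithmetic for the load-speed claim above.
load_rate_tb_per_hour = 4.0     # Greenplum's claimed "consistent" rate
db_size_tb = 1000.0             # ASSUMED round figure, not a Fox disclosure

days_to_load = db_size_tb / load_rate_tb_per_hour / 24
print("Full reload: about %.1f days" % days_to_load)      # ~10.4 days

# Flipping it around: two weeks of nonstop loading caps total capacity.
ceiling_tb = load_rate_tb_per_hour * 24 * 14
print("Two-week loading ceiling: %d TB" % ceiling_tb)     # 1344 TB, ~1.3 PB
```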
Pervasive DataRush today
In my first post-fire briefing, I had a long-scheduled dinner with the Pervasive DataRush folks. Much of DataRush’s positioning, feature evolution, and so on remains To Be Determined. Most existing customers and applications remain To Be Disclosed. What’s more, DataRush is a technology to accelerate applications that
- Need to be parallelized
- Should run on SMP rather than shared-nothing hardware
and Pervasive hasn’t done a great job of explaining where #2 applies.
That said, there’s at least one use case for which DataRush should clearly be considered today. Suppose you have a messy ETL/data transformation task that requires custom code. Then I see three main choices:
- Write the code within the confines of an off-the-shelf ETL tool.
- Write the code to run on an analytic DBMS platform, ideally an MPP/shared-nothing one.
- Use something like DataRush (and I’m not familiar with any good alternatives to DataRush).
In some cases, DataRush may be the best possibility.
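To show the shape of the problem, here is a minimal sketch of an embarrassingly parallel custom transformation on a multi-core SMP box, written with plain Python multiprocessing. It is emphatically not DataRush (a Java dataflow framework); the record format and cleansing rule are invented for illustration.

```python
from multiprocessing import Pool

def transform(record):
    # Hypothetical messy cleanup: trim whitespace, lowercase,
    # and drop empty fields from pipe-delimited records.
    fields = [f.strip().lower() for f in record.split("|")]
    return "|".join(f for f in fields if f)

if __name__ == "__main__":
    records = ["  Foo |BAR| ", "baz||QUX", "| Quux | corge "]  # stand-in feed
    with Pool() as pool:            # one worker process per core, by default
        for cleaned in pool.map(transform, records):
            print(cleaned)
```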
Partial overview of Ab Initio Software
Ab Initio is an absurdly secretive company, as per a couple of prior posts and the comment threads on same. But yesterday at TDWI I actually found civil people staffing an Ab Initio trade show booth. Based on that conversation and other tidbits, I think it’s fairly safe to say: Read more
Introduction to Expressor Software
I’ve chatted a few times with marketing chief Michael Waclawiczek and others at data integration startup Expressor Software. Highlights of the Expressor story include:
- Expressor was founded in 2003 and funded in 2007. Two rounds of funding raised $16 million.
- Expressor’s first product release was in May 2008; before that, Expressor built custom integration tools for a couple of customers.
- Michael believes Expressor will have achieved 5 actual sales by the end of this quarter, as well as being in 25 “highly active” sales cycles.
- Whatever Expressor’s long-term vision, right now it’s selling mainly on the basis of performance and affordability.
- In particular, Expressor believes it is superior to Ab Initio in both performance and ease of use.
- Expressor says that parallelism (unsurprisingly, a key aspect of data integration performance) took a long time to develop. Obviously, they feel they got it right.
- Expressor is written in C, so as to do hard-core memory management for best performance.
- Expressor founder John Russell seems to have cut his teeth at Info USA, which he left in the 1990s. Other stops on his journey include Trilogy (briefly) and then Knightsbridge, before he branched out on his own.
Expressor’s real goals, I gather, have little to do with the performance + price positioning. Rather, John Russell had a vision of the ideal data integration tool, with a nice logical flow from step to step, suitable integrated metadata management, easy role-based UIs, and so on. But based on what I saw during an October visit, most of that is a ways away from fruition.
Talend update
I chatted yesterday at TDWI with Yves de Montcheuil of Talend, as a follow-up to some chats at Teradata Partners in October. This time around I got more metrics, including:
- Talend revenue grew 6-fold in 2008.
- Talend revenue is expected to grow 3-fold in 2009.
- Talend had >400 paying customers at the end of 2008.
- Talend estimates it has >200,000 active users. This is based on who gets automated updates, looks at documentation, etc.
- ~1/3 of Talend’s revenue is from large customers; the other ~2/3 is from the mid-market.
- Talend has had ~700,000 downloads of its core product, and >3.3 million downloads in all (including documentation, upgrades, etc.)
It seems that Talend’s revenue was somewhat shy of $10 million in 2008.
Specific large paying customers Yves mentioned include: Read more
Introduction to Pentaho
I finally caught up with Pentaho, which along with Jaspersoft is one of the two most visible open source business intelligence companies, Actuate perhaps excepted. Highlights included:
- Much like Jaspersoft, Pentaho’s initial focus was mainly on embedded, operational BI.
- However, Pentaho now feels it has a decent end-user GUI as well, and traditional BI is a bigger part of sales.
- Also, some sales are focused on data integration, perhaps in support of more traditional BI products. Pentaho has even had an Ab Initio replacement in data integration. (Can there be any change more extreme than going from Ab Initio to open source?)
- As an example of technical breadth, Pentaho says that its Mondrian OLAP engine is used by Jaspersoft.
- Pentaho has Excel output, but not in the form of live formulas.
- Pentaho does XQuery.
- Industries with more Pentaho adoption than average include:
- Financial services (traditionally open-source-friendly, according to Pentaho)
- Government (ditto)
- Web 2.0 (obviously ditto)
- Travel/transportation (cash-strapped)
- Frontier Airlines is a Pentaho/Greenplum customer.
- TradeDoubler is a Pentaho/Infobright customer. (Pentaho thinks that TradeDoubler reloads its warehouse every day, which if true frankly casts some doubt on Infobright’s architecture.)
- Data mining is something of a Pentaho sideline. Pentaho’s data mining capabilities were built at a university in New Zealand (the Weka project), and some data mining research is done using them. Separately, Pentaho has been integrated with R.
- Community contributions are concentrated in the areas you’d expect — features some user or system integrator needs for a specific project, connectors, bug reports, and the like.
Expressor pre-announces a data loading benchmark leapfrog
Expressor Software plans to blow the Vertica/Syncsort “benchmark” out of the water, to wit:
What I know already is that our numbers will [be] between 7 and 8 min to load one TB of data and will set another world record for the tpc-h benchmark.
The whole blog post has a delightful air of skepticism, e.g.:
Sometimes the mention of a join and lookup are documented but why? If the files are load ready what is there to join or lookup?
… If the files are load ready and the bulk load interface is used, what exactly is done with the DI product?
My guess… nothing.
… But what I can’t figure out is what is so complex about this test in the first place?
Data warehouse load speeds in the spotlight
Syncsort and Vertica combined to devise and run a benchmark in which a data warehouse got loaded at 5 ½ terabytes per hour, which is several times faster than the figures used in any other vendors’ similar press releases in the past. Takeaways include:
- Syncsort isn’t just a mainframe sort utility company, but also does data integration. Who knew?
- Vertica’s design to overcome the traditional slow load speed of columnar DBMS works.
The latter is unsurprising. Back in February, I wrote at length about how Vertica makes rapid columnar updates. I don’t have a lot of subsequent new detail, but it made sense then and now. Read more
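As a refresher on the gist: the trick, as I understand it, is to absorb inserts into a memory-resident write-optimized store and merge them into sorted, compressed column storage only in batches. Here is a toy sketch of that general pattern. It is conceptual only, not Vertica’s implementation; real systems do the merging in the background and compress the sorted runs.

```python
class HybridColumnStore:
    """Toy model of a write-optimized/read-optimized hybrid column store."""
    def __init__(self, flush_threshold=4):
        self.wos = []     # write-optimized store: unsorted, cheap appends
        self.ros = []     # read-optimized store: kept sorted for fast scans
        self.flush_threshold = flush_threshold

    def insert(self, value):
        self.wos.append(value)  # O(1); no sort/compress work on the load path
        if len(self.wos) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Batched merge amortizes the sorting cost over many inserts.
        self.ros = sorted(self.ros + self.wos)
        self.wos = []

    def scan(self):
        # Queries must see both stores until the next flush.
        return sorted(self.ros + self.wos)

store = HybridColumnStore()
for v in [5, 3, 9, 1, 7]:
    store.insert(v)
print(store.scan())             # [1, 3, 5, 7, 9]
```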