Hadoop notes: Informatica, Splunk, and IBM
Informatica, Splunk, and IBM are all public companies, and correspondingly reticent to talk about product futures. Hence, anything I might suggest about product futures from any of them won’t be terribly detailed, and even the vague generalities are “the Good Lord willin’ an’ the creek don’ rise”.
Never let a rising creek overflow your safe harbor.
Anyhow:
1. Hadoop can be an awesome ETL (Extract/Transform/Load) execution engine; it can handle huge jobs and perform a great variety of transformations. (Indeed, MapReduce was invented to run giant ETL jobs.) Thus, if one offers a development-plus-execution stack for ETL processes, it might seem appealing to make Hadoop an ETL execution option. And so:
- I’ve already posted that BI-plus-light-ETL vendors Pentaho and Datameer are using Hadoop in that way.
- Informatica will be using Hadoop as an execution option too.
Informatica told me about other interesting Hadoop-related plans as well, but I’m not sure my frieNDA allows me to mention them at all.
IBM, however, is standing aside. Specifically, IBM told me that it doesn’t see the point of doing the same thing, as its ETL engine — presumably derived from the old Ascential product line — is already parallel and performant enough.
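The ETL pattern described in point 1 can be sketched as a toy map/reduce job. This is pure Python standing in for what Hadoop would distribute across a cluster; the log format and function names are illustrative, not any vendor's API.

```python
# Toy MapReduce-style ETL: Extract fields from raw log lines,
# Transform them into (key, value) pairs, and "Load" aggregated results.
from collections import defaultdict

RAW_LOGS = [
    "2011-10-03 12:00:01 GET /index.html 200",
    "2011-10-03 12:00:02 GET /missing 404",
    "2011-10-03 12:00:03 POST /form 200",
]

def map_phase(line):
    # Extract: parse the raw record; Transform: emit (status, 1) pairs.
    date, time, method, path, status = line.split()
    yield status, 1

def reduce_phase(pairs):
    # Load: aggregate counts per HTTP status code.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = [kv for line in RAW_LOGS for kv in map_phase(line)]
print(reduce_phase(pairs))  # {'200': 2, '404': 1}
```

A real Hadoop job would run many map tasks in parallel over file splits and shuffle the pairs to reducers, but the Extract/Transform/Load division of labor is the same.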
2. Last year, I suggested that Splunk and Hadoop are competitors in managing machine-generated data. That’s still true, but Splunk is also preparing a Hadoop co-opetition strategy. To a first approximation, it’s just Hadoop import/export. However, suppose you view Splunk as offering a three-layer stack:
- Analytics/visualization.
- Storage/indexing.
- Collection.
Then potentially the data could flow
Native log –> Splunk (collection) –> Hadoop –> Splunk (visualization)
I think that’s cool.
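The three-layer flow above amounts to function composition; here is a minimal sketch in which each stage is a stand-in function, not any real Splunk or Hadoop API.

```python
# Stand-ins for the three layers: Splunk collection, Hadoop batch
# processing, and Splunk visualization. All names are illustrative.
def splunk_collect(raw_log_lines):
    return [line.strip() for line in raw_log_lines]  # collection layer

def hadoop_process(records):
    return {"events": len(records)}  # stand-in for a Hadoop batch job

def splunk_visualize(summary):
    return f"dashboard: {summary['events']} events"  # visualization layer

print(splunk_visualize(hadoop_process(splunk_collect(["a\n", "b\n"]))))
```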
The other Splunk/Hadoop future I know is to enhance the ability for Splunk to capture Hadoop operations data, in two ways:
- Provide some prewritten filters to extract data fields from Hadoop operations logs.
- Get at Hadoop operations data that isn’t found in logs, via operator utilities and the like.
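The first bullet above — prewritten filters that extract fields from Hadoop operations logs — is essentially a library of regexes. Here is a hypothetical sketch; the log format and field names are assumptions for illustration, not Splunk's actual filters.

```python
# Hypothetical prewritten filter: pull structured fields out of a
# JobTracker-style Hadoop log line via a named-group regex.
import re

LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*"
    r"JobInProgress: job_(?P<job_id>\w+) .*maps=(?P<maps>\d+)"
)

def extract_fields(line):
    # Return a dict of extracted fields, or None if the line doesn't match.
    match = LOG_PATTERN.search(line)
    return match.groupdict() if match else None

sample = "2011-10-03 12:34:56,789 INFO JobInProgress: job_201110030001_0007 launched, maps=40"
print(extract_fields(sample))
```

A product would ship dozens of such patterns, one per log format, so the fields become searchable without the user writing any regexes.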
3. I wrote about an important aspect of IBM’s “Big Insights” Hadoop story months ago, namely IBM’s general recommended data topology. Beyond that, IBM offers:
- Its own Hadoop distribution, for free, with a small amount of IBM intellectual property added.
- Proprietary closed-source software, that runs on top of either IBM’s or Cloudera’s Hadoop distributions.
Unfortunately, I didn’t understand what, if anything, is currently interesting about IBM’s proprietary Hadoop capabilities. There seem to be some Hadoop performance tweaks, and something that sounded like Datameer 1.0 (“Big Sheets”), and surely some management tools as well. But I didn’t grasp any reason to favor Big Insights over, for example, the combination of Datameer and Cloudera Enterprise.
One last note: I was surprised to learn that IBM’s Platform Computing acquisition is not involved in Big Insights. Perhaps that integration will come later on.
Comments
9 Responses to “Hadoop notes: Informatica, Splunk, and IBM”
Obviously we at IBM have not done a great job of explaining the value of BigInsights. I can’t do justice to it in this comment, but here is a very quick summary. IBM’s value-add on top of open source Apache Hadoop can be segmented into two areas:
1. Operational, and
2. Analytic capability
On the operational side there are things like security, Adaptive MapReduce (run smaller jobs more efficiently), blue wash (remove concerns on viral aspect of open source), installer, consolidated web console for simplified operation, and additional adapters for getting data out of a number of data sources.
Also on the operational side, but moving up the stack into development and adoption, is a comprehensive tool set to help people build Hadoop solutions. This includes things like JAQL, a scripting language that is easier to use than commonly used alternatives such as Pig. There is also a programming tool set, based on Eclipse, that greatly simplifies the process of building Hadoop applications. Most important, there are pre-packaged and pre-tested assets we call accelerators that make it easy to put together domain-specific solutions for Big Data. Think telco, social media analytics, etc.
On the analytics and visualization side there are BigSheets (a spreadsheet paradigm for dealing with Hadoop data) and System T for text analytics (the same kind used by IBM Watson on Jeopardy), to name a few.
Also, as of v1.4, every BigInsights Enterprise Edition customer gets a node of IBM InfoSphere Streams. Streams provides real-time, in-memory analytics; it is super fast, super scalable, and can deal with hundreds of thousands of events per second. One of our clients processes over 400K CDRs per second in production. We have benchmarked processing 14 million log records. This is in support of the IBM view that Big Data is not just Hadoop. So, IBM BigInsights customers get a lot more than just Hadoop. They get enterprise-ready Hadoop, or they can use Cloudera if they already have that. They get IBM analytics, they get development tools and a set of pre-built, field-tested, domain-specific accelerators, and they get a taste of in-memory real-time analytics.
Thanks, Leon!
On the operational side, what do you offer that Cloudera Enterprise doesn’t?
On the accelerator side, which are actually available today?
On the operational side I’d single out additional job scheduling options, Adaptive MapReduce, and IBM-specific compression as the top three things that BigInsights does that Cloudera does not. Beyond the standard Hadoop schedulers (e.g. FAIR and FIFO), the BigInsights scheduler allows an administrator to optimize for response time by giving small jobs more resources so they complete quicker. In many cases, the cost of starting mappers can be quite high; BigInsights Adaptive MapReduce brings those costs down. Hadoop does not natively support splittable text compression, i.e. a single map task processes the entire compressed text file. BigInsights (JAQL) automatically recognizes splittable text compression (BigInsights uses LZO) and creates multiple map tasks to operate on a single file.
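The splittability point is worth a back-of-envelope illustration: with a non-splittable codec (e.g. plain gzip), one compressed file means one map task regardless of size, while a splittable codec (e.g. indexed LZO) lets the file be carved into block-sized splits, each its own mapper. The numbers below are illustrative, not benchmarks.

```python
# Rough sketch of how input splittability determines map-task count.
import math

def map_task_count(file_size_bytes, block_size_bytes, splittable):
    if not splittable:
        return 1  # whole compressed file goes to a single mapper
    # Splittable input: one map task per block-sized split.
    return math.ceil(file_size_bytes / block_size_bytes)

ten_gb = 10 * 1024**3
block = 64 * 1024**2  # classic 64 MB HDFS block size

print(map_task_count(ten_gb, block, splittable=False))  # 1
print(map_task_count(ten_gb, block, splittable=True))   # 160
```

One mapper grinding through 10 GB serially versus 160 mappers running in parallel is the whole performance argument in miniature.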
On the accelerator side, there are 60+ toolkits and accelerators available today for Streams. Most of these are specific to a narrow problem. What we started to do is to combine these to create solution-specific accelerators. A number of these are deployed with clients where we are working with them on enhancing and generalizing the solution.
Thanks, Leon. That helps.
Do you have clients who get more than a few percent benefit in performance from all that?
We have this beer in Canada called Alexander Keith’s. Their slogan is “Those who like it, like it a lot.” I think it describes the BigInsights operational enhancements I described. They don’t produce order-of-magnitude improvements for everyone. But for those with use cases where these things matter, they are very valuable improvements over stock Hadoop.
I do think that it is important not to take each individual feature in isolation. Like I said before, the value of BigInsights is not in any one feature but in the whole package. Oh, and by the way, you can deploy BigInsights on top of Cloudera if you already have that. One thing that surprises people all the time is how inexpensive BigInsights is. Its price is based on the amount of data under management, not the number of nodes in the cluster. Low price is not the first thing that springs to mind when thinking of IBM, but it is the case with BigInsights.
Curt, it was a shame that we didn’t have more time to cover ETL in our recent big data platform briefing. Regarding ETL and Hadoop, IBM is clearly not standing aside. Information Server is already integrated with Hadoop as a source and target for data integration. We have deeper integration planned for an upcoming release. I’d also like to clarify the point regarding performance — IBM’s Information Server has better performance because it is a purpose-built parallel processing engine designed for ETL. Many other vendors have utilized Hadoop as a means to patch up a performance or scalability issue in their product, whereas our strategy is to utilize Hadoop as a target, a source, and, when appropriate, a platform for processing workloads.
Of course there are many other cost advantages to a pre-built integration platform over a general-purpose tool plus a build-it-yourself approach. Information Server has capabilities for data discovery, profiling, metadata management, data quality, ETL job design, and administration, among many others, which yield a much lower total cost of ownership. There are too many points to cover in this blog reply, but we are looking forward to delivering a first deep-dive briefing to you on Information Server in the near term.
Thanks, David. I look forward to learning more!
[…] Enterprise adoption of Hadoop for ETL/ELT/data refinement could explode after more software vendors offer support for it. […]
[…] follows through on the Hadoop/Splunk (get it?) co-opetition I foreshadowed last year, including access to Hadoop via the same tools that run over the Splunk data store, plus […]