The substance of Pentaho’s Hadoop strategy
Pentaho has been talking about a Hadoop-related strategy. Unfortunately, in support of its Hadoop efforts, Pentaho has been — quite insistently — saying things that don’t make a lot of sense to people who know anything about Hadoop.
That said, I think I found four sensible points in Pentaho’s Hadoop strategy, namely:
- If you use an ETL tool like Pentaho’s to move data in and out of HDFS, you may be able to orchestrate a couple more steps in the ETL process than if you used Hadoop’s native orchestration tools.
- A lot of what you want to do in MapReduce can be specified graphically in an ETL tool like Pentaho’s. (Tokenization and regular-expression matching are examples.)
- If you have some really lightweight BI requirements (ad hoc, reporting, or whatever) against HDFS data, you might be content to do it straight against HDFS, rather than moving the data into a real DBMS. If so, BI tools like Pentaho’s might be useful.
- Somebody might want to use a screwy version of MapReduce, where by “screwy” I mean anything that isn’t Cloudera Enterprise, Aster Data SQL/MapReduce, or some other implementation/distribution with a lot of supporting tools. In that case, they might need all the tools they can get.
The first of those points is, in the grand scheme of things, pretty trivial.
The third one makes sense. While Hadoop’s Hive interface means you could roll your own integration with your favorite BI tool in any case, having a vendor certify that integration for you could be nice. So if Pentaho ships something that works before other vendors do, good on them. (Target date seems to be October.)
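For what it’s worth, “rolling your own” against Hive doesn’t take much code. Here’s a minimal sketch of the idea in Python, assuming a Hive server listening on its default Thrift port and the PyHive client library; the table and column names are invented for illustration:

```python
# Hypothetical example: ad hoc reporting straight against HDFS data via Hive.
# Assumes the PyHive library and a Hive server on the default port 10000;
# the weblogs table and its columns are made up for illustration.
from pyhive import hive

conn = hive.connect(host="hive.example.com", port=10000)
cursor = conn.cursor()
cursor.execute(
    "SELECT page, COUNT(*) AS hits "
    "FROM weblogs GROUP BY page "
    "ORDER BY hits DESC LIMIT 20"
)
for page, hits in cursor.fetchall():
    print(page, hits)
```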
The fourth one is kind of sad.
But if there’s any shovel-meet-pony aspect to all this, or indeed a reason for writing this blog post, it would be the second point. If one understands data management but is in the “Oh no! Hadoop wants me to PROGRAM!” crowd, then being able to specify one’s MapReduce graphically could be a really nice alternative to actually coding it.
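To make that concrete, here is what one of those “graphically specifiable” jobs looks like when you do have to code it: a tokenize-and-count pass written as a Hadoop Streaming mapper/reducer pair in Python. This is only an illustrative sketch (the file names and the tokenizing regex are mine, not anything Pentaho ships); the point is that the logic is simple enough that a drag-and-drop ETL step could plausibly express it.

```python
#!/usr/bin/env python
# mapper.py (illustrative): tokenize each input line with a regex and emit
# tab-separated (token, 1) pairs for Hadoop Streaming.
import re
import sys

TOKEN = re.compile(r"[A-Za-z0-9']+")  # an example tokenizing rule

for line in sys.stdin:
    for token in TOKEN.findall(line.lower()):
        sys.stdout.write("%s\t1\n" % token)
```

```python
#!/usr/bin/env python
# reducer.py (illustrative): sum the counts for each token; Hadoop Streaming
# delivers mapper output to the reducer sorted by key.
import sys

current, total = None, 0
for line in sys.stdin:
    token, count = line.rstrip("\n").split("\t", 1)
    if token != current:
        if current is not None:
            sys.stdout.write("%s\t%d\n" % (current, total))
        current, total = token, 0
    total += int(count)
if current is not None:
    sys.stdout.write("%s\t%d\n" % (current, total))
```

You would submit this with something like `hadoop jar hadoop-streaming.jar -input ... -output ... -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`. None of it is hard, but it is code, which is exactly the barrier Pentaho says it wants to remove.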
Comments
Hi Curt,
I mostly agree with your point #2: integrating Hadoop into Pentaho Data Integration (Kettle) will lower the barrier for non-programmers to work with MapReduce on data stored in Hadoop.
Another (minor) point that you don’t mention explicitly is that this will make it easier for ETL/BI developers to simply get the Hadoop data into a data warehouse or data mart. So it can help even for BI needs that go beyond your point #3.
kind regards,
Roland Bouman
Hi Curt,
Thanks for writing about Pentaho’s efforts on Hadoop. You make a general statement at the beginning of your write up: “Pentaho has been – quite insistently – saying things that don’t make a lot of sense to people who know anything about Hadoop”.
We are always working to improve our messaging. Would you mind sharing with us what Pentaho is saying that you don’t think makes sense to the Hadoop Community?
Thanks!
Will Gorman
Will,
I’ve already shared my thoughts with your colleagues in a couple of phone conversations, even though you’re not my clients. Their response was “Oh no. We’re really right, based on conversations with some prospects.”
They finally toned down that part after an exchange along these lines:
Me: “X doesn’t see the point of your story either.”
Pentaho: “Well, that may have been the case a few weeks ago, but we’ve explained things to X since then.”
Me: “Actually, I talked with X yesterday.”
The thing is — Hadoop is used, more often than in any other way, as an ETL tool. The idea that “Captain Pentaho is going to rescue you from Hadoop’s inaccessibility by adding ETL” just doesn’t match Hadoop reality. I also had trouble matching the data flow you suggest from logs to HDFS to data warehouses and data marts against reality. It wasn’t totally crazy, but it was somehow off.
The whole thing sounded as if you guys think HDFS is the essence of Hadoop, with all the various programming stacks — database-focused or otherwise — being secondary. Frankly, that’s pretty insulting to the efforts of a bunch of smart developers.
If you asserted that your technology stack was good in some ways that others aren’t, along the lines of my blog post, that would be one thing. But your implication that you’re swooping in with the first good stack for Hadoop is pretty ridiculous.
Hope this helps,
CAM
Roland,
Your second (minor) point sounds like my #1 — which I suggested was minor. 😉
Best regards,
CAM
Curt,
Upon rereading, you did indeed mention it in #1. Thanks for the correction 🙂
It sounds as though Hadoop is only being used as a new data source: HDFS feeding Pentaho’s ETL tools. I think the ability to build and explore data cubes on top of Hadoop/Hive/HDFS is what most customers want. And custom reports, of course.
I disagree with your comment “used primarily as an ETL tool”. We use it as an ETL tool and as a cheap online/nearline data store alongside the DW. I doubt you will find anyone using Hadoop to just process data without storing terabytes in HDFS. One of the primary use cases is not having to use up expensive RDBMS storage to satisfy regulatory or compliance requirements around data that is never or rarely used. It’s also a good place to park ancient history.
One of the problems with Hadoop as it is currently instantiated is that you need to be a programmer to do ETL. This leaves all your Informatica/DataStage/Ab Initio types out in the cold and feels like a step backwards. The need that Pentaho is trying to address is for a graphical 4GL that runs on top of MapReduce.
For example, in our current Hadoop implementation we are parsing logfiles using a combination of Pig and Perl, filtering them down, and loading them into the DW. We also store the logs for ad hoc querying and for reloading if we get our filter rules wrong. It works; however, I cannot get any use out of my normal ETL developer types, since it’s a specialized skill set. It would be nice to replace that parse-and-filter logic with a graphical 4GL that ran natively on Hadoop.
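To illustrate for readers who haven’t seen this kind of pipeline, here is a hypothetical sketch of the parse-and-filter step as a map-only Hadoop Streaming job in Python. (The commenter’s actual pipeline uses Pig and Perl; the log format, regex, and filter rule below are invented for illustration.)

```python
#!/usr/bin/env python
# parse_filter_mapper.py (illustrative): parse web-server log lines, drop the
# ones we don't want, and emit tab-separated rows ready to load into the DW.
# Run as a map-only Hadoop Streaming job, e.g.:
#   hadoop jar hadoop-streaming.jar -D mapred.reduce.tasks=0 \
#     -input /logs/raw -output /logs/filtered \
#     -mapper parse_filter_mapper.py -file parse_filter_mapper.py
import re
import sys

# Rough combined-log-format pattern; purely an example.
LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<bytes>\S+)'
)

for line in sys.stdin:
    m = LOG.match(line)
    if not m:
        continue  # malformed line: filter it out
    if not m.group("status").startswith("2"):
        continue  # example filter rule: keep only 2xx responses
    sys.stdout.write("\t".join(
        m.group("ip", "ts", "method", "url", "status", "bytes")) + "\n")
```

The downstream “load into the DW” step would stay with whatever bulk loader is already in use; the sketch only covers the part the commenter describes hand-coding today.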
I’ll leave my 2 cents (if you don’t mind):
- As a sysadmin, I use Hadoop/Pig like “awk on steroids,” and it just works. 🙂
- Sometimes I use Nutch (which is based on Hadoop).
- Sometimes I use only HDFS, but very rarely.
Hadoop (and MapReduce) is all about performance, optimization, and getting that performance on affordable hardware. It obviously requires some programming, or at least some kind of “SQL-ish” language (like Pig).
You use MapReduce to write your very own optimized code for your very own needs. Not doing that is like saying “OK, I want to do GPGPU (CUDA, OpenCL) computing, but I don’t want to code and it doesn’t have to be optimized for my needs.” That makes no sense.
And if you buy NVIDIA’s CUDA book: it’s 1% “what is CUDA?” and 99% “how to write faster code?”
The Pig and Hadoop documentation is much the same: “what is it?” and “how to write fast code?”
So, please, what is this “Oh no! Hadoop wants me to PROGRAM!”?
(“Oh no! Linux wants me to use my KEYBOARD?!”)
@unholyguy
Frankly speaking, I did not get your point. What do you disagree with? Pentaho’s Hadoop integration is still at the vaporware stage; you can check their website.