Hadapt (commercialized HadoopDB)
Hadapt, the company commercializing HadoopDB, is finally launching, albeit with code rewritten from scratch. As you may recall, the core idea of HadoopDB is to put a DBMS on every node and use MapReduce to coordinate query execution across the whole cluster. The goal is the same SQL/MapReduce integration you get with Hive, but with much better performance* and perhaps somewhat better SQL functionality.** Advantages vs. a DBMS-based analytic platform that includes MapReduce, such as Aster Data, are less clear.
*At least if the underlying DBMS is a fast one. Hadapt likes VectorWise for that purpose, and is showing performance comparisons that assume VectorWise is underneath.
**It seems safe to assume that Hadapt will eventually have more SQL coverage than Hive does today.
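For readers who want the core idea spelled out, here is a minimal sketch of split execution, with in-memory SQLite standing in for the per-node DBMS and plain Python standing in for the MapReduce layer. The table, query, and sharding are invented for illustration; this is not Hadapt's code or interfaces.

```python
# Minimal sketch of the split-execution idea behind HadoopDB/Hadapt:
# every "node" runs its own single-node DBMS (in-memory SQLite stands in
# here for PostgreSQL or VectorWise), the SQL fragment is pushed down to
# each node's database, and a reduce-style step merges the partial results.
import sqlite3

def make_node(rows):
    """One 'node': a private DBMS holding a shard of the table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return db

nodes = [
    make_node([("east", 10.0), ("west", 5.0)]),
    make_node([("east", 7.5), ("north", 2.0)]),
    make_node([("west", 1.0), ("north", 4.0)]),
]

# "Map" side: each node's DBMS does the scan and partial aggregation locally.
pushdown_sql = "SELECT region, SUM(amount) FROM sales GROUP BY region"
partials = [row for db in nodes for row in db.execute(pushdown_sql)]

# "Reduce" side: merge the per-node partial aggregates.
totals = {}
for region, subtotal in partials:
    totals[region] = totals.get(region, 0.0) + subtotal

print(totals)  # totals per region: east 17.5, west 6.0, north 6.0
```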
It’s still early days for the Hadapt company. Funding is on the angel level. There seem to be six employees — Yale professor Daniel Abadi, CEO Justin Borgman, Chief Scientist Kamil Bajda-Pawlikowski,* and three other coders. The Hadapt product will go into beta at an unspecified future time; there currently are a couple of alpha users/design partners. The Hadapt company, a Yale spin-off, obviously needs to move from Connecticut soon. I wasn’t able to detect any particular outside experience in the form of directors or advisors. And Hadapt’s marketing efforts are still somewhat ragged. So basically, the reasons for believing in Hadapt pretty much boil down to:
- Daniel Abadi is a star.**
- Hadapt’s own tests show that Hadapt is a whole lot faster than Hive.
*Bajda-Pawlikowski is one of the two Abadi students who did the HadoopDB work. It turns out he had numerous years of coding experience before entering graduate school. (The other student, Azza Abouzeid, is pursuing an academic career.)
**Vertica was built around Daniel’s C-Store Ph.D. thesis. He was involved in H-Store as well. He has a really good blog. He’s a really nice guy. Etc.
As you might have guessed from the name, the Hadapt guys are proud that their technology is “adaptive,” which communicates their fond belief that Hadapt’s query planning and optimization are more modern and cool than other folks’. In particular, Daniel suggested that Hadapt is more thoughtful than most DBMS are about looking at the size of intermediate result sets and then replanning queries accordingly.
However, the really cool adaptivity point is that Hadapt watches the performance of individual nodes, and takes that into account in query replanning. Daniel asserts, credibly, that this is a Really Good Feature to have in cloud and/or virtualized environments, where Hadapt might not have full control and use of its nodes. I’d add that it could also give Hadapt a lot of flexibility to be run on clusters of non-identical machines.
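As an illustration only (my own toy sketch, not Hadapt's planner), the node-aware idea amounts to weighting work assignments by observed per-node throughput rather than assuming identical machines:

```python
# Toy sketch of node-aware planning: assign shares of the remaining work
# in proportion to each node's observed throughput, so a slow or
# noisy-neighbor node in a virtualized cluster gets less to do.
# Node names and throughput numbers are invented for illustration.
observed_rows_per_sec = {"node1": 90_000, "node2": 88_000, "node3": 31_000}

def plan_assignments(remaining_rows, throughput):
    total = sum(throughput.values())
    return {node: int(remaining_rows * rate / total)
            for node, rate in throughput.items()}

print(plan_assignments(12_000_000, observed_rows_per_sec))
# node3 is roughly 3x slower than its peers, so it gets roughly a third
# as many rows; if its throughput recovers mid-query, a replanning pass
# over the still-unassigned rows would rebalance the work.
```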
On the negative side, Hadapt will not at first have any awareness of how its underlying DBMS are optimized; it will plan for VectorWise the same way it does for PostgreSQL. In that regard, this is a DATAllegro 1.0 story. If I understood correctly, Hadapt has specific connectors for a couple of DBMS (probably exactly those two), and can also talk JDBC to anything. PostgreSQL was apparently 5X faster than MySQL when tested (with either MyISAM or InnoDB); Daniel snorted about, for example, MySQL’s apparent fondness for nested-loop joins over hybrid hash. On the other hand, he was more circumspect about his reasons for favoring VectorWise over, say, the open source columnar DBMS Infobright.
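For anyone who has not thought about the join-algorithm jab: a nested-loop join touches every pair of rows, while a hash join builds a table on one input and probes it with the other. A generic sketch of the difference (my own toy code, not MySQL's or anyone else's implementation):

```python
# Why the join-algorithm point matters: a nested-loop join compares every
# pair of rows (O(n*m)), while a hash join builds a hash table on one input
# and probes it with the other (roughly O(n + m)).
def nested_loop_join(left, right, key):
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key):
    buckets = {}
    for l in left:                           # build phase on one input
        buckets.setdefault(l[key], []).append(l)
    return [(l, r) for r in right            # probe phase on the other
                   for l in buckets.get(r[key], [])]

l_rows = [{"id": i, "a": i * 2} for i in range(200)]
r_rows = [{"id": i, "b": i * 3} for i in range(200)]
assert len(nested_loop_join(l_rows, r_rows, "id")) == \
       len(hash_join(l_rows, r_rows, "id")) == 200
```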
And finally, a couple of other points:
- Hadapt will be closed source, although it will of course rely on large amounts of other people’s open source software. Pay no attention to the importance Daniel previously ascribed to HadoopDB’s open source nature.
- Hadapt decompresses data before moving it from node to node, and also before doing non-SQL MapReduce operations on it. Pay no attention to the years Daniel spent insisting columnar DBMS absolutely must operate on data in compressed form.
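For context on that last point, here is a minimal illustration of what operating on compressed data buys you. It is a generic run-length-encoding example with made-up column contents, not a description of Hadapt's or VectorWise's internals.

```python
# What "operating on data in compressed form" means, in miniature: with
# run-length encoding, an aggregate can be computed directly from
# (value, run_length) pairs, without materializing the raw rows.
rle_column = [(5.0, 1_000_000), (7.5, 250_000), (5.0, 40_000)]  # (value, count)

compressed_sum = sum(value * count for value, count in rle_column)
row_count = sum(count for _, count in rle_column)
print(compressed_sum, row_count)  # 7075000.0 1290000

# Shipping the column between nodes decompressed, by contrast, means
# expanding those three pairs back into 1.29 million individual values.
```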
Comments
Thanks for the kind words. A few minor things:
– Hadapt doesn’t have any official preference for VectorWise over Infobright. We have spent a little more time working with VectorWise and PostgreSQL in our early stages, but Infobright is third on the list and could possibly rise moving forward.
– We do some operations on compressed data in the DBMS, but it is true that we don’t do as much as we did in previous projects I was involved in (see my 2006 paper on this topic within C-Store). That will hopefully improve moving forward, but we are somewhat limited by the API given to us from the underlying DBMS we use. For example, VectorWise is still in the early stages, and does very little direct operation on compressed data (though their decompression is super-fast).
– Although our investors are not very comfortable with open source, I will do everything I can to get the source code to our academic collaborators.
What happens to VoltDB now?
No relationship to VoltDB, which Daniel hasn’t seemed to be much involved with for a long time.
“Although our investors are not very comfortable with open source, I will do everything I can to get the source code to our academic collaborators.”
Prof. Abadi: you are either open source software or you aren’t. Or you could have a community edition, or make it free for academics.
Otherwise you will end up violating the GPL. I don’t see how you could create a closed source, non-GPL-compliant DB based on software that relies on open source components. It seems your investors would rather be acquired by Teradata/Oracle/HP/IBM (the usual exit) and take their chances with GPL violations.
Ajay, I respectfully disagree with your comments around licensing.
HadoopDB, the inspiration for Hadapt, is licensed under Apache 2. Thus it is quite acceptable to incorporate HadoopDB concepts (if not the actual code), add value and release the resulting work under a closed source commercial license. In addition, as the authors and exclusive owners of the IP they develop, Hadapt is entitled to also provide that IP to whomever they wish under whatever licensing arrangements they choose.
Thus, Dr. Abadi’s comments about providing source code to academic collaborators are quite consistent with both the letter and spirit of open source licensing.
Folks — as much as I may disagree with Fred on other matters, when it comes to pure open source mechanics, he is well worth listening to. 😉
Interesting, but I am not convinced by the Hadapt strategy.
Using MapReduce primitives as the main distribution mechanism also occurred to me during the development of the DeepCloud project, but I discarded it (and TCP) when I started designing for Vectorwise-based nodes.
In the early DeepCloud prototypes, it became apparent that a node DBMS that was 70 times faster just made the data movement system look really slow.
Instead of a query taking 1 minute to run and 1 minute for data movement, it took 1 second to run and 1 minute for data movement. The result was that the 70-fold increase available from Vectorwise was turned into less than a mere doubling of performance in the MPP array. Clearly, conventional I/O systems are not up to the task.
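To spell out that arithmetic (an illustrative check of the numbers above, nothing more):

```python
# Checking the arithmetic above: if only the query-execution slice speeds
# up 70x while data movement stays fixed, the overall gain collapses
# toward the data-movement time (Amdahl's law in miniature).
query_s, movement_s = 60.0, 60.0               # 1 minute each, as in the example
speedup_in_dbms = 70.0

before = query_s + movement_s                  # 120 s end to end
after = query_s / speedup_in_dbms + movement_s # ~60.9 s end to end
print(before / after)                          # ~1.97: "less than a mere doubling"
```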
Although you can cache these data movements, they still have to be done from time to time. DeepCloud gets around this by speeding up the data transfers with RDMA from OpenMPI and OpenFabrics interconnecting meshes (there is more on this at http://www.deepcloud.co)
Additionally, MapReduce has latency problems that are inherent in both its batch nature and its use of TCP.
TCP round trip messaging times (ie calling through the ISO 7 layer protocols, OS, NIC, cable, switch and back again) can be as high as 400 ms. OpenFabrics/Infiniband round trips can do this in less than 4ms. Such propagation delays become much more important as the cluster becomes larger.
TCP typically increases the communications latency by at least a factor of 10. MapReduce increases this further still. IMHO the combination is not suitable for large clusters, where real time responses are required.
Randolph,
Interesting. A significant fraction of the MPP analytic DBMS vendors have gone away from TCP — Oracle (Exadata), DATAllegro (dunno about Microsoft), Teradata, Netezza, ParAccel off the top of my head. Probably more.
Curt,
I noticed a small error in my previous post:
“calling through the ISO 7 layer protocols, OS, NIC, cable, switch and back again”
Should have read:
“calling through the ISO 7 layer protocols, OS, NIC, cable, switch, cable, NIC, OS, ISO protocols and back again” but I think you get the gist of it anyway.
BTW:
Apparently MapReduce has even worse latency than TCP, as it suffers from ‘straggler’ problems (according to Google). It sounds a bit like the “adaptability” that Daniel talks about above may be an attempt to address this.
Curt,
Yes, the MPP vendors simply can’t keep scaling up on TCP; that’s because these propagation delays ‘gang up’ as you scale. Depending on the system design, the effect can grow much faster than linearly.
i.e., N nodes talking to N nodes * 400 ms = a long time (when N is big)
Even strategies like Greenplum’s, where multiple Ethernets are used to split the load, can only lower the total latency by the number of Ethernet backplanes (usually between 2 and 4, unless they have changed this since I worked there).
The brilliance of the OpenMPI solution (I take no credit; it’s all down to the hard work of people like Jeff Squyres and Ralph Castain) is that it applies clever optimisations, like algorithms that propagate multicasts via binary neighbor messaging and shared-memory transfers. (Check this paper out for an example of one technique: http://www.open-mpi.org/papers/cac-2007/cac-2007.pdf)
The OpenMPI approach attempts to reduce the N*N cluster communications problem down to N log N transfers or fewer whenever possible. OpenFabrics or Infiniband then comes into play to reduce latency further.
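To put rough numbers on the N*N vs. N log N point, a back-of-envelope calculation using the 400 ms figure quoted earlier (illustrative only; real clusters overlap and pipeline these messages rather than paying for them serially):

```python
# Back-of-envelope contrast between all-pairs (N*N) coordination and
# tree-structured (N log N) messaging, using the 400 ms round-trip figure
# quoted earlier in the thread.
import math

round_trip_s = 0.4
for n in (16, 256, 1024):
    all_pairs = n * n * round_trip_s
    tree_like = n * math.log2(n) * round_trip_s
    print(f"{n:>5} nodes  all-pairs: {all_pairs:>12,.0f} s  N log N: {tree_like:>9,.0f} s")
# At 1,024 nodes that is ~419,430 s of accumulated latency for all-pairs
# versus ~4,096 s for N log N, which is why tree-style collectives and
# lower-latency fabrics both matter.
```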
This technology has allowed OpenMPI to be scaled to 3,060 nodes on the IBM “Roadrunner” and 4,480 nodes on the Sandia “Thunderbird” supercomputers.
I believe the *theoretical* maximum of OpenMPI is over 16,000 nodes. At around 3TB per node, this gives DeepCloud MPP a theoretical maximum of about 49 PB.
Does anyone want to lend us ‘some’ hardware so we can prove the theory? ☺