Hadapt is moving forward
I’ve talked with my clients at Hadapt a couple of times recently. News highlights include:
- The Hadapt 1.0 product is going “Early Access” today.
- General availability of Hadapt 1.0 is targeted for an officially unspecified time frame, but it’s soon.
- Hadapt raised a nice round of venture capital.
- Hadapt added Sharmila Mulligan to the board.
- Dave Kellogg is in the picture too, albeit not as involved as Sharmila.
- Hadapt has moved the company to Cambridge, which is preferable to Yale environs for obvious reasons. (First location = space they’re borrowing from their investors at Bessemer.)
- Headcount is in the low teens, with a target of doubling fast.
The Hadapt product story hasn’t changed significantly from what it was before. Specific points I can add include:
- With one exception to date, Hadapt beta customers have used PostgreSQL as the underlying DBMS, rather than some faster columnar system.
- Sure, you want to process data on the nodes where it resides on the cluster. But if the data is replicated 3X or so, that gives you good flexibility to be adaptive by deciding which of the three copies you’ll operate against.
- In Hadapt Version 1.0, scheduling and workload management are pretty much Hadoop’s. However …
- … an improvement in scheduling is being actively researched.
- In general, Hadapt’s design philosophy for executing SQL is to use MapReduce to get data to the proper nodes, while using the underlying DBMS for node-specific operations such as:
- Initial retrieval from disk.
- Joins and aggregations on data residing at (or visiting) a specific node.
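That two-phase division of labor — local DBMS work per node, MapReduce-style movement and merging across nodes — can be sketched in a few lines. This is a toy illustration only: the per-node databases here are in-memory SQLite instances standing in for Hadapt's per-node PostgreSQL, and the final merge loop plays the role MapReduce plays in the real architecture.

```python
import sqlite3
from collections import defaultdict

# Each "node" holds one partition of a sales table. In Hadapt proper,
# each node would run a full PostgreSQL instance; SQLite is a stand-in.
partitions = [
    [("apple", 3), ("pear", 5)],    # rows on node 0
    [("apple", 2), ("banana", 7)],  # rows on node 1
]

nodes = []
for rows in partitions:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (item TEXT, qty INTEGER)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    nodes.append(db)

# Phase 1: node-local aggregation, pushed down into the local DBMS.
partials = []
for db in nodes:
    partials.extend(db.execute("SELECT item, SUM(qty) FROM sales GROUP BY item"))

# Phase 2: a reduce-style merge of the partial aggregates across nodes
# (the cross-node work that MapReduce handles in Hadapt's design).
totals = defaultdict(int)
for item, qty in partials:
    totals[item] += qty

print(dict(totals))  # {'apple': 5, 'pear': 5, 'banana': 7}
```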
A very busy Daniel Abadi also took the time to walk me through how Hadapt does joins. More precisely, what we discussed about joins includes some of the last features being added to Hadapt 1.0; many of the pieces are still missing from early-access Hadapt 1.0, and some may even slip out of the Hadapt 1.0 GA version. As Dan tells it, there are five kinds of joins in Hadapt:
- Co-partitioned join. Both tables being joined happen to be partitioned on the join key. Happy happy joy joy. The tables are joined locally on each node, with the results aggregated via MapReduce.
- Directed join. One of the tables being joined happens to be partitioned on the join key. MapReduce distributes the other table along the join key, joins happen locally, and MapReduce does the rest.
- Broadcast join. One of the tables is broadcast in its entirety to every node. Joins then happen locally, and MapReduce does the rest.
- Split semijoin. One of the tables is projected to the join key and a row ID, and then distributed via MapReduce. Joins then happen locally. Later on, the joined rows are completed with the help of a second projection on the first table. MapReduce does the rest.
- Distributed/parallel hash join. Sometimes, Hadapt indeed joins just as Hadoop/Hive would.
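The choice among those five plans is driven by partitioning metadata and table sizes. As a rough sketch of the decision logic only — the function name, metadata shape, and threshold below are my own illustration, not Hadapt's actual planner:

```python
def choose_join_strategy(join_key, left, right, broadcast_threshold=10_000):
    """Pick a join plan from table metadata.

    `left` and `right` are dicts with 'partition_key' (the column the
    table is partitioned on, or None) and 'rows' (estimated row count).
    Illustrative only; thresholds and names are not Hadapt's.
    """
    if left["partition_key"] == join_key and right["partition_key"] == join_key:
        # Both tables already live on the join key: join locally per node.
        return "co-partitioned join"
    if join_key in (left["partition_key"], right["partition_key"]):
        # Only the other table needs redistribution along the join key.
        return "directed join"
    if min(left["rows"], right["rows"]) <= broadcast_threshold:
        # Cheap enough to ship the small table in its entirety to every node.
        return "broadcast join"
    # A split semijoin (distribute just join key + row ID first) could be
    # chosen here; otherwise fall back to a fully distributed hash join.
    return "distributed hash join"
```

For example, joining two large tables both partitioned on the key yields the happy co-partitioned case, while two large tables partitioned on unrelated columns fall through to the distributed hash join.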
Highlights of Hadapt’s performance story include:
- Dan contends that using a DBMS rather than HDFS (Hadoop Distributed File System) for I/O always gives a performance advantage.
- DBMS local-node join performance can be presumed to be superior as well.
- Of course, Dan also thinks that using a columnar DBMS would extend Hadapt’s performance advantage further, but most of the specifics of what Hadapt has told me about why they don’t routinely use a columnar DBMS yet are NDA.
- Even beta Hadapt/PostgreSQL outperforms Hadoop/Hive by almost 10X at Hadapt’s relatively small number of beta customer sites.
Comments
6 Responses to “Hadapt is moving forward”
Interesting article on Hadapt.
Reading about the 5 types of joins you explained drew uncanny parallels for me to the join mechanisms on Teradata, mainly the Merge and Hash joins.
-> Co-partitioned join is akin to TD’s Merge join on PI = PI.
-> Directed join is akin to TD’s Merge join on PI = non-indexed column, where it redistributes the 2nd table based on the join key.
-> Broadcast join is akin to the Merge join where the smaller table gets duplicated across all AMPs.
-> Split semijoin and distributed/parallel hash join sound similar to the hash join and hash joins on the fly.
Add MapReduce on top of the TD DBMS and voila – what am I missing?
I don’t think Hadapt claims to have invented much in the way of fundamentally revolutionary database algorithms.
Rather, they’re trying to take what they regard as the best of database theory (more precisely some of the best), apply it in a somewhat new way, and solve whatever engineering challenges then ensue.
[…] Hadapt is a new startup based on some very interesting work at Yale in the area of advanced database technology (parallel databases, column stores). Early access to its flagship product, the Hadapt Adaptive Analytic Platform, was just announced the other day. See the good articles at dbms2 on the latest Hadapt news: Hadapt happenings and Hadapt is moving forward. […]
[…] in some cases on a cluster shared with another data management system. (E.g. DataStax/Cassandra, Hadapt/PostgreSQL, or IBM Netezza.) Anyhow, requiring a dedicated cluster isn’t a […]
[…] Greenplum, Aster Data nCluster, Netezza, etc. 6. Innovation in PostgreSQL continues, for example http://www.dbms2.com/2011/11/08/hadapt-is-moving-forward/ 7. There is no guarantee that Oracle will keep investing heavily in the MySQL project, especially since Oracle […]