SQL-Hadoop architectures compared
The genesis of this post is:
- Dave DeWitt sent me a paper about Microsoft Polybase.
- I argued with Dave about the differences between Polybase and Hadapt.
- I asked Daniel Abadi for his opinion.
- Dan agreed with Dave, in a long email …
- … that he graciously permitted me to lightly-edit and post.
I love my life.
Per Daniel (emphasis mine): Read more
Categories: Aster Data, Data warehousing, Greenplum, Hadapt, Hadoop, MapReduce, Microsoft and SQL*Server, SQL/Hadoop integration, Theory and architecture | 13 Comments |
DBMS development and other subjects
The cardinal rules of DBMS development
Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1.
In particular:
- Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
- Mixed workload management is harder than you’re assuming it is.
- Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
DBMS with Hadoop underpinnings …
… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact. Read more
Greenplum HAWQ
My former friends at Greenplum no longer talk to me, so in particular I wasn’t briefed on Pivotal HD and Greenplum HAWQ. Pivotal HD seems to be yet another Hadoop distribution, with the idea that you use Greenplum’s management tools. Greenplum HAWQ seems to be Greenplum tied to HDFS.
The basic idea seems to be much like what I mentioned a few days ago — the low-level file store for Greenplum can now be another familiar one, namely HDFS (the Hadoop Distributed File System, which is also an option for, say, NuoDB). Beyond that, two interesting quotes in a Greenplum blog post are:
When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine.
and
In addition, it has native support for HBase, supporting HBase predicate pushdown, hive[sic] connectivity, and offering a ton of intelligent features to retrieve HBase data.
The first sounds like the invisible loading that Daniel Abadi wrote about last September on Hadapt’s blog. (Edit: Actually, see Daniel’s comment below.) The second sounds like a good idea that, again, would also be a natural direction for vendors such as Hadapt.
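To make the first quote concrete, here is a minimal, purely illustrative Python sketch of query-time loading in the spirit of invisible loading: on first touch, data is pulled out of HDFS into a faster local engine, so later queries on the same table skip the HDFS read. This is not HAWQ's or Hadapt's actual mechanism; the WebHDFS endpoint, file path, and schema are invented, the third-party `hdfs` client is an assumption, and sqlite3 merely stands in for the execution engine.

```python
# Illustrative only: query-time ("invisible") loading out of HDFS into a
# faster local engine. Host, path, and schema are hypothetical; sqlite3
# stands in for the real execution engine.
import csv
import io
import sqlite3

from hdfs import InsecureClient  # third-party WebHDFS client (pip install hdfs)

hdfs_client = InsecureClient("http://namenode.example.com:50070")  # hypothetical namenode
engine = sqlite3.connect(":memory:")                               # stand-in execution engine
_loaded = set()                                                    # tables already pulled in


def query(table, sql, hdfs_path):
    """Run SQL against the local engine, loading the table from HDFS on first use."""
    if table not in _loaded:
        with hdfs_client.read(hdfs_path, encoding="utf-8") as reader:
            rows = list(csv.reader(io.StringIO(reader.read())))
        engine.execute(f"CREATE TABLE {table} (user_id TEXT, amount REAL)")
        engine.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
        _loaded.add(table)  # subsequent queries on this table skip the HDFS read
    return engine.execute(sql).fetchall()


# The first call loads out of HDFS; the second hits only the local engine.
print(query("sales", "SELECT user_id, SUM(amount) FROM sales GROUP BY user_id", "/data/sales.csv"))
print(query("sales", "SELECT COUNT(*) FROM sales", "/data/sales.csv"))
```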
Categories: EMC, Greenplum, Hadapt, Hadoop, HBase, SQL/Hadoop integration | 14 Comments |
Should you offer “complete” analytic applications?
WibiData is essentially on the trajectory:
- Started with platform-ish technology.
- Selling analytic application subsystems, focused for now on personalization.
- Hopeful of selling complete analytic applications in the future.
The same, it turns out, is true of Causata.* Talking with them both the same day led me to write this post. Read more
Categories: Hadapt, HBase, Market share and customer counts, PivotLink, Predictive modeling and advanced analytics, WibiData | 5 Comments |
Quick notes on Impala
Edit: There is now a follow-up post on Cloudera Impala with substantially more detail.
In my world it’s possible to have a hasty 2-hour conversation, and that’s exactly what I had with Cloudera last week. We touched on hardware and general adoption, but much of the conversation was about Cloudera Impala, announced today. Like Hive, Impala turns Hadoop into a basic analytic RDBMS, with similar SQL/Hadoop integration benefits to those of Hadapt. In particular:
- Impala is Hive-compatible in query language (HQL, which is a whole lot like SQL), metadata, JDBC/ODBC drivers, etc.; a small connection sketch follows this list.
- Unlike Hive, Impala does not work through Hadoop MapReduce.
- Unlike Hadoop MapReduce and hence Hive, Impala does not persist intermediate results to disk. This is good for performance, but on extremely long-running queries it increases the risk that a node failure will force the query to restart from scratch.
- Impala in its first version is missing some Hive syntax, notably in support for UDFs (User-Defined Functions).
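As an illustration of that Hive compatibility, here is a minimal sketch of querying Impala from Python, assuming the third-party impyla package and a reachable Impala daemon; the host, port, table, and columns are all made up.

```python
# Minimal sketch, not an official example: the query below is ordinary
# Hive-compatible SQL, and the connection goes over Impala's HiveServer2-style
# interface. Host, port, and table are hypothetical.
from impala.dbapi import connect  # third-party package (pip install impyla)

conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()

cur.execute("""
    SELECT page, COUNT(*) AS hits
    FROM   web_logs
    WHERE  log_date >= '2012-10-01'
    GROUP  BY page
    ORDER  BY hits DESC
    LIMIT  10
""")

for page, hits in cur.fetchall():
    print(page, hits)

cur.close()
conn.close()
```

The same query should run unchanged against Hive, which is the point of the compatibility claim; what would not carry over in Impala's first version is Hive's UDF support.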
Beyond that: Read more
Categories: Cloudera, Columnar database management, Database compression, Hadapt, Hadoop, MapReduce, Open source, SQL/Hadoop integration | 6 Comments |
Hadoop/RDBMS integration: Aster SQL-H and Hadapt
Two of the more interesting approaches for integrating Hadoop and MapReduce with relational DBMS come from my clients at Teradata Aster (via SQL/MR and SQL-H) and Hadapt. In both cases, the story starts:
- You can dump any kind of data you want into Hadoop’s file system.
- You can have data in a scale-out RDBMS to get good performance on analytic SQL.
- You can access all the data (not just the relationally stored part) via SQL.
- You can do MapReduce on all the data (not just the Hadoop-stored part).
To varying degrees, Hadapt and Aster each offer three kinds of advantage over Hadoop-with-Hive:
- SQL performance is (much) better.
- SQL functionality is better.
- At least some of your employees — the “business analysts” — can invoke MapReduce processes through SQL, if somebody else (e.g. your techies or the vendor’s) coded them up in the first place; a sketch of what that looks like follows this list.
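Here, from the analyst's side, is a rough sketch of invoking a prepackaged MapReduce process through SQL in the Aster SQL/MR style. The ODBC DSN, table, columns, and the exact sessionize() argument names are illustrative assumptions, not Aster's documented signature; the table function itself is the piece "somebody else coded up in the first place."

```python
# Rough illustration of the SQL/MR calling pattern: a SELECT whose FROM clause
# is a table function that the engine runs as a parallel MapReduce-style job.
# The DSN, table, and sessionize() arguments are assumptions for illustration.
import pyodbc  # generic ODBC; an Aster DSN would have to be configured separately

conn = pyodbc.connect("DSN=aster_dev")  # hypothetical data source name
cur = conn.cursor()

cur.execute("""
    SELECT user_id, session_id, COUNT(*) AS clicks_in_session
    FROM   sessionize(
               ON clickstream
               PARTITION BY user_id
               ORDER BY click_time
               TIMECOLUMN('click_time')
               TIMEOUT(1800)
           )
    GROUP  BY user_id, session_id
""")

for row in cur.fetchall():
    print(row)
```

From the analyst's point of view it is just SQL; the MapReduce is hidden inside the table function.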
Of course, there are plenty of differences. Those start: Read more
Categories: Aster Data, Hadapt, Hadoop, Pricing, SQL/Hadoop integration, Teradata | 5 Comments |
The Teradata Aster Big Analytics Aster/Hadoop appliance
My clients at Teradata are introducing a mix-em/match-em Aster/Hadoop box, officially called the Teradata Aster Big Analytics Appliance. Basics include:
- You can fill a rack with nodes either for the Aster DBMS or for Hadoop (Hortonworks flavor), or you can combine them in the same box.
- If you combine them, they share management software (adapted from mainstream Teradata’s) and an InfiniBand interconnect.
- An Aster node has 16 cores running at 2.6 GHz and 24 disk drives of 900 GB each.
- A Hadoop node has 12 cores running at 2.0 GHz and 12 disk drives of 3 TB each.
- A central part of Teradata’s strategy is that Aster and Hadoop nodes can work together via SQL-H.
- The Teradata Aster Big Analytics Appliance is based on a family of Dell servers that fit more compactly into racks than do Teradata’s traditional products.
- The Teradata Aster Big Analytics Appliance replaces a previous interim Teradata Aster appliance that used similar hardware to that in other Teradata systems.
My views on the Teradata Aster Big Analytics Appliance start: Read more
Categories: Aster Data, Hadapt, Pricing, SQL/Hadoop integration, Teradata | 3 Comments |
Hadapt Version 2
My clients at Hadapt are coming out with a Version 2 to be available in Q1 2013, and perhaps slipstreaming some of the features before then. At that point, it will be reasonable to regard Hadapt as offering:
- A very tight integration between an RDBMS-based analytic platform and Hadoop …
- … that is decidedly immature as an analytic RDBMS …
- … but which strongly improves the SQL capabilities of Hadoop (vs., say, the alternative of using Hive).
Solr is in the mix as well.
Hadapt+Hadoop is positioned much more as “better than Hadoop” than as “a better scale-out RDBMS,” and rightly so, due to its limitations when viewed strictly from an analytic RDBMS standpoint. I.e., Hadapt is meant for enterprises that want to do several of the following (a rough end-to-end sketch follows the list):
- Dump multi-structured data into Hadoop.
- Refine or just move some of it into an RDBMS.
- Bring in data from other RDBMS.
- Process all of the above via Hadoop MapReduce.
- Process all of the above via SQL.
- Use full-text indexes on the data.
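Here is a conceptual sketch of that workflow, under heavy assumptions: this is not Hadapt's actual API. The WebHDFS endpoint, Solr core, and schemas are invented; sqlite3 stands in for the analytic RDBMS; the third-party hdfs and pysolr packages are assumed; and the MapReduce step is omitted.

```python
# Conceptual sketch only: raw JSON events land in HDFS, a refined slice moves
# into a relational store, SQL runs over the relational side, and Solr covers
# the full-text piece. Hosts, paths, and schemas are invented.
import json
import sqlite3

import pysolr                    # pip install pysolr
from hdfs import InsecureClient  # pip install hdfs

hdfs_client = InsecureClient("http://namenode.example.com:50070")  # hypothetical
rdbms = sqlite3.connect(":memory:")                                # stand-in analytic RDBMS
solr = pysolr.Solr("http://solr.example.com:8983/solr/tickets")    # hypothetical Solr core

# Dump multi-structured data into Hadoop (newline-delimited JSON).
hdfs_client.write(
    "/raw/tickets.json",
    data='{"id": 1, "customer": "acme", "body": "login fails after upgrade"}\n',
    overwrite=True,
)

# Refine part of it into the relational store, and full-text index the text.
with hdfs_client.read("/raw/tickets.json", encoding="utf-8") as reader:
    docs = [json.loads(line) for line in reader.read().splitlines() if line.strip()]

rdbms.execute("CREATE TABLE tickets (id INTEGER, customer TEXT)")
rdbms.executemany("INSERT INTO tickets VALUES (?, ?)",
                  [(d["id"], d["customer"]) for d in docs])
solr.add([{"id": d["id"], "body": d["body"]} for d in docs], commit=True)

# SQL over the refined relational slice ...
print(rdbms.execute("SELECT customer, COUNT(*) FROM tickets GROUP BY customer").fetchall())

# ... and full-text search over the indexed text.
print([hit["id"] for hit in solr.search("body:login")])
```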
Hadapt has 6 or so production customers, a dozen or so more coming online soon, 35 or so employees (mainly in Cambridge and Poland), reasonable amounts of venture capital, and the involvement of a variety of industry luminaries. Hadapt’s biggest installation seems to have tens of terabytes of relational data and hundreds of terabytes of multi-structured data; Hadapt is very confident in its ability to scale an order of magnitude beyond that with the Version 2 product, and reasonably confident it could go even further.
At the highest level, Hadapt works like this: Read more
Our clients, and where they are located
From time to time, I disclose our vendor client lists. Another iteration is below, the first since a little over a year ago. To be clear:
- This is a list of Monash Advantage members.
- All our vendor clients are Monash Advantage members, unless …
- … we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
- We do not usually disclose our user clients.
- We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
- Excluded from this round of disclosure is one vendor I have never written about.
- Included in this round of disclosure is one client paying for services partly in stock. All our other clients are cash-only.
For reasons explained below, I’ll group the clients geographically. Obviously, companies often have multiple locations, but this is approximately how it works from the standpoint of their interactions with me. Read more
Hadapt is moving forward
I’ve talked with my clients at Hadapt a couple of times recently. News highlights include:
- The Hadapt 1.0 product is going “Early Access” today.
- General availability of Hadapt 1.0 is targeted for an officially unspecified time frame, but it’s soon.
- Hadapt raised a nice round of venture capital.
- Hadapt added Sharmila Mulligan to the board.
- Dave Kellogg is in the picture too, albeit not as involved as Sharmila.
- Hadapt has moved to Cambridge, which is preferable to the Yale environs for obvious reasons. (Its first location was space borrowed from its investors at Bessemer.)
- Headcount is in the low teens, with a target of doubling fast.
The Hadapt product story hasn’t changed significantly from what it was before. Specific points I can add include: Read more