The questionably named Cloudera Navigator Optimizer
I have only mixed success in getting my clients to reach out to me for messaging advice when they’re introducing something new. Cloudera Navigator Optimizer, which is being announced along with Cloudera 5.5, is one of my failures in that respect; I heard about it for the first time Tuesday afternoon. I hate the name. I hate some of the slides I saw. But I do like one part of the messaging, namely the statement that this is about “refactoring” queries.
All messaging quibbles aside, I think the Cloudera Navigator Optimizer story is actually pretty interesting, and perhaps not just to users of SQL-on-Hadoop technologies such as Hive (which I guess I’d put in that category for simplicity) or Impala. As I understand Cloudera Navigator Optimizer:
- It’s all about analytic SQL queries.
- Specifically, it’s about reducing duplicated work.
- It is not an “optimizer” in the ordinary RDBMS sense of the word.
- It’s delivered via SaaS (Software as a Service).
- Conceptually, it’s not really tied to SQL-on-Hadoop. However, …
- … in practice it likely will be used by customers who want to optimize performance of Cloudera’s preferred styles of SQL-on-Hadoop, either because they’re already using SQL-on-Hadoop or in connection with an initial migration.
It grows out of Xplain.io, which started with the intention of being a general workload optimizer for Hadoop and wound up with this beta announcement of a tuning adviser for analytic SQL.
Right now, the Cloudera Navigator Optimizer service is:
- Query code in.
- Information and advice out.
Naturally, Cloudera’s intention — perhaps as early as at first general availability — is for the output to start including something that’s more like automation, e.g. hints for the Impala optimizer.
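To make the "query code in, advice out" shape concrete, here is a toy sketch in Python. It is emphatically not how Cloudera Navigator Optimizer works internally; the regular expression, the function names `join_signatures` and `advise`, and the `min_queries` threshold are all hypothetical, and a real tool would parse SQL properly rather than pattern-match it. The sketch only shows the general idea of scanning a query log for joins that recur across queries.

```python
import re
from collections import Counter

# Hypothetical toy pattern: matches "JOIN <table> ON <x.col> = <y.col>".
# It ignores table aliases, multi-column joins, and subqueries.
JOIN_RE = re.compile(r"\bjoin\s+(\w+)\s+on\s+([\w.]+)\s*=\s*([\w.]+)", re.IGNORECASE)


def join_signatures(query):
    """Return the set of normalized join signatures found in one query."""
    signatures = set()
    for table, left, right in JOIN_RE.findall(query):
        # Sort the two sides so "a.x = b.y" and "b.y = a.x" count as the same join.
        first, second = sorted((left.lower(), right.lower()))
        signatures.add(f"{table.lower()} on {first} = {second}")
    return signatures


def advise(queries, min_queries=2):
    """List joins that appear in at least `min_queries` distinct queries."""
    counts = Counter()
    for query in queries:
        counts.update(join_signatures(query))
    return [(sig, n) for sig, n in counts.most_common() if n >= min_queries]


if __name__ == "__main__":
    query_log = [
        "SELECT orders.id, customers.region FROM orders "
        "JOIN customers ON orders.cust_id = customers.id",
        "select sum(orders.amount) from orders "
        "join customers on customers.id = orders.cust_id group by customers.region",
        "SELECT * FROM orders JOIN products ON orders.product_id = products.id",
    ]
    for signature, n in advise(query_log):
        print(f"{n} queries repeat the join: {signature}")
```

The output is advice only: a list of candidate joins that might be worth consolidating. Whether to act on it is left to a person, or eventually to something more automated.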
As Anupam Singh describes it, there are basically four kinds of problems that Cloudera Navigator Optimizer can help with:
- ETL (Extract/Transform/Load) might repeat the same operation over and over again, e.g. joining to a reference table to help with data cleaning. It can be an optimization to consolidate some of that work. (The same would surely also be true in cases where the workload is more properly described as ELT.)
- For business intelligence it is often helpful to materialize aggregates or result sets. (This is, of course, why materialized views were invented in the first place.)
- Queries-from-hell, perhaps thousands of lines of SQL long, can sometimes be usefully rewritten into a sequence of much shorter queries.
- Ad-hoc query workloads can have enough repetition that there’s opportunity for similar optimizations. Anupam thinks his technology has enough intelligence to detect some of these patterns.
Actually, all four of these cases can involve materializing tables so that they don’t need to keep being recreated, in whole or in part.
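As a rough illustration of what materializing a table means in the ETL case above, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`raw_events`, `country_ref`, `events_clean`) are made up for the example, and SQLite merely stands in for a SQL-on-Hadoop engine; Hive and Impala both support a similar `CREATE TABLE ... AS SELECT` statement, though the details differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical raw data plus a reference table used for cleaning.
    CREATE TABLE raw_events (event_id INTEGER, country_code TEXT, amount REAL);
    CREATE TABLE country_ref (country_code TEXT, country_name TEXT);
    INSERT INTO raw_events VALUES (1, 'us', 10.0), (2, 'de', 5.0), (3, 'us', 2.5);
    INSERT INTO country_ref VALUES ('us', 'United States'), ('de', 'Germany');

    -- Materialize the cleaning join once instead of repeating it in every query.
    CREATE TABLE events_clean AS
    SELECT e.event_id, r.country_name, e.amount
    FROM raw_events e
    JOIN country_ref r ON e.country_code = r.country_code;
""")

# Downstream queries read the materialized table and skip the join entirely.
totals = conn.execute(
    "SELECT country_name, SUM(amount) FROM events_clean "
    "GROUP BY country_name ORDER BY country_name"
).fetchall()
counts = conn.execute(
    "SELECT country_name, COUNT(*) FROM events_clean "
    "GROUP BY country_name ORDER BY country_name"
).fetchall()

print(totals)
print(counts)
```

The trade-off is the usual one for materialization: the join is paid for once, but `events_clean` has to be refreshed whenever the underlying tables change.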
In essence, then, this is a way to add in more query pipelining than the underlying data store automagically provides on its own. And that seems to me like a very good idea to try. The whole thing might be worth trying out at least once, even if your analytic RDBMS installation has nothing to do with Hadoop at all.
Comments
4 Responses to “The questionably named Cloudera Navigator Optimizer”
It sounds like materialized views in Oracle terms (or indexed views in MS SQL terms). Is the comparison valid?
I probably misunderstood to some extent. It is much more, but technically the benefits described above are provided in the “ideal” world of RDBMS by materialized views.
It’s just an adviser about materialization, so I don’t think the comparison is good at all.
Obviously I didn’t explain well enough. 🙁
[…] Curt Monash’s blog about Xplain.io’s (a company I helped jumpstart and also served as VP of Product Management/marketing until we sold it to Cloudera) technology after it was released as “Cloudera Navigator Optimizer” (DBMS2). […]