Hadoop execution enhancements
Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict MapReduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that:
- Yahoo is a big YARN user.
- There are other — paying — YARN users.
- YARN general availability is now targeted for well before the end of 2013.
Arun further told me about Tez, the next-generation Hadoop processing engine he’s working on, which he also discussed in a recent blog post:
With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG [Directed Acyclic Graph] of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.
This is similar to the approach of BDAS Spark:
Rather than being restricted to Maps and Reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order.
although Tez won’t match Spark’s richer list of primitive operations.
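To make that contrast concrete, here’s a minimal sketch of Spark-style chaining. The package names come from later Apache Spark releases, and the input paths are made up; the point is just that map, sample, group-by, join, and reduce compose in an order MapReduce couldn’t express as one job:

```scala
// A hedged sketch of freely chained Spark primitives; paths are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

object SparkPrimitivesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local"))

    // map -> sample -> group-by, with no reduce required in between:
    val eventsByUser = sc.textFile("hdfs:///tmp/events")
      .map { line => val f = line.split('\t'); (f(0), f(1)) }  // map
      .sample(withReplacement = false, fraction = 0.1)         // sample
      .groupByKey()                                            // group-by

    val users = sc.textFile("hdfs:///tmp/users")
      .map { line => val f = line.split('\t'); (f(0), f(1)) }

    // join -> map -> reduce, all still within one job:
    val total = eventsByUser.join(users)                       // join
      .map { case (_, (events, _)) => events.size }
      .reduce(_ + _)                                           // reduce

    println(s"Sampled events for known users: $total")
    sc.stop()
  }
}
```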
More specifically, there will be six primitive Tez operations:
- HDFS (Hadoop Distributed File System) input and output.
- Sorting on input and output (I’m not sure why that’s two operations rather than one).
- Shuffling of input and output (ditto).
A Map step would compound HDFS input, output sorting, and output shuffling; a Reduce step would compound — you guessed it! — input sorting, input shuffling, and HDFS output.
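To spell out that compounding, here’s a toy model (emphatically not the actual Tez API, just the six primitives treated as values) of how chained MapReduce jobs force an intermediate HDFS materialization that a single Tez job can skip:

```scala
// A conceptual sketch of the six Tez primitives and the classic steps they compound.
sealed trait TezOp
case object HdfsInput     extends TezOp
case object HdfsOutput    extends TezOp
case object SortInput     extends TezOp
case object SortOutput    extends TezOp
case object ShuffleInput  extends TezOp
case object ShuffleOutput extends TezOp

object TezPrimitives {
  // A Map step = HDFS input + output sorting + output shuffling.
  val mapStep: List[TezOp] = List(HdfsInput, SortOutput, ShuffleOutput)
  // A Reduce step = input sorting + input shuffling + HDFS output.
  val reduceStep: List[TezOp] = List(SortInput, ShuffleInput, HdfsOutput)

  // Two chained MapReduce jobs must materialize intermediate results to HDFS...
  val twoMapReduceJobs: List[TezOp] =
    mapStep ::: reduceStep ::: mapStep ::: reduceStep

  // ...while a single Tez job can drop the intermediate HdfsOutput/HdfsInput pair.
  val singleTezJob: List[TezOp] =
    mapStep ::: reduceStep.dropRight(1) ::: mapStep.drop(1) ::: reduceStep
}
```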
I can’t think of much in the way of algorithms that would be logically impossible in MapReduce yet possible in Tez. Rather, the main point of Tez seems to be performance, performance consistency, response-time consistency, and all that good stuff. Specific advantages that Arun and I talked about included:
- The requirement to materialize (onto disk) intermediate results you don’t want is gone. (Yay!)
- Hadoop jobs will step on each other’s toes less. Instead of Maps and Reduces from unrelated jobs getting interleaved, all the operations from a single job will by default be executed in one chunk. (Even so, I see no reason to expect early releases of Tez to do a great job on highly concurrent mixed workload management.)
- Added granularity brings opportunities for additional performance enhancements, for example in the area of sorting. (Arun loves sorts.)
Categories: Databricks, Spark and BDAS, Hadoop, Hortonworks, MapReduce, Workload management, Yahoo
Open source strategies
From time to time I advise a software vendor on how, whether, or to what extent it should offer its technology in open source. In summary, I believe:
- The formal differences between “open source” and “closed source” strategies are of secondary importance.
- The attitudinal and emotional differences between “open source” and “closed source” approaches can be large.
- A pure closed source strategy can make sense.
- A closed source strategy with important open source aspects can make sense.
- A pure open source strategy will only rarely win.
Here’s why.
An “open source software” business model and strategy might include:
- Software given away for free.
- Demand generation to encourage people to use the free version of the software.
- Subscription pricing for additional proprietary software and support.
- Direct sales, and further marketing, to encourage users of the free stuff to upgrade to a paid version.
A “closed source software” business model and strategy might include:
- Demand generation.
- Free-download versions of the software.
- Subscription pricing for software (increasingly common) and support (always).
- Direct sales, and associated marketing.
Those look pretty similar to me.
Of course, there can still be differences between open and closed source. In particular: Read more
Categories: Hadoop, Hortonworks, memcached, MongoDB, Open source
Hadoop distributions
Elephants! Elephants!
One elephant went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.
Elephants! Elephants!
Two elephants went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.
Elephants! Elephants!
Three elephants went out to play
Etc.
— Popular children’s song
It’s Strata week, with much Hadoop news, some of which I’ve been briefed on and some of which I haven’t. Rather than delve into fine competitive details, let’s step back and consider some generalities. First, about Hadoop distributions and distro providers:
- Conceptually, the starting point for a “Hadoop distribution” is some version of Apache Hadoop.
- Hortonworks is still focused on Hadoop 1 (without YARN and so on), because that’s what’s regarded as production-ready. But Hortonworks does like HCatalog.
- Cloudera straddles Hadoop 1 and Hadoop 2, shipping aspects of Hadoop 2 but not recommending them for production use.
- Some of the newer distros seem to be based on Hadoop 2, if the markitecture slides are to be believed.
- Optionally, the version numbers of different parts of Hadoop in a distribution could be a little mismatched, if the distro provider takes responsibility for testing them together.
- Cloudera seems more willing to do that than Hortonworks.
- Different distro providers may choose different sets of Apache Hadoop subprojects to include.
- Cloudera seems particularly expansive in what it is apt to include. Perhaps not coincidentally, Cloudera folks started various Hadoop subprojects.
- Optionally, distro providers’ additional proprietary code can be included, to be used either in addition to or instead of Apache Hadoop code. (In the latter case, marketing can then ensue about whether this is REALLY a Hadoop distribution.)
- Hortonworks markets from a “more open source than thou” stance, even though:
- It is not a purist in that regard.
- That marketing message is often communicated by Hortonworks’ very closed-source partners.
- Several distro providers, notably Cloudera, offer management suites as a big part of their proprietary value-add. Hortonworks, however, is focused on making open-source Ambari into a competitive management tool.
- Performance is another big area for proprietary code, especially from vendors who look at HDFS (Hadoop Distributed File System) and believe they can improve on it.
- I conjecture packaging/installation code is often proprietary, but that’s a minor issue that doesn’t get mentioned much.
- Optionally, third parties’ code can be provided, open or closed source as the case may be.
Most of the same observations could apply to Hadoop appliance vendors.
Categories: Cloudera, Data warehouse appliances, EMC, Greenplum, Hadoop, Hortonworks, IBM and DB2, Intel, MapR, Market share and customer counts
Greenplum HAWQ
My former friends at Greenplum no longer talk to me, so in particular I wasn’t briefed on Pivotal HD and Greenplum HAWQ. Pivotal HD seems to be yet another Hadoop distribution, with the idea that you use Greenplum’s management tools. Greenplum HAWQ seems to be Greenplum tied to HDFS.
The basic idea seems to be much like what I mentioned a few days ago — the low-level file store for Greenplum can now be something else one has heard of before, namely HDFS (Hadoop Distributed File System, which is also an option for, say, NuoDB). Beyond that, two interesting quotes in a Greenplum blog post are:
When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine.
and
In addition, it has native support for HBase, supporting HBase predicate pushdown, hive[sic] connectivity, and offering a ton of intelligent features to retrieve HBase data.
The first sounds like the invisible loading that Daniel Abadi wrote about last September on Hadapt’s blog. (Edit: Actually, see Daniel’s comment below.) The second sounds like a good idea that, again, would also be a natural direction for vendors such as Hadapt.
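For what it’s worth, a hypothetical sketch of the invisible-loading idea (all names here are made up) amounts to load-on-first-query, with reuse thereafter:

```scala
// A made-up illustration of "invisible loading": the engine pulls a table out
// of slow bulk storage (e.g. HDFS) the first time a query touches it, and
// serves later queries from its own faster store.
import scala.collection.mutable

class InvisiblyLoadingEngine(readFromHdfs: String => Seq[String]) {
  private val localStore = mutable.Map.empty[String, Seq[String]]

  def query(table: String): Seq[String] =
    localStore.getOrElseUpdate(table, readFromHdfs(table)) // load once, reuse after
}
```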
Categories: EMC, Greenplum, Hadapt, Hadoop, HBase, SQL/Hadoop integration
Should you offer “complete” analytic applications?
WibiData is essentially on the trajectory:
- Started with platform-ish technology.
- Selling analytic application subsystems, focused for now on personalization.
- Hopeful of selling complete analytic applications in the future.
The same, it turns out, is true of Causata.* Talking with them both the same day led me to write this post. Read more
Categories: Hadapt, HBase, Market share and customer counts, PivotLink, Predictive modeling and advanced analytics, WibiData
One database to rule them all?
Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times. 🙂
Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:
- Every level of storage (disk, RAM, etc.).
- Indexes, aggregates and raw data alike.
To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to: Read more
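To illustrate the simple-vs-fancy end of that tradeoff with a toy example, not modeled on any particular DBMS, compare a row layout’s cheap appends with a column layout’s cheap single-attribute scans:

```scala
import scala.collection.mutable.ArrayBuffer

object LayoutTradeoff {
  case class Row(id: Int, name: String, amount: Double)

  // Row layout: one append per write -- simple and fast to write.
  val rowStore = ArrayBuffer[Row]()
  def writeRow(r: Row): Unit = rowStore += r

  // Column layout: every write touches one buffer per column...
  val ids     = ArrayBuffer[Int]()
  val names   = ArrayBuffer[String]()
  val amounts = ArrayBuffer[Double]()
  def writeColumnar(r: Row): Unit = {
    ids += r.id; names += r.name; amounts += r.amount
  }

  // ...but a single-column aggregate reads only the data it needs.
  def sumAmountsRowStore: Double    = rowStore.iterator.map(_.amount).sum // drags whole rows
  def sumAmountsColumnStore: Double = amounts.sum                          // one column only
}
```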
Notes and links, February 17, 2013
1. It boggles my mind that some database technology companies still don’t view compression as a major issue. Compression directly affects storage and bandwidth usage alike — for all kinds of storage (potentially including RAM) and for all kinds of bandwidth (network, I/O, and potentially on-server).
Trading off less-than-maximal compression so as to minimize CPU impact can make sense. Having no compression at all, however, is an admission of defeat.
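As a toy example of that tradeoff, run-length encoding is about as CPU-cheap as compression gets, yet it beats no compression handily on sorted or low-cardinality columns. A minimal sketch:

```scala
object Rle {
  // Collapse consecutive repeats into (value, runLength) pairs.
  def encode[A](xs: List[A]): List[(A, Int)] =
    xs.foldRight(List.empty[(A, Int)]) {
      case (x, (y, n) :: rest) if x == y => (y, n + 1) :: rest
      case (x, acc)                      => (x, 1) :: acc
    }

  def decode[A](runs: List[(A, Int)]): List[A] =
    runs.flatMap { case (x, n) => List.fill(n)(x) }
}

// E.g. Rle.encode(List("US", "US", "US", "UK", "UK")) == List(("US", 3), ("UK", 2))
```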
2. People tend to misjudge Hadoop’s development pace in either of two directions. An overly expansive view is to note that some people working on Hadoop are trying to make it be all things to all people, and to somehow imagine those goals will soon be achieved. An overly narrow view is to note an important missing feature in Hadoop, and think there’s a big business to be made out of offering it alone.
At this point, I’d guess that Cloudera and Hortonworks have 500ish employees combined, many of whom are engineers. That allows for a low double-digit number of 5+ person engineering teams, along with a number of smaller projects. The most urgently needed features are indeed being built. On the other hand, a complete monument to computing will not soon emerge.
3. Schooner’s acquisition by SanDisk has led to the discontinuation of Schooner’s SQL DBMS SchoonerSQL. Schooner’s flash-optimized key-value store Membrain continues. I don’t have details, but the Membrain web page suggests both data store and cache use cases.
4. There’s considerable personnel movement at Boston-area database technology companies right now. Please ping me directly if you care.
It’s hard to make data easy to analyze
It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.
Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:
- “We get data into a form in which it can be analyzed.” This is the story behind, among others:
- Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
- Many things that purport to be “analytic applications” or data warehouse “quick starts”.
- “Data reduction” use cases in event processing.*
- Text analytics tools.
- Splunk.
- “Forget all that transformation foofarah — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
- Relational DBMS, according to their inventor E. F. Codd.
- MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
- Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
- Newer kinds of analytic DBMS that are faster than older kinds.
- The “data mart spin-out” feature of certain analytic DBMS.
- In-memory analytic data stores.
- Hadoop.
- NoSQL DBMS that have a few analytic features.
- TokuDB, similarly.
- Electronic spreadsheets, from VisiCalc to Datameer.
- Splunk.
- “Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
- The business intelligence industry.
- The predictive analytics industry.
- Algorithmic trading use cases in complex event processing.*
- Some analytic applications.
- Splunk.
*Complex event/stream processing terminology is always problematic.
My thoughts on all this start: Read more
Key questions when selecting an analytic RDBMS
I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:
- How big is your database? How big is your budget?
- How do you feel about appliances?
- How do you feel about the cloud?
- What are the size and shape of your workload?
- How fresh does the data need to be?
Let’s drill down. Read more
Comments on Gartner’s 2012 Magic Quadrant for Data Warehouse Database Management Systems — evaluations
To my taste, the most glaring mis-rankings in the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management Systems are that it is too positive on Kognitio and too negative on Infobright. Secondarily, it is too negative on HP Vertica, and too positive on ParAccel and Actian/VectorWise. So let’s consider those vendors first.
Gartner seems confused about Kognitio’s products and history alike.
- Gartner calls Kognitio an “in-memory” DBMS, which is not accurate.
- Gartner doesn’t remark on Kognitio’s worst-in-class* compression.
- Gartner gives Kognitio oddly high marks for a late, me-too Hadoop integration strategy.
- Gartner writes as if Kognitio’s next attempt at the US market will be the first one, which is not the case.
- Gartner says that Kognitio pioneered data warehouse SaaS (Software as a Service), which actually has existed since the pre-relational 1970s.
Gartner is correct, however, to note that Kognitio doesn’t sell much stuff overall.
* non-existent
In the cases of HP Vertica, Infobright, ParAccel, and Actian/VectorWise, the facts in the 2012 Gartner Magic Quadrant for Data Warehouse Database Management Systems are fairly accurate, but I dispute Gartner’s evaluation. When it comes to Vertica: Read more