June 6, 2013

Dave DeWitt responds to Daniel Abadi

A few days ago I posted Daniel Abadi’s thoughts in a discussion of Hadapt, Microsoft PDW (Parallel Data Warehouse)/PolyBase, Pivotal/Greenplum Hawq, and other SQL-Hadoop combinations. This is Dave DeWitt’s response. Emphasis mine.

Categories: Benchmarks and POCs, Cloudera, Clustering, Data warehousing, Greenplum, Hadapt, Hadoop, MapReduce, Microsoft and SQL*Server, PostgreSQL, SQL/Hadoop integration

May 29, 2013

Syncsort extends Hadoop MapReduce

My client Syncsort:

Is an ETL (Extract/Transform/Load) vendor, whose flagship product DMExpress was evidently renamed to DMX.
Has a strong history in and fondness for sort.
Has announced a new ETL product, DMX-h ETL Edition, which uses Hadoop MapReduce to parallelize DMX by controlling a copy of DMX that resides on every data node of the Hadoop cluster.*
Has also announced the closely-related DMX-h Sort Edition, offering acceleration for the sorts inherent in Map and Reduce steps.
Contributed a patch to Apache Hadoop to open up Hadoop MapReduce to make all this possible.

*Perhaps we should question Syncsort’s previous claims of having strong multi-node parallelism already. 🙂

The essence of the Syncsort DMX-h ETL Edition story is:

DMX-h inherits the various ETL-suite trappings of DMX.
Syncsort claims DMX-h has major performance advantages vs., for example, Hive- or Pig-based alternatives.
With a copy of DMX on every node, DMX-h can do parallel load/export.

More details can be found in a slide deck Syncsort graciously allowed me to post.

Categories: Cloudera, Clustering, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, Syncsort

April 25, 2013

Analytic application themes

I talk with a lot of companies, and repeatedly hear some of the same application themes. This post is my attempt to collect some of those ideas in one place.

1. So far, the buzzword of the year is “real-time analytics”, generally with “operational” or “big data” included as well. I hear variants of that positioning from NewSQL vendors (e.g. MemSQL), NoSQL vendors (e.g. Aerospike), BI stack vendors (e.g. Platfora), application-stack vendors (e.g. WibiData), log analysis vendors (led by Splunk), data management vendors (e.g. Cloudera), and of course the CEP industry.

Yeah, yeah, I know — not all the named companies are in exactly the right market category. But that’s hard to avoid.

Why this gold rush? On the demand side, there’s a real or imagined need for speed. On the supply side, I’d say:

There are vast numbers of companies offering data-management-related technology. They need ways to differentiate.
Doing analytics at short-request speeds is an obvious data-management-related challenge, and not yet comprehensively addressed.

2. More generally, most of the applications I hear about are analytic, or have a strong analytic aspect. The three biggest areas — and these overlap — are:

Customer interaction
Network and sensor monitoring
Game and mobile application back-ends

Also arising fairly frequently are:

Algorithmic trading
Anti-fraud
Risk measurement
Law enforcement/national security
Healthcare
Stakeholder-facing analytics

I’m hearing less about quality, defect tracking, and equipment maintenance than I used to, but those application areas have been ebbing and flowing for decades anyway.

Categories: Aerospike, Application areas, Business intelligence, Cloudera, Games and virtual worlds, GIS and geospatial, Health care, Investment research and trading, Log analysis, MemSQL, Platfora, Predictive modeling and advanced analytics, Telecommunications, Web analytics, WibiData

March 18, 2013

DBMS development and other subjects

The cardinal rules of DBMS development

Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.

That’s if things go extremely well.

Rule 2: You aren’t an exception to Rule 1.

In particular:

Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
Mixed workload management is harder than you’re assuming it is.
Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.

DBMS with Hadoop underpinnings …

… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact.

Categories: Aster Data, Cloudera, Columnar database management, Database compression, Hadapt, Hadoop, Hortonworks, IBM and DB2, MarkLogic, Netezza, NoSQL, QlikTech and QlikView, SQL/Hadoop integration, Structured documents, Sybase, Tableau Software, Teradata

March 18, 2013

Dataset management

I coined a new term, dataset management, for my clients at Revelytix, which they indeed adopted to describe what they do. It would also apply to the recently released Cloudera Navigator. To a first approximation, you may think of dataset management as either or both:

Metadata management in a structured-file context.
Lineage/provenance, auditing, and similar stuff.

Why not just say “metadata management”? First, the Revelytix guys have long been in variants of that business, and they’re tired of the responses they get when they use the term. 🙂 Second, “metadata” could apply either to data about the file or to data about the data structures in the file or perhaps to data about data in the file, making “metadata” an even more confusing term in this context than in others.
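
To make the three senses of “metadata” above a little more concrete, here is a minimal sketch that keeps each one in its own field and adds lineage alongside. Everything in it is an assumption for illustration: the names `DatasetRecord`, `file_info`, `schema`, `content_stats`, `derived_from`, and `lineage` are hypothetical and are not taken from Revelytix Loom or Cloudera Navigator.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry for one dataset (illustration only)."""
    name: str
    file_info: dict = field(default_factory=dict)      # data about the file
    schema: dict = field(default_factory=dict)         # data about the data structures in the file
    content_stats: dict = field(default_factory=dict)  # data about the data in the file
    derived_from: list = field(default_factory=list)   # lineage: names of parent datasets


def lineage(name, catalog):
    """Return the ancestor dataset names of `name`, nearest first."""
    seen = []
    queue = list(catalog[name].derived_from)
    while queue:
        parent = queue.pop(0)
        if parent in seen:
            continue  # skip parents already visited (shared ancestors, cycles)
        seen.append(parent)
        if parent in catalog:
            queue.extend(catalog[parent].derived_from)
    return seen


if __name__ == "__main__":
    # Made-up example datasets.
    catalog = {
        "raw_logs": DatasetRecord("raw_logs", file_info={"format": "csv"}),
        "sessions": DatasetRecord("sessions", derived_from=["raw_logs"]),
        "daily_report": DatasetRecord("daily_report", derived_from=["sessions"]),
    }
    print(lineage("daily_report", catalog))
```

Keeping the three kinds of metadata in separate fields is just one way to avoid the ambiguity described above; a real product would presumably track far more.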

My idea for the term dataset is to connote more grandeur than would be implied by the term “table”, but less than one might assume for a whole “database”. I.e.:

A dataset contains all the information about something. This makes it a bigger deal than a mere table, which could be meaningless outside the context of a database.
But the totality of information in a “dataset” could be less comprehensive than what we’d expect in a whole “database”.

As for the specific products, both of which you might want to check out:

Cloudera Navigator:
- Is one product from a leading Hadoop company.
- Assumes you use Cloudera’s flavor of Hadoop.
- Is generally available.
- Starts with auditing (lineage coming soon).
Revelytix Loom:
- Is the main product of a small metadata management company.
- Is distro-agnostic.
- Is in beta.
- Already does lineage.

Categories: Cloudera, Hadoop

February 27, 2013

Hadoop distributions

Elephants! Elephants!
One elephant went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.

Elephants! Elephants!
Two elephants went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.

Elephants! Elephants!
Three elephants went out to play
Etc.

— Popular children’s song

It’s Strata week, with much Hadoop news, some of which I’ve been briefed on and some of which I haven’t. Rather than delve into fine competitive details, let’s step back and consider some generalities. First, about Hadoop distributions and distro providers:

Conceptually, the starting point for a “Hadoop distribution” is some version of Apache Hadoop.
- Hortonworks is still focused on Hadoop 1 (without YARN and so on), because that’s what’s regarded as production-ready. But Hortonworks does like HCatalog.
- Cloudera straddles Hadoop 1 and Hadoop 2, shipping aspects of Hadoop 2 but not recommending them for production use.
- Some of the newer distros seem to be based on Hadoop 2, if the markitecture slides are to be believed.
Optionally, the version numbers of different parts of Hadoop in a distribution could be a little mismatched, if the distro provider takes responsibility for testing them together.
- Cloudera seems more willing to do that than Hortonworks.
Different distro providers may choose different sets of Apache Hadoop subprojects to include.
- Cloudera seems particularly expansive in what it is apt to include. Perhaps not coincidentally, Cloudera folks started various Hadoop subprojects.
Optionally, distro providers’ additional proprietary code can be included, to be used either in addition to or instead of Apache Hadoop code. (In the latter case, marketing can then ensue about whether this is REALLY a Hadoop distribution.)
- Hortonworks markets from a “more open source than thou” stance, even though:
  - It is not a purist in that regard.
  - That marketing message is often communicated by Hortonworks’ very closed-source partners.
- Several distro providers, notably Cloudera, offer management suites as a big part of their proprietary value-add. Hortonworks, however, is focused on making open-source Ambari into a competitive management tool.
- Performance is another big area for proprietary code, especially from vendors who look at HDFS (Hadoop Distributed File System) and believe they can improve on it.
- I conjecture packaging/installation code is often proprietary, but that’s a minor issue that doesn’t get mentioned much.
Optionally, third parties’ code can be provided, open or closed source as the case may be.

Most of the same observations could apply to Hadoop appliance vendors.

Categories: Cloudera, Data warehouse appliances, EMC, Greenplum, Hadoop, Hortonworks, IBM and DB2, Intel, MapR, Market share and customer counts

February 17, 2013

Notes and links, February 17, 2013

1. It boggles my mind that some database technology companies still don’t view compression as a major issue. Compression directly affects storage and bandwidth usage alike — for all kinds of storage (potentially including RAM) and for all kinds of bandwidth (network, I/O, and potentially on-server).

Trading off less-than-maximal compression so as to minimize CPU impact can make sense. Having no compression at all, however, is an admission of defeat.
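
The CPU-versus-size trade-off is easy to see in miniature. The sketch below is only an illustration under stated assumptions: it uses Python’s standard `zlib` module as a stand-in for whatever codec a DBMS might actually use, the sample data is made up, and the helper name `compression_tradeoff` is hypothetical. Sizes and timings will vary by data and machine.

```python
import time
import zlib


def compression_tradeoff(data: bytes, levels=(1, 6, 9)):
    """Return {level: (compressed_size_in_bytes, seconds)} for each zlib level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Round-trip check: compression must be lossless at every level.
        assert zlib.decompress(compressed) == data
        results[level] = (len(compressed), elapsed)
    return results


if __name__ == "__main__":
    # Hypothetical sample: repetitive, log-like rows (illustration only).
    sample = b"".join(
        b"2013-02-17,node%03d,status=OK,latency_ms=%d\n" % (i % 50, i % 97)
        for i in range(20000)
    )
    print(f"raw size: {len(sample)} bytes")
    for level, (size, seconds) in compression_tradeoff(sample).items():
        print(f"level {level}: {size} bytes, {seconds * 1000:.1f} ms")
```

Lower levels typically spend less CPU for a somewhat larger result, which is the “less-than-maximal compression” choice; skipping compression entirely leaves the raw size on disk and on the wire.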

2. People tend to misjudge Hadoop’s development pace in either of two directions. An overly expansive view is to note that some people working on Hadoop are trying to make it all things to all people, and to somehow imagine those goals will soon be achieved. An overly narrow view is to note an important missing feature in Hadoop, and think there’s a big business to be made out of offering it alone.

At this point, I’d guess that Cloudera and Hortonworks have 500ish employees combined, many of whom are engineers. That allows for a low double-digit number of 5+ person engineering teams, along with a number of smaller projects. The most urgently needed features are indeed being built. On the other hand, a complete monument to computing will not soon emerge.

3. Schooner’s acquisition by SanDisk has led to the discontinuation of Schooner’s SQL DBMS SchoonerSQL. Schooner’s flash-optimized key-value store Membrain continues. I don’t have details, but the Membrain web page suggests both data store and cache use cases.

4. There’s considerable personnel movement at Boston-area database technology companies right now. Please ping me directly if you care.

Categories: Amazon and its cloud, Cloud computing, Cloudera, Database compression, EMC, Exadata, Hadoop, Hortonworks, Market share and customer counts, Open source, Oracle, Schooner Information Technology, Software as a Service (SaaS), Theory and architecture

November 1, 2012

Notes and comments — October 31, 2012

Time for another catch-all post. First and saddest: one of the earliest great commenters on this blog, and a beloved figure in the Boston-area database community, was Dan Weinreb, whom I had known since some Symbolics briefings in the early 1980s. He passed away recently, much much much too young. Looking back for a couple of examples, even if you’ve never heard of him before, I see that Dan’s 2009 comment on Tokutek is still interesting today, and so is a post on his own blog disagreeing with some of my choices in terminology.

Otherwise, in no particular order:

1. Chris Bird is learning MongoDB. As is common for Chris, his comments are both amusing and enlightening.

2. When I relayed Cloudera’s comments on Hadoop adoption, I left out a couple of categories. One Cloudera called “mobile”; when I probed, that was about HBase, with an example being messaging apps.

The other was “phone home” — i.e., the ingest of machine-generated data from a lot of different devices. This is something that’s obviously been coming for several years — but I’m increasingly getting the sense that it’s actually arrived.

Categories: Cloudera, Data integration and middleware, Hadoop, HBase, Informatica, Metamarkets and Druid, MongoDB, NoSQL, Open source, Telecommunications

October 24, 2012

Quick notes on Impala

Edit: There is now a follow-up post on Cloudera Impala with substantially more detail.

In my world it’s possible to have a hasty 2-hour conversation, and that’s exactly what I had with Cloudera last week. We touched on hardware and general adoption, but much of the conversation was about Cloudera Impala, announced today. Like Hive, Impala turns Hadoop into a basic analytic RDBMS, with similar SQL/Hadoop integration benefits to those of Hadapt. In particular:

Impala is Hive-compatible in query language (HQL, which is a whole lot like SQL), metadata, JDBC/ODBC drivers, etc.
Unlike Hive, Impala does not work through Hadoop MapReduce.
Unlike Hadoop MapReduce and hence Hive, Impala does not persist intermediate results to disk. This is good for performance, but on extremely long-running queries it increases the risk you’ll have a node failure and have to restart the query from scratch.
Impala in its first version is missing some Hive syntax, notably in support for UDFs (User-Defined Functions).

Categories: Cloudera, Columnar database management, Database compression, Hadapt, Hadoop, MapReduce, Open source, SQL/Hadoop integration


