Parallelization

Analysis of issues in parallel computing, especially parallelized database management. Related subjects include:

April 1, 2013

Some notes on new-era data management, March 31, 2013

Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.

Performance confusion

Discussions of DBMS performance are always odd, for starters because:

But in NoSQL/NewSQL short-request processing, performance claims seem particularly confused. Reasons include but are not limited to:

MongoDB and 10gen

I caught up with Ron Avnur at 10gen. Technical highlights included:

March 24, 2013

Appliances, clusters and clouds

I believe:

I shall explain.

Arguments for hosting applications on some kind of cluster include:

Arguments specific to the public cloud include:

That’s all pretty compelling. However, these are not persuasive reasons to put everything on a SINGLE cluster or cloud. They could as easily lead you to have your VMware cluster and your Exadata rack and your Hadoop cluster and your NoSQL cluster and your object storage OpenStack cluster — among others — all while participating in several different public clouds as well.

Why would you not move work into a cluster at all? First, if it ain’t broken, you might not want to fix it. Some of the cluster options make it easy for you to consolidate existing workloads (that’s a central goal of VMware and Exadata), but others only make sense to adopt in connection with new application projects. Second, you might just want device locality. I have a gaming-class PC next to my desk; it drives a couple of monitors; I like that arrangement. Away from home I carry a laptop computer instead. Arguments can be made for small remote-office servers as well.


March 11, 2013

Hadoop execution enhancements

Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that:

Arun further told me about Tez, the next-generation Hadoop processing engine he’s working on, which he also discussed in a recent blog post:

With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG [Directed Acyclic Graph] of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others.  The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.

This is similar to the approach of BDAS Spark:

Rather than being restricted to Maps and Reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order.

although Tez won’t match Spark’s richer list of primitive operations.
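To make the "more or less in any order" point concrete, here is a minimal sketch in plain Python rather than Spark itself. The helper names `join` and `group_by` and the sample data are assumptions made for illustration; they are not Spark's API. The sketch chains a join, a map, a group-by, and a reduce in a single pipeline, which is the kind of sequence that strict MapReduce would tend to split across more than one job.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical helpers for illustration only; these are not Spark calls.
def join(left, right):
    """Inner join of two lists of (key, value) pairs on the key."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]

def group_by(pairs):
    """Group a list of (key, value) pairs into {key: [values]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# Made-up sample data.
orders = [("alice", 30), ("bob", 20), ("alice", 5)]
regions = [("alice", "east"), ("bob", "west")]

# join -> map -> group-by -> reduce, all in one pipeline.
joined = join(orders, regions)
by_region = list(map(lambda kv: (kv[1][1], kv[1][0]), joined))
grouped = group_by(by_region)
totals = {
    region: reduce(lambda a, b: a + b, amounts)
    for region, amounts in grouped.items()
}
```

Here `totals` holds the summed order amounts per region. The point is only that the operations compose freely; nothing about distribution or fault tolerance is modeled.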

More specifically, there will be six primitive Tez operations:

A Map step would compound HDFS input, output sorting, and output shuffling; a Reduce step compounds — you guessed it! — input sorting, input shuffling, and HDFS output.
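The compounding described above can be sketched as follows. This is an illustration under stated assumptions, not Tez code: the primitive names are hypothetical labels taken from the sentence above, and the shape of `single_dag` (read once, shuffle between stages, write once) is one assumed plan, not something Arun specified. The sketch counts how often each approach touches HDFS when a query needs several shuffle stages.

```python
# Hypothetical labels for the primitives named in the text; not the Tez API.
MAP = ["hdfs_input", "output_sort", "output_shuffle"]
REDUCE = ["input_sort", "input_shuffle", "hdfs_output"]

def chained_mapreduce(n_jobs):
    """Primitive sequence when a query runs as n_jobs back-to-back MapReduce jobs."""
    return (MAP + REDUCE) * n_jobs

def single_dag(n_stages):
    """Assumed single-job plan: read once, shuffle between stages, write once."""
    plan = ["hdfs_input"]
    for _ in range(n_stages):
        plan += ["output_sort", "output_shuffle", "input_sort", "input_shuffle"]
    plan.append("hdfs_output")
    return plan

def hdfs_touches(plan):
    """Count the steps that read from or write to HDFS."""
    return sum(step.startswith("hdfs_") for step in plan)

chained_cost = hdfs_touches(chained_mapreduce(3))  # every job reads and writes HDFS
dag_cost = hdfs_touches(single_dag(3))             # one read plus one write in total
```

With three stages, the chained plan touches HDFS six times and the assumed single-DAG plan twice, which is the intermediate-materialization overhead the quoted blog post complains about.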

I can’t think of much in the way of algorithms that would be logically impossible in MapReduce yet possible in Tez. Rather, the main point of Tez seems to be performance, performance consistency, response-time consistency, and all that good stuff. Specific advantages that Arun and I talked about included:

February 6, 2013

Key questions when selecting an analytic RDBMS

I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:

Let’s drill down.

January 16, 2013

NuoDB marketing mishegas

I must start by apologizing for giving a quote in a press release whose contents I deplore. Unlike occasions on which I’ve posted about inaccurate quotes, in this case the fault is mine. The quote is quite accurate. And NuoDB didn’t mislead me about the release’s contents; I just neglected to ask.

NuoDB evidently subscribes to the marketing fallacy:

But to my taste, NuoDB’s worst travesty is not the deafening drumroll before launch (I asked off their mailing list months before), nor the claim that NuoDB’s launch would be a “big day” for the database industry (annoying but ordinary hype), nor the emergent flock of birds foofarah, nor even NuoDB’s overwrought benchmark marketing (distressingly many vendors do that).

Rather, I think NuoDB’s greatest marketing offense to date is its Codd-imitating “12 rules” for cloud database management.

January 12, 2013

Introduction to NuoDB

NuoDB has an interesting NewSQL story. NuoDB’s core design goals seem to be:


January 7, 2013

Introduction to GenieDB

GenieDB is one of the newer and smaller NewSQL companies. GenieDB’s story is focused on wide-area replication and uptime, coupled to claims about ease and the associated low TCO (Total Cost of Ownership).

GenieDB belongs to the same family of my clients as Cirro.

The GenieDB product is more interesting if we conflate the existing GenieDB Version 1 and a soon-forthcoming (mid-year or so) Version 2. On that basis:

The heart of the GenieDB story is probably wide-area replication. Specifics there include:

January 5, 2013

NewSQL thoughts

I plan to write about several NewSQL vendors soon, but first here’s an overview post. Like “NoSQL”, the term “NewSQL” has an identifiable, recent coiner — Matt Aslett in 2011 — yet a somewhat fluid meaning. Wikipedia suggests that NewSQL comprises three things:

I think that’s a pretty good working definition, and will likely remain one unless or until:

To date, NewSQL adoption has been limited.

That said, the problem may lie more on the supply side than in demand. Developing a competitive SQL DBMS turns out to be harder than developing something in the NoSQL state of the art.


January 5, 2013

Data(base) virtualization — a terminological mess

Data/database virtualization seems to be a hot subject right now, and vendors of a broad variety of different technologies are all claiming to be in the space. A terminological mess has ensued, as Monash’s First and Third Laws of Commercial Semantics are borne out in spades.

If something is like “virtualization”, then it should resemble hypervisors such as VMware. To me:

Anything that claims to be “like virtualization” should be viewed in that light.

December 13, 2012

Spark, Shark, and RDDs — technology notes

Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:

The key concept here seems to be the RDD. Any one RDD:

Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:

