December 13, 2012

Introduction to Spark, Shark, BDAS and AMPLab

UC Berkeley’s AMPLab is working on a software stack that:

The whole thing has $30 million in projected funding (half government, half industry) and a 6-year plan (which they’re 2 years into).

Specific projects of note in all that include:

Read more

December 12, 2012

Some trends that will continue in 2013

I’m usually annoyed by lists of year-end predictions. Still, a reporter asked me for some, and I found one kind I was comfortable making.

Trends that I think will continue in 2013 include:

Growing attention to machine-generated data. Human-generated data grows at the rate business activity does, plus 0-25%. Machine-generated data grows at the rate of Moore’s Law, also plus 0-25%, which is a much higher total. In particular, the use of remote machine-generated data is becoming increasingly real.

Hadoop adoption. Everybody has the big bit bucket use case, largely because of machine-generated data. Even today’s technology is plenty good enough for that purpose, and hence justifies initial Hadoop adoption. Development of further Hadoop technology, which I post about frequently, is rapid. And so the Hadoop trend is very real.

Application SaaS. The on-premises application software industry has hopeless problems with product complexity and rigidity. Any suite new enough to cut the Gordian Knot is or will be SaaS (Software as a Service).

Newer BI interfaces. Advanced visualization — e.g. Tableau or QlikView — and mobile BI are both hot. So, more speculatively, are “social” BI (Business Intelligence) interfaces.

Price discounts. If you buy software at 50% of list price, you’re probably doing it wrong. Even 25% can be too high.

MySQL alternatives.  NoSQL and NewSQL products often are developed as MySQL alternatives. Oracle has actually done a good job on MySQL technology, but now its business practices are scaring companies away from MySQL commitments, and newer short-request SQL DBMS are ready for use.

Read more

December 9, 2012

Amazon Redshift and its implications

Merv Adrian and Doug Henschen both reported more details about Amazon Redshift than I intend to; see also the comments on Doug’s article. I did talk with Rick Glick of ParAccel a bit about the project, and he noted:

“We didn’t want to do the deal on those terms” comments from other companies suggest ParAccel’s main financial take from the deal is an already-reported venture investment.

The cloud-related engineering was mainly around communications, e.g. strengthening error detection/correction to make up for the lack of dedicated switches. In general, Rick seemed more positive on running in the (Amazon) cloud than analytic RDBMS vendors have been in the past.

So who should and will use Amazon Redshift? For starters, I’d say: Read more

December 9, 2012

ParAccel update

In connection with Amazon’s Redshift announcement, ParAccel reached out, and so I talked with them for the first time in a long while. At the highest level:

There wasn’t time for a lot of technical detail, but I gather that the bit about working alongside other data stores:

Also, it seems that ParAccel:

Read more

December 2, 2012

Are column stores really better at compression?

A consensus has evolved that:

Still somewhat controversial is the claim that:

A strong plausibility argument for the latter point is that new in-memory analytic data stores tend to be columnar — think HANA or Platfora; compression is commonly cited as a big reason for the choice. (Another reason is that I/O bandwidth matters even when the I/O is from RAM, and there are further reasons yet.)

One group that made the in-memory columnar choice is the Spark/Shark guys at UC Berkeley’s AMP Lab. So when I talked with them Thursday (more on that another time, but it sounds like cool stuff), I took some time to ask why columnar stores are better at compression. In essence, they gave two reasons — simplicity, and speed of decompression.

In each case, the main supporting argument seemed to be that finding the values in a column is easier when they’re all together in a column store. Read more

November 29, 2012

Notes on Microsoft SQL Server

I’ve been known to gripe that covering big companies such as Microsoft is hard. Still, Doug Leland of Microsoft’s SQL Server team checked in for phone calls in August and again today, and I think I got enough to be worth writing about, albeit at a survey level only,

Subjects I’ll mention include:

One topic I can’t yet comment about is MOLAP/ROLAP, which is a pity; if anybody can refute my claim that ROLAP trumps MOLAP, it’s either Microsoft or Oracle.

Microsoft’s slides mentioned Yahoo refining a 6 petabyte Hadoop cluster into a 24 terabyte SQL Server “cube”, which was surprising in light of Yahoo’s history as an Oracle reference.

Read more

November 19, 2012

Couchbase 2.0

My clients at Couchbase checked in.

The big changes in Couchbase 2.0 versus the previous (1.8.x) version are:

Couchbase 2.0 is upwards-compatible with prior versions of Couchbase (and hence with Memcached), but not with CouchDB.

Technology notes on Couchbase 2.0 include: Read more

November 19, 2012

Incremental MapReduce

My clients at Cloudant, Couchbase, and 10gen/MongoDB (Edit: See Alex Popescu’s comment below) all boast the feature incremental MapReduce. (And they’re not the only ones.) So I feel like making a quick post about it. For starters, I’ll quote myself about Cloudant:

The essence of Cloudant’s incremental MapReduce seems to be that data is selected only if it’s been updated since the last run. Obviously, this only works for MapReduce algorithms whose eventual output can be run on different subsets of the target data set, then aggregated in a simple way.

These implementations of incremental MapReduce are hacked together by teams vastly smaller than those working on Hadoop, and surely fall short of Hadoop in many areas such as performance, fault-tolerance, and language support. That’s a given. Still, if the jobs are short and simple, those deficiencies may be tolerable.

A StackOverflow thread about MongoDB’s version of incremental MapReduce highlights some of the implementation challenges.

But all practicality aside, let’s return to the point that incremental MapReduce only works for some kinds of MapReduce-based algorithms, and consider how much of a limitation that really is. Looking at the Map steps sheds a little light: Read more

November 9, 2012

Analytic application subsystems

Imagine a website whose purpose is to encourage consumers to take actions — for example to click on an ad, click on the next page, or actually make a purchase. Best practices for such a site include:

Those predictive models themselves will keep changing, because:

In that situation, what would it mean to offer the website owner a predictive modeling “application”? Read more

November 5, 2012

Do you need an analytic RDBMS?

I can think of seven major reasons not to use an analytic RDBMS. One is good; but the other six seem pretty questionable, niche circumstances excepted, especially at this time.

The good reason to not have an analytic RDBMS is that most organizations can run perfectly well on some combination of:

Those enterprises, however, are generally not who I write for or about.

The six bad reasons to not have an analytic RDBMS all take the form “Can’t some other technology do the job better?”, namely:

Read more

← Previous PageNext Page →

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.