November 29, 2012

Notes on Microsoft SQL Server

I’ve been known to gripe that covering big companies such as Microsoft is hard. Still, Doug Leland of Microsoft’s SQL Server team checked in for phone calls in August and again today, and I think I got enough to be worth writing about, albeit at a survey level only.

Subjects I’ll mention include:

Hadoop
Parallel Data Warehouse
PolyBase
Columnar data management
In-memory data management (Hekaton)

One topic I can’t yet comment about is MOLAP/ROLAP, which is a pity; if anybody can refute my claim that ROLAP trumps MOLAP, it’s either Microsoft or Oracle.

Microsoft’s slides mentioned Yahoo refining a 6 petabyte Hadoop cluster into a 24 terabyte SQL Server “cube”, which was surprising in light of Yahoo’s history as an Oracle reference.

Categories: Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Hadoop, Hortonworks, In-memory DBMS, MapReduce, Market share and customer counts, Microsoft and SQL*Server, Oracle, Yahoo

November 19, 2012

Couchbase 2.0

My clients at Couchbase checked in.

After multiple delays, Couchbase 2.0 is well into beta, with general availability being delayed by the holiday season as much as anything else.
Couchbase (the company) now has >350 subscription customers, almost all for Couchbase (the product) — which is to say for what was known as Membase, which is basically a persistent version of Memcached.
There also are many users of open source Couchbase, most famously LinkedIn.
Orbitz is a much-mentioned flagship paying Couchbase customer.
Couchbase customers mainly seem to be replacing a caching layer, Memcached or otherwise.
Couchbase headcount is just under 100.

The big changes in Couchbase 2.0 versus the previous (1.8.x) version are:

JSON storage, including secondary indexes.
Multi-data-center replication.
A back-end change from SQLite to a heavily forked version of CouchDB, called Couchstore.

Couchbase 2.0 is upwards-compatible with prior versions of Couchbase (and hence with Memcached), but not with CouchDB.

Technology notes on Couchbase 2.0 include: Read more

Categories: Basho and Riak, Cache, Cassandra, Clustering, Couchbase, MapReduce, Market share and customer counts, MongoDB, NoSQL, Open source, Structured documents

October 18, 2012

Notes on Hadoop adoption and trends

With Strata/Hadoop World being next week, there is much Hadoop discussion. One theme of the season is BI over Hadoop. I have at least 5 clients claiming they’re uniquely positioned to support that (most of whom partner with a 6th client, Tableau); the first 2 whose offerings I’ve actually written about are Teradata Aster and Hadapt. More generally, I’m hearing “Using Hadoop is hard; we’re here to make it easier for you.”

If enterprises aren’t yet happily running business intelligence against Hadoop, what are they doing with it instead? I took the opportunity to ask Cloudera, whose answers didn’t contradict anything I’m hearing elsewhere. As Cloudera tells it (approximately — this part of the conversation* was rushed): Read more

Categories: Business intelligence, Cloudera, EAI, EII, ETL, ELT, ETLT, Hadoop, HBase, Health care, Investment research and trading, MapR, Market share and customer counts, Telecommunications, Web analytics

October 16, 2012

Hadapt Version 2

My clients at Hadapt are coming out with a Version 2 to be available in Q1 2013, and perhaps slipstreaming some of the features before then. At that point, it will be reasonable to regard Hadapt as offering:

A very tight integration between an RDBMS-based analytic platform and Hadoop …
… that is decidedly immature as an analytic RDBMS …
… but which strongly improves the SQL capabilities of Hadoop (vs., say, the alternative of using Hive).

Solr is in the mix as well.

Hadapt+Hadoop is positioned much more as “better than Hadoop” than as “a better scale-out RDBMS”, and rightly so, due to its limitations when viewed strictly from an analytic RDBMS standpoint. I.e., Hadapt is meant for enterprises that want to do several of:

Dump multi-structured data into Hadoop.
Refine or just move some of it into an RDBMS.
Bring in data from other RDBMS.
Process all of the above via Hadoop MapReduce.
Process all of the above via SQL.
Use full-text indexes on the data.

Hadapt has 6 or so production customers, a dozen or so more coming online soon, 35 or so employees (mainly in Cambridge or Poland), reasonable amounts of venture capital, and the involvement of a variety of industry luminaries. Hadapt’s biggest installation seems to have 10s of terabytes of relational data and 100s of TBs of multi-structured; Hadapt is very confident in its ability to scale an order of magnitude beyond that with the Version 2 product, and reasonably confident it could go even further.

At the highest level, Hadapt works like this: Read more

Categories: Analytic technologies, Cloudera, Columnar database management, Data models and architecture, Data warehousing, Hadapt, Hadoop, MapR, MapReduce, Market share and customer counts, SQL/Hadoop integration, Text

August 27, 2012

Aerospike, the former Citrusleaf

My new clients at Aerospike have a range of minor news to announce:

A company and product name change (they used to be Citrusleaf).
Some new people and funding.
In association with an acqui-hire — of AlchemyDB guy Russ Sullivan — some unspecified future technical plans.
A community edition (Aerospike, née Citrusleaf, is closed-source).

Mainly, however, they want to call your attention to the fact that they’ve been selling a fast, reliable key-value store, with a number of production references, and want to suggest that other organizations should perhaps buy it as well.

Generally, the Aerospike product story is as I described in two posts last year. At the highest level:

Aerospike has a key-value data model.
Secondary indexes and so on are still futures.
Aerospike is clustered, of course.
Two hardware/storage choices are encouraged:
- Spinning disk, but you keep all your data in RAM.
- Solid-state disk.

Aerospike’s three core marketing claims are performance, consistent performance, and uninterrupted operations.

Aerospike’s performance claims are supported by a variety of blazing internal benchmarks.
Aerospike’s consistent performance claims are along the lines of sub-millisecond latency, with 99.9% of responses being within 5 milliseconds, and even a node outage only borking performance for some 10s of milliseconds.
Uninterrupted operation is a core Aerospike design goal, and the company says that to date, no Aerospike production cluster has ever gone down.
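
To make the "99.9% of responses within 5 milliseconds" style of claim concrete, here is a minimal sketch of how one might check it against a set of measured latencies. The function names and the sample numbers are made up for illustration; nothing here comes from Aerospike's benchmarks or tooling.

```python
def fraction_within(latencies_ms, threshold_ms):
    """Return the fraction of responses that finished within threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms)


def meets_claim(latencies_ms, threshold_ms=5.0, required_fraction=0.999):
    """True if at least required_fraction of responses are within threshold_ms."""
    return fraction_within(latencies_ms, threshold_ms) >= required_fraction


# Hypothetical sample: 9,990 sub-millisecond responses and 10 slow ones.
sample = [0.8] * 9990 + [12.0] * 10
print(fraction_within(sample, 5.0))
print(meets_claim(sample))
```

A check like this says nothing about how bad the slow tail is, which is why the node-outage figure is stated as a separate claim.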

Aerospike technical details start with the expected: Read more

Categories: Aerospike, Market share and customer counts, Memory-centric data management, NoSQL, Pricing

July 24, 2012

Notes on Datameer

In a short October, 2011 post about Datameer, I wrote:

Datameer is designed to let you do simple stuff on large amounts of data, where “large amounts of data” typically means data in Hadoop, and “simple stuff” includes basic versions of a spreadsheet, of BI, and of EtL (Extract/Transform/Load, without much in the way of T).

That’s all still mainly true, although with the recent Datameer 2.0:

You can run Datameer and the underlying Hadoop on a desktop or in a workgroup.
There are some infographics pretty-picture-drawing capabilities, which will surely delight those who like vector-based HTML 5 pictures of coffee cups, saucers and macaroons.
No doubt Datameer has been generally enhanced on multiple fronts.

In essence, Datameer has two positionings.

One is “OK, you’ve got Hadoop — now wouldn’t you like to do something useful with it?” That can include both business intelligence and ETL.
Beyond that, Datameer founder/CEO Stefan Groschupf’s core argument is that schema-on-read is really, really useful, even at the cost of absorbing a potentially large performance hit. In other words, he’s making a case for a form of non-relational BI.
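
For readers who want schema-on-read spelled out, here is a minimal sketch under my own assumptions: the records, field names, and `read_with_schema` helper are hypothetical and have nothing to do with how Datameer is actually implemented. Raw records are stored untouched, and a schema is applied only when a query runs, which is also where the performance hit comes from, since every read re-parses and re-coerces the data.

```python
import json

# Raw records are stored as-is; no schema is enforced at write time.
raw_lines = [
    '{"user": "a", "ms": 120, "page": "/home"}',
    '{"user": "b", "ms": "95"}',       # number stored as a string
    '{"user": "c", "page": "/cart"}',  # "ms" is missing entirely
]


def read_with_schema(lines, schema):
    """Parse each raw line and coerce it to `schema` at query time.

    `schema` maps a field name to a (cast, default) pair.
    """
    for line in lines:
        record = json.loads(line)
        row = {}
        for field_name, (cast, default) in schema.items():
            value = record.get(field_name, default)
            row[field_name] = cast(value) if value is not None else None
        yield row


# The schema is chosen by the reader, and can differ from query to query.
schema = {"user": (str, None), "ms": (int, None)}
rows = list(read_with_schema(raw_lines, schema))
total_ms = sum(row["ms"] for row in rows if row["ms"] is not None)
print(rows)
print(total_ms)
```

A different query could read the same raw lines with a schema that includes `page` and ignores `ms`, with no reload or migration step in between.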

Categories: Business intelligence, Data models and architecture, Datameer, EAI, EII, ETL, ELT, ETLT, Hadoop, Log analysis, Market share and customer counts, Web analytics

July 18, 2012

Clustrix 4.0 and other Clustrix stuff

It feels like time to write about Clustrix, which I last covered in detail in May, 2010, and which is releasing Clustrix 4.0 today. Clustrix and Clustrix 4.0 basics include:

Clustrix makes a short-request processing appliance.
As you might guess from the name, Clustrix is clustered — peer-to-peer, with no head node.
The Clustrix appliance uses flash/solid-state storage.
Traditionally, Clustrix has run a MySQL-compatible DBMS.
Clustrix 4.0 introduces JSON support. More on that below.
Clustrix 4.0 introduces a bunch of administrative features, and parallel backup.
Also in today’s announcement is a Rackspace partnership to offer Clustrix remotely, at monthly pricing.
Clustrix has been shipping product for about 4 years.
Clustrix has 20 customers in production, running >125 Clustrix nodes total.
Clustrix has 60 people.
List price for a (smallest size) Clustrix system is $150K for 3 nodes. Highest-end maintenance costs 15%.
There’s also a $100K version meant for high availability/disaster recovery. Over half of Clustrix’s customers use off-site disaster recovery.
Clustrix is raising a C round. Part of it has already been raised from insiders, as a kind of bridge.

The biggest Clustrix installation seems to be 20 nodes or so. Others seem to have 10+. I presume those disaster recovery customers have 6 or more nodes each. I’m not quite sure how the arithmetic on that all works; perhaps the 125ish count of nodes is a bit low.
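
The arithmetic can be sketched as a lower bound. The figures taken from above are 20 customers, a 3-node minimum system, roughly 20 nodes at the biggest site, 10+ nodes at some others, and 6+ nodes for disaster recovery customers. Everything else is an assumption made for illustration: "over half" is read as exactly 11 customers, the larger sites are all counted among the disaster recovery customers (which keeps the bound as low as possible), and the number of 10+ node customers, which is not stated anywhere, is left as a parameter.

```python
def min_total_nodes(customers=20, dr_customers=11, large_customers=2,
                    biggest=20, large_size=10, dr_size=6, base_size=3):
    """Lower bound on total nodes across all customers.

    Assumes the biggest customer and the `large_customers` with 10+ nodes
    are all disaster recovery (DR) customers.
    """
    if 1 + large_customers > dr_customers:
        raise ValueError("more large customers than DR customers")
    remaining_dr = dr_customers - 1 - large_customers
    non_dr = customers - dr_customers
    return (biggest
            + large_customers * large_size
            + remaining_dr * dr_size
            + non_dr * base_size)


# How the floor moves with the unknown number of 10+ node customers.
for n in range(0, 7):
    print(n, min_total_nodes(large_customers=n))
```

Under these assumptions, the floor stays below 125 with up to four customers at 10+ nodes and passes it with five, so whether the stated count looks low depends mostly on how many "others" there are.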

Clustrix technical notes include: Read more

Categories: Cloud computing, Clustering, Clustrix, Database compression, Market share and customer counts, MySQL, OLTP, Pricing, Structured documents

July 12, 2012

How important is BI flexibility?

How flexible does business intelligence technology need to be? Should it allow fully flexible ad-hoc data analysis, or does that overwhelm users? Are they perhaps happier with simpler, more prescriptive analytic paths? My answer is a resounding “It depends”.

On the one hand, it’s clear that some users really care about business intelligence flexibility. They don’t want the “right” dimensional hierarchy, carefully worked out in advance. They don’t even want fixed drilldown paths smartly calculated on the fly, à la Endeca (which, after all, ultimately didn’t succeed). Rather, they want to be able to truly choose aggregations and roll-ups for themselves.

Supporting this view is the rise of in-memory business intelligence. For example:

SAP HANA is selling in impressive quantities.
Further, HANA and alternatives are generating a lot of buzz. For example:
- Multiple clients have asked me for help positioning their products against HANA and Exalytics.
- Kognitio’s pretense to be HANA-like is getting them some sales too.
QlikView has had considerable success.

But why would anybody pay up for the speed of in-memory BI? Analytic RDBMS offer blazing speed for broad ranges of queries. Parameterized reports let you do drilldowns in memory. So only if you need great flexibility do you need to keep a whole analytic data set permanently in RAM.

Categories: Business intelligence, Market share and customer counts, Memory-centric data management, PivotLink, Teradata

Introduction to Neo Technology and Neo4j

I’ve been talking some with the Neo Technology/Neo4j guys, including Emil Eifrem (CEO/cofounder), Johan Svensson (CTO/cofounder), and Philip Rathle (Senior Director of Products). Basics include:

Neo Technology came up with Neo4j, open sourced it, and is building a company around the open source core product in the usual way.
Neo4j is a graph DBMS.
Neo4j is unlike some other graph DBMS in that:
- Neo4j is designed for OLTP (OnLine Transaction Processing), or at least as a general-purpose DBMS, rather than being focused on investigative analytics.
- To every node or edge managed by Neo4j you can associate an arbitrary collection of (name,value) pairs — i.e., what might be called a document.
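
The "arbitrary collection of (name,value) pairs on every node or edge" idea can be sketched as a small in-memory data structure. This is only an illustration of the data model; the class and method names are hypothetical and are not Neo4j's API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    node_id: int
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: int
    target: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Toy property graph: nodes and edges each carry their own key/value pairs."""

    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}
        self.edges: List[Edge] = []

    def add_node(self, node_id: int, **properties: Any) -> Node:
        self.nodes[node_id] = Node(node_id, dict(properties))
        return self.nodes[node_id]

    def add_edge(self, source: int, target: int, label: str,
                 **properties: Any) -> Edge:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        edge = Edge(source, target, label, dict(properties))
        self.edges.append(edge)
        return edge

    def neighbors(self, node_id: int, label: str = None) -> List[Node]:
        """Nodes reachable over one outgoing edge, optionally filtered by label."""
        return [self.nodes[e.target] for e in self.edges
                if e.source == node_id and (label is None or e.label == label)]


g = PropertyGraph()
g.add_node(1, name="Alice")
g.add_node(2, name="Bob", city="Malmo")  # nodes need not share the same fields
g.add_edge(1, 2, "KNOWS", since=2010)    # edges carry properties too
print([n.properties["name"] for n in g.neighbors(1, "KNOWS")])
```

Note that the two nodes have different property sets, which is what makes each one resemble a small document.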

Numbers and historical facts include:

>50 paying Neo4j customers.
Estimated 1000s of production Neo4j users of open source version.*
Estimated 1/3 of paying customers and free users using Neo4j as a “system of record”.
>30,000 downloads/month, in some sense of “download”.
35 people in 6 countries, vs. 25 last December.
$13 million in VC, most of it last October.
Started in 2000 as the underpinnings for a content management system.
A version of the technology in production in 2003.
Neo4j first open-sourced in 2007.
Big-name customers including Cisco, Adobe, and Deutsche Telekom.
Pricing of either $6,000 or $24,000 per JVM per year for two different commercial versions.