DBMS product categories
Analysis of database management technology in specific product categories. Related subjects include:
Thoughts on in-memory columnar add-ons
Oracle announced its in-memory columnar option Sunday. As usual, I wasn’t briefed; still, I have some observations. For starters:
- Oracle, IBM (Edit: See the rebuttal comment below), and Microsoft are all doing something similar …
- … because it makes sense.
- The basic idea is to take the technology that manages indexes — which are basically columns+pointers — and massage it into an actual column store. However …
- … the devil is in the details. See, for example, my May post on IBM’s version, called BLU, outlining all the engineering IBM did around that feature.
- Notwithstanding certain merits of this approach, I don’t believe in complete alternatives to analytic RDBMS. The rise of analytic DBMS oriented toward multi-structured data just strengthens that point.
I’d also add that Larry Ellison’s pitch “build columns to avoid all that index messiness” sounds like 80% bunk. The physical overhead should be at least as bad, and the main saving in administrative overhead should be that, in effect, you’re indexing ALL columns rather than picking and choosing.
Anyhow, this technology should be viewed as applying to traditional business transaction data, much more than to — for example — web interaction logs, or other machine-generated data. My thoughts around that distinction start:
- I argued back in 2011 that traditional databases will wind up in RAM, basically because …
- … Moore’s Law will make it ever cheaper to store them there.
- Still, cheaper != cheap, so this is a technology only to use with your most valuable data — i.e., that transactional stuff.
- These are very tabular technologies, without much in the way of multi-structured data support.
Categories: Columnar database management, Data warehousing, IBM and DB2, Memory-centric data management, Microsoft and SQL*Server, OLTP, Oracle, SAP AG, Workday | 6 Comments |
Tokutek’s interesting indexing strategy
The general Tokutek strategy has always been:
- Write indexes efficiently, which …
- … makes it reasonable to have more indexes, which …
- … lets more queries run fast.
But the details of “writes indexes efficiently” have been hard to nail down. For example, my post about Tokutek indexing last January, while not really mistaken, is drastically incomplete.
Adding further confusion is that Tokutek now has two product lines:
- TokuDB, a MySQL storage engine.
- TokuMX, in which the parts of MongoDB 2.2 that roughly equate to a storage engine are ripped out and replaced with Tokutek code.
TokuMX further adds language support for transactions and a rewrite of MongoDB’s replication code.
So let’s try again. I had a couple of conversations with Martin Farach-Colton, who:
- Is a Tokutek co-founder.
- Stayed in academia.
- Is a data structures guy, not a database expert per se.
The core ideas of Tokutek’s architecture start: Read more
Categories: Database compression, MongoDB, MySQL, NewSQL, Tokutek and TokuDB | 4 Comments |
Hortonworks business notes
Hortonworks did a business-oriented round of outreach, talking with at least Derrick Harris and me. Notes from my call — for which Rob Bearden didn’t bother showing up — include, in no particular order:
- Hortonworks denies advanced acquisition discussions with either Microsoft and Intel. Of course, that doesn’t exactly contradict the widespread story of Intel having made an acquisition offer. Edit: I have subsequently heard, very credibly, that the denial was untrue.
- As vendors usually do, Hortonworks denies the extreme forms of Cloudera’s suggestion that Hortonworks competitive wins relate to price slashing. But Hortonworks does believe that its license fees often wind up being lower than Cloudera’s, due especially to Hortonworks offering few extra-charge items than Cloudera.
- Hortonworks used a figure of ~75 subscription customers. Edit: That figure turns out in retrospect to have been inflated. This does not include OEM sales through, for example, Teradata, Microsoft Azure, or Rackspace. However, that does include …
- … a small number of installations hosted in the cloud — e.g. ~2 on Amazon Web Services — or otherwise remotely. Also, testing in the cloud seems to be fairly frequent, and the cloud can also be a source of data ingested into Hadoop.
- Since Hortonworks a couple of times made it seem that Rackspace was an important partner, behind only Teradata and Microsoft, I finally asked why. Answers boiled down to a Rackspace Hadoop-as-a-service offering, plus joint work to improve Hadoop-on-OpenStack.
- Other Hortonworks reseller partners seem more important in terms of helping customers consume HDP (Hortonworks Data Platform), rather than for actually doing Hortonworks’ selling for it. (This is unsurprising — channel sales rarely are a path to success for a product that is also appropriately sold by a direct force.)
- Hortonworks listed its major industry sectors as:
- Web and retailing, which it identifies as one thing.
- Media.
- Telecommunications.
- Health care (various subsectors).
- Financial services, which it called “competitive” in the kind of tone that usually signifies “we lose a lot more than we win, and would love to change that”.
In Hortonworks’ view, Hadoop adopters typically start with a specific use case around a new type of data, such as clickstream, sensor, server log, geolocation, or social. Read more
Things I keep needing to say
Some subjects just keep coming up. And so I keep saying things like:
Most generalizations about “Big Data” are false. “Big Data” is a horrific catch-all term, with many different meanings.
Most generalizations about Hadoop are false. Reasons include:
- Hadoop is a collection of disparate things, most particularly data storage and application execution systems.
- The transition from Hadoop 1 to Hadoop 2 will be drastic.
- For key aspects of Hadoop — especially file format and execution engine — there are or will be widely varied options.
Hadoop won’t soon replace relational data warehouses, if indeed it ever does. SQL-on-Hadoop is still very immature. And you can’t replace data warehouses unless you have the power of SQL.
Note: SQL isn’t the only way to provide “the power of SQL”, but alternative approaches are just as immature.
Most generalizations about NoSQL are false. Different NoSQL products are … different. It’s not even accurate to say that all NoSQL systems lack SQL interfaces. (For example, SQL-on-Hadoop often includes SQL-on-HBase.)
Hortonworks, Hadoop, Stinger and Hive
I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger — but at my request we cycled through a bunch of other topics as well. Company-specific notes include:
- Hortonworks founder J. Eric “Eric14” Baldeschwieler is no longer at Hortonworks, although I imagine he stays closely in touch. What he’s doing next is unspecified, except by the general phrase “his own thing”. (Derrick Harris has more on Eric’s departure.)
- John Kreisa still is at Hortonworks, just not as marketing VP. Think instead of partnerships and projects.
- ~250 employees.
- ~70-75 subscription customers.
Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:
- 10ish nodes for a typical starting cluster.
- 100ish nodes for a typical “data lake” committed adoption.
- Teradata UDA (Unified Data Architecture)* customers sometimes (typically?) jumping straight to a data lake scenario.
- A few users in the 10s of 1000s of nodes. (Obviously Yahoo is one.)
- HBase used in >50% of installations.
- Hive probably even more than that.
- Hortonworks is seeing a fair amount of interest in Windows Hadoop deployments.
*By the way — Teradata seems serious about pushing the UDA as a core message.
Ecosystem notes, in Hortonworks’ perception, included:
- Cloudera is obviously Hortonworks’ biggest distro competitor. Next is IBM, presumably in its blue-forever installed base. MapR is barely on the radar screen; Pivotal’s likely rise hasn’t yet hit sales reports.
- Hortonworks evidently sees a lot of MicroStrategy and Tableau, and some Platfora and Datameer, the latter two at around the same level of interest.
- Accumulo is a big deal in the Federal government, and has gotten a few health care wins as well. Its success is all about security. (Note: That’s all consistent with what I hear elsewhere.)
I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.
Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:
- It’s been in preview/release candidate/commercial beta mode for weeks.
- Q3 is the goal; H2 is the emphatic goal.
- Yahoo’s been in production with YARN >8 months, and has no MapReduce 1 clusters left. (Yahoo has >35,000 Hadoop nodes.)
- The last months of delays have been mainly about sprucing up various APIs and protocols, which may need to serve for a similar multi-year period as Hadoop 1’s have. But there also was some YARN stabilization into May.
Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and it’s more than a year between Cloudera starting supporting them and when Hortonworks is offering Hadoop 2.0.
Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include: Read more
“Disruption” in the software industry
I lampoon the word “disruptive” for being badly overused. On the other hand, I often refer to the concept myself. Perhaps I should clarify. 🙂
You probably know that the modern concept of disruption comes from Clayton Christensen, specifically in The Innovator’s Dilemma and its sequel, The Innovator’s Solution. The basic ideas are:
- Market leaders serve high-end customers with complex, high-end products and services, often distributed through a costly sales channel.
- Upstarts serve a different market segment, often cheaply and/or simply, perhaps with a different business model (e.g. a different sales channel).
- Upstarts expand their offerings, and eventually attack the leaders in their core markets.
In response (this is the Innovator’s Solution part):
- Leaders expand their product lines, increasing the value of their offerings in their core markets.
- In particular, leaders expand into adjacent market segments, capturing margins and value even if their historical core businesses are commoditized.
- Leaders may also diversify into direct competition with the upstarts, but that generally works only if it’s via a separate division, perhaps acquired, that has permission to compete hard with the main business.
But not all cleverness is “disruption”.
- Routine product advancement by leaders — even when it’s admirably clever — is “sustaining” innovation, as opposed to the disruptive stuff.
- Innovative new technology from small companies is not, in itself, disruption either.
Here are some of the examples that make me think of the whole subject. Read more
The refactoring of everything
I’ll start with three observations:
- Computer systems can’t be entirely tightly coupled — nothing would ever get developed or tested.
- Computer systems can’t be entirely loosely coupled — nothing would ever get optimized, in performance and functionality alike.
- In an ongoing trend, there is and will be dramatic refactoring as to which connections wind up being loose or tight.
As written, that’s probably pretty obvious. Even so, it’s easy to forget just how pervasive the refactoring is and is likely to be. Let’s survey some examples first, and then speculate about consequences. Read more
Webinar Wednesday, June 26, 1 pm EST — Real-Time Analytics
I’m doing a webinar Wednesday, June 26, at 1 pm EST/10 am PST called:
Real-Time Analytics in the Real World
The sponsor is MemSQL, one of my numerous clients to have recently adopted some version of a “real-time analytics” positioning. The webinar sign-up form has an abstract that I reviewed and approved … albeit before I started actually outlining the talk. 😉
Our plan is:
- I’ll review the multiple technologies and use cases that various companies call “real-time analytics”. I’m not planning for this part to be at all MemSQL-focused.*
- MemSQL will review some specific use cases they feel their product — memory-centric scale-out RDBMS — has proven it supports.
*MemSQL is debuting pretty high in my rankings of content sponsors who are cool with vendor neutrality. I sent them a draft of my slides mentioning other tech vendors and not them, and they didn’t blink.
In other news, I’ll be in California over the next week. Mainly I’ll be visiting clients — and 2 non-clients and some family — 10:00 am through dinner, but I did set aside time to stop by GigaOm Structure on Wednesday. I have sniffles/cough/other stuff even before I go. So please don’t expect a lot of posts until I’ve returned, rested up a bit, and also prepared my webinar deck.
Categories: Analytic technologies, In-memory DBMS, MemSQL, NewSQL, Parallelization | 1 Comment |
WibiData and its Kiji technology
My clients at WibiData:
- Think they’re an application software company …
- … but actually are talking about what I call analytic application subsystems.
- Haven’t announced or shipped any of those either …
- … but will shortly.
- Have meanwhile shipped some cool enabling technology.
- Name their products after sushi restaurants.
Yeah, I like these guys. 🙂
If you’re building an application that “obviously” calls for a NoSQL database, and which has a strong predictive modeling aspect, then WibiData has thought more cleverly about what you need than most vendors I can think of. More precisely, WibiData has thought cleverly about your data management, movement, crunching, serving, and integration. For pure modeling sophistication, you should look elsewhere — but WibiData will gladly integrate with or execute those models for you.
WibiData’s enabling technology, now called Kiji, is a collection of modules, libraries, and so on — think Spring — running over Hadoop/HBase. Except for some newfound modularity, it is much like what I described at the time of WibiData’s launch or what WibiData further disclosed a few months later. Key aspects include:
- A way to define schemas in HBase, including ones that change as rapidly as consumer-interaction applications require.
- An analytic framework called “Produce/Gather”, which can execute at human real-time speeds (via its own execution engine) or with higher throughput in batch mode (by invoking Hadoop MapReduce).
- Enough load capabilities, Hive interaction, and so on to get data into the proper structure in Kiji in the first place.
Categories: Hadoop, HBase, NoSQL, Open source, Predictive modeling and advanced analytics, WibiData | 5 Comments |
MemSQL scales out
The third of my three MySQL-oriented clients I alluded to yesterday is MemSQL. When I wrote about MemSQL last June, the product was an in-memory single-server MySQL workalike. Now scale-out has been added, with general availability today.
MemSQL’s flagship reference is Zynga, across 100s of servers. Beyond that, the company claims (to quote a late draft of the press release):
Enterprises are already using distributed MemSQL in production for operational analytics, network security, real-time recommendations, and risk management.
All four of those use cases fit MemSQL’s positioning in “real-time analytics”. Besides Zynga, MemSQL cites penetration into traditional low-latency markets — financial services (various subsectors) and ad-tech.
Highlights of MemSQL’s new distributed architecture start: Read more