Some notes on new-era data management, March 31, 2013
Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.
Performance confusion
Discussions of DBMS performance are always odd, for starters because:
- Workloads and use cases vary greatly.
- In particular, benchmarks such as the YCSB or TPC-H aren’t very helpful.
But in the NoSQL/NewSQL world, short-request processing performance claims seem particularly confused. Reasons include but are not limited to:
- It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
- Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
- In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy. (See the sketch after this list.)
- Many workloads are inherently single node (replication aside). Others are not.
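To make that replicate-to-RAM-first pattern concrete, here’s a minimal sketch using MongoDB’s pymongo driver, since MongoDB exposes exactly this choice through write concerns. The connection address and the database/collection names are hypothetical.

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://replica-set-host:27017/")  # hypothetical address
db = client["inventory"]                                   # hypothetical database

# w=2: acknowledge once the write has reached RAM on two replica-set
# members. j=False: don't wait for the on-disk journal; each member
# flushes to disk asynchronously on its own schedule.
orders_fast = db.get_collection(
    "orders", write_concern=WriteConcern(w=2, j=False)
)
orders_fast.insert_one({"sku": "ab-123", "qty": 5})

# For comparison: j=True blocks until the primary's journal hits disk,
# trading latency for stronger single-node durability.
orders_durable = db.get_collection(
    "orders", write_concern=WriteConcern(w=1, j=True)
)
orders_durable.insert_one({"sku": "ab-123", "qty": 5})
```

The first write is durable against a single-node failure (it lives in RAM on two machines) well before it is durable against, say, a simultaneous power loss across the cluster, which is precisely why its performance profile differs from that of disk-committed writes.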
MongoDB and 10gen
I caught up with Ron Avnur at 10gen. Technical highlights included:
- MongoDB’s tunable consistency seems really interesting, with numerous choices available at the program-statement level. (A sketch after this list illustrates the idea.)
- All rumored performance problems notwithstanding, Ron asserts that MongoDB often “kicks butt” in actual proof-of-concept (POC) bake-offs.
- Ron cites “12 different language bindings” as a key example of developer functionality giving 10gen an advantage vs. Ron’s previous employer MarkLogic.
- 10gen is working hard on management tools, security, and so on.
- Ron claims that the “MongoDB loses data” knock is a relic of the distant — i.e. 1-2 years ago — past.
- We had the same “Who needs joins?” discussion that I used to have with MarkLogic, and which MarkLogic has since disavowed. 😉
- There’s nothing special about MongoDB’s b-tree indexes. (I mention that because Tokutek thinks it offers a faster MongoDB indexing option.)
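To illustrate the statement-level consistency tuning mentioned above, here’s a hedged sketch, again using pymongo: each collection handle can carry its own read preference and write concern, so the choice really is made per program statement rather than per database. All names are hypothetical.

```python
from pymongo import MongoClient, ReadPreference
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://replica-set-host:27017/")  # hypothetical address
db = client["crm"]                                         # hypothetical database

# Strongly consistent read: always route to the replica-set primary.
customers_strong = db.get_collection(
    "customers", read_preference=ReadPreference.PRIMARY
)
profile = customers_strong.find_one({"_id": "cust-42"})

# Relaxed read of the same data: a possibly-lagging secondary is fine,
# spreading load at the cost of staleness.
customers_relaxed = db.get_collection(
    "customers", read_preference=ReadPreference.SECONDARY_PREFERRED
)
recent = list(customers_relaxed.find({"region": "emea"}).limit(10))

# Writes can be tuned per statement too, e.g. requiring acknowledgement
# from a majority of replica-set members for an important update.
customers_majority = db.get_collection(
    "customers", write_concern=WriteConcern(w="majority")
)
customers_majority.update_one({"_id": "cust-42"}, {"$set": {"tier": "gold"}})
```

One application can thus mix several consistency/durability models against the same database, which is the flexibility referred to in the performance discussion above.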
While this wasn’t a numbers-oriented conversation, business highlights included:
- A lot of MongoDB’s competition is RDBMS — Oracle, SQL Server, MySQL, etc.
- MongoDB’s top NoSQL competitor is Cassandra. 10gen sees less Couchbase than before, and also less HBase than Cassandra.
- There’s yet another favorable MongoDB soft metric — 50,000 registrants for free online education, 2/3 outside the US.
I can add that anecdotal evidence from other industry participants suggests there’s a lot of MongoDB mindshare.
The traditional-enterprise use cases we discussed focused on combining data from heterogeneous systems. Specifically mentioned were:
- Reference data/360-degree customer view.
- Reference data about securities.
- Aggregation of analytic results from various analytic systems across an enterprise (for risk management).
DBAs’ roles in development
A lot of marketing boils down to “We don’t need no stinking DBAs!!!” I’m thinking in particular of:
- NoSQL.
- Hadoop and/or exploratory BI* messaging that positions against the alleged badness of “traditional data warehousing”.
*See in particular the comments to that post.
The worst-case data warehousing scenario is indeed pretty bad. It could feature:
- Much internal discussion and politicking to determine the One True Way to view various data fields, with …
- … lots of ongoing bureaucratic safeguards in the area of data governance.
- Long additional efforts in the area of performance tuning.
- Data integration projects up the wazoo.
But if the goal is just to grab some data from an existing data warehouse, perhaps add in some additional data from the outside, and start analyzing it — well, then there are many attempted solutions to that problem, including from within the analytic RDBMS world. The question is whether the data warehouse administrators try to help — which usually means “Here’s your data; now go away and stop bothering me!” — or whether they focus on “business prevention”.
Meanwhile, on the NoSQL side:
- The smart folks at WibiData felt the need for schema-definition tools over HBase.
- Per Ron Avnur, MongoDB users are clamoring for consistency-rule specification via an administrative (rather than programmatic) UI.
It’s the old loose-/tight-coupling trade-off. Traditional relational practices offer a clean interface between database and code, but bundle the database characteristics for different applications tightly together. NoSQL tends to tie the database for any one app tightly to that app, at the cost of difficulties if multiple applications later try to use the same data (a toy sketch after the list below contrasts the two styles). Either can make sense, depending on (for example):
- How it seems natural to organize your development and data administration talent.
- Whether the app is likely to survive long enough that you’ll want to run many other applications against the same database.
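As a toy illustration of the trade-off, consider where the “schema” lives in each style. The sketch below is hypothetical example code, not anyone’s recommended practice.

```python
from dataclasses import dataclass, asdict

# NoSQL style: the schema lives in this application's code. This app
# decides that a customer document embeds its orders, shaped for its
# own read pattern.
@dataclass
class Order:
    sku: str
    qty: int

@dataclass
class Customer:
    customer_id: str
    name: str
    orders: list  # embedded sub-documents

doc = asdict(Customer("cust-42", "Acme Corp", [Order("ab-123", 5)]))
# collection.insert_one(doc)  # a second application must re-learn this
# shape from the first one's source code before it can use the data

# Relational style: the schema is a contract the DBMS itself enforces,
# decoupled from any one app -- roughly:
#   CREATE TABLE customers (customer_id ..., name ...);
#   CREATE TABLE orders (order_id ..., customer_id REFERENCES customers, ...);
# A new application can join these tables without reading the first
# application's code, at the cost of the up-front modeling work.
```

The document version is quicker to build and matches one app’s access pattern exactly; the relational version pays modeling and tuning costs up front in exchange for the clean multi-application interface.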
Comments
“there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, than asynchronously to disk on each of them.”
Here, you meant ‘then,’ not ‘than,’ right? I’m basing it on the context that follows.
Thanks! Fixed.
DBAs….
Your mention of DBAs follows some weak arguments in some vendors’ collateral. The issue is conflating data modelers with operational/release roles – then teasing about locally controlled data vs. enterprise data.
The fundamental issue with the arguments is that they offer little argument for their own products’ value. It is instead an appeal for the freedom of for-purpose development vs. the constraints of lifecycle management and modeling the larger enterprise. (This strategy worked historically for Microsoft, which got departmental MS SQL databases adopted by offering a rejection of central IT, despite a weak product.) So there are always competing (often aspirational) goals and skills/capabilities between the central and managed on one side, and the new and changing on the other.
Classical DBAs don’t offer much for Hadoop and other big stores; they certainly have a role with NewSQL if the app wants it and both sides get along.
There is also an implication that data is being pulled from DWs or classic OLTP RDBMS to Hadoop – I’ve never seen that. It’s more that they fit into broad app swaths. The places where they do overlap seem to be around analytic products, or a push of extracts into a common store.
Aaron,
I disagree on several fronts.
First, MS-SQL succeeded in its “weak” days in large part due to price and also to the superiority of its tools. It really was much simpler to install and administer. Eventually, Dan Rosenberg came over from Borland to open Oracle’s UI lab, and Andy Mendelsohn made a major push to improve Oracle’s tools. But basically through the latter 1990s Microsoft indeed had a major administrative usability advantage.
Second, the implication is not that data moves from DWs to Hadoop; I don’t know why you think that, especially as my argument to the contrary in a recent post was widely supported, notably by a range of Hadoop luminaries.
Third, while you’re right that operational/release DBAing and modeling aren’t the same thing, they tend to point the same way when you’re selecting a DBMS architecture to buy into. A database that is complicated to model in the first place is likely to be complicated to administer as well, because it’s likely to have more tables, more indexes on those tables, and so on. The correlation isn’t 1.0, but it sure is positive.
I think we’re mostly agreeing on MS SQL – it was a tool that couldn’t get past enterprise rules and got in via departmental byways, where it was an adequate DB, but with easier admin and rich bundled tools. (The point of appealing to local non-top management and to developers seems very similar to many newer DBMSs. Another point: IBM had several working DBMSs that needed little admin, and didn’t sell much of them at the same time.)
My misunderstanding on the second point.
Complex models/multiple stakeholders vs. for-purpose development is the big issue newer DBMSs are confronting. If they are that simple, they are much easier to deal with front to back: admins see them as filesystems or caches, and there is no complexity of multiple stakeholders. They are probably only interesting if they provide a capability not seen yet – and that would be a great topic for discussion.
You give an example of an in-memory distributed DBMS. My experience with these is poor; tested against complex workloads, they don’t perform dramatically better than traditional RDBMS (even if on simple workloads they may do 1-2 orders of magnitude better; scaling tests are often worse as well). This seems to argue that complex tech is hard to do, and that many are successful at caching, and some at filesystems 🙂
A lot is positioning and focus. Informix SE was very competitive with Progress — and I definitely include the respective 4GLs in that — but Informix defocused on that business to compete at the high end.
SQL Server grew because Microsoft realised that DBAs are viewed as a cost and, in most organisations, perform little more than server builds and setting up backup and maintenance jobs. Back in the day, they performed a whole host of additional mundane tasks which Microsoft realised could easily be automated.
Furthermore, the real money-making battleground was SMEs rather than corporates, since for SMEs the cost of a DBA was a greater percentage of the overall IT budget.
They therefore set about inventing as close to a “DBA-less” database product as they could, opting to make the database as self-tuning (e.g. auto-updating stats) and set-and-forget (e.g. auto-expanding files) as possible, whilst still leaving the “knobs-on” for people that wanted to dig deeper.
Microsoft also leveraged its toolset integration across the development stack, so it all works together easily, making it the preferred choice for developers on the .NET platform.
Making a product easy for developers encouraged third-party software package producers to develop for SQL Server, and this too played a big part in forcing SQL Server into reluctant corporates. To this day there are many corporates whose primary database may be DB2 or Oracle, but who have major SQL Server estates, primarily due to the profusion of third-party packaged software on that platform.
In the financial world, they went after Sybase – and the effect has been DEVASTATING to Sybase sales, with bank after bank implementing Sybase-to-Microsoft migrations (to my knowledge, there is only one major investment bank that still has Sybase as a primary platform).