July 15, 2011
Soundbites: the Facebook/MySQL/NoSQL/VoltDB/Stonebraker flap, continued
As a follow-up to the latest Stonebraker kerfuffle, Derrick Harris asked me a bunch of smart questions. My responses and afterthoughts include:
- Facebook et al. are in effect Software as a Service (SaaS) vendors, not enterprise technology users. In particular:
- They have the technical chops to rewrite their code as needed.
- Unlike packaged software vendors, they’re not answerable to anybody for keeping legacy code alive after a rewrite. That makes migration a lot easier.
- If they want to write different parts of their system on different technical underpinnings, nobody can stop them. For example …
- … Facebook created Cassandra, and is now heavily committed to HBase.
- It makes little sense to talk of Facebook’s use of “MySQL.” Better to talk of Facebook’s use of “MySQL + memcached + non-transparent sharding.” That said:
- It’s hard to see why somebody today would use MySQL + memcached + non-transparent sharding for a new project. At least one of Couchbase or transparently-sharded MySQL is very likely a superior alternative. Other alternatives might be better yet.
- As noted above in the example of Facebook, the many major web businesses that are using MySQL + memcached + non-transparent sharding for existing projects can be presumed able to migrate away from that stack as the need arises.
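To make "MySQL + memcached + non-transparent sharding" concrete, here is a minimal sketch of the pattern, not Facebook's actual code. Everything in it is an assumption made for illustration: in-memory SQLite databases stand in for MySQL shards, a plain dict stands in for memcached, and the `ShardedStore` class and its `kv` table are hypothetical names. The point is that the application itself picks the shard and manages the cache, which is what "non-transparent" means here.

```python
import sqlite3
import zlib


class ShardedStore:
    """Application-level ("non-transparent") sharding with a cache-aside layer."""

    def __init__(self, num_shards):
        # Each in-memory SQLite database stands in for one MySQL shard.
        self.shards = [sqlite3.connect(":memory:") for _ in range(num_shards)]
        for db in self.shards:
            db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
        # A plain dict stands in for memcached.
        self.cache = {}

    def _shard_for(self, key):
        # The application, not the DBMS, decides where each key lives.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        db = self._shard_for(key)
        db.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))
        db.commit()
        # Invalidate rather than update, so the next read repopulates the cache.
        self.cache.pop(key, None)

    def get(self, key):
        # Cache-aside read: try the cache first, fall back to the owning shard.
        if key in self.cache:
            return self.cache[key]
        row = self._shard_for(key).execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        self.cache[key] = row[0]
        return row[0]


store = ShardedStore(4)
store.put("user:42", "Alice")
print(store.get("user:42"))
```

Even in this toy version, the routing, invalidation, and cache-fill logic all live in application code. That is the burden a transparently sharded DBMS or an integrated product would take over.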
Continuing with that discussion of DBMS alternatives:
- If you just want to write to the memcached API anyway, why not go with Couchbase?
- If you want to go relational, why not go with MySQL? There are many alternatives for scaling or accelerating MySQL — dbShards, Schooner, Akiban, Tokutek, ScaleBase, ScaleDB, Clustrix, and Xeround come to mind quickly, so there’s a great chance that one or more will fit your use case. (And if you don’t get the choice of MySQL flavor right the first time, porting to another one shouldn’t be all THAT awful.)
- If you really, really want to go in-memory, and don’t mind writing Java stored procedures, and don’t need to do the kinds of joins it isn’t good at, but do need to do the kinds of joins it is, VoltDB could indeed be a good alternative.
And while we’re at it — going schema-free often makes a whole lot of sense. I need to write much more about the point, but for now let’s just say that I look favorably on the Big Four schema-free/NoSQL options of MongoDB, Couchbase, HBase, and Cassandra.
Categories: Akiban, Cache, Cassandra, Clustrix, Couchbase, Data models and architecture, Database diversity, dbShards and CodeFutures, Facebook, HBase, In-memory DBMS, memcached, Michael Stonebraker, MongoDB, NoSQL, Open source, ScaleBase, ScaleDB, Schooner Information Technology, Software as a Service (SaaS), Tokutek and TokuDB, VoltDB and H-Store
Comments
19 Responses to “Soundbites: the Facebook/MySQL/NoSQL/VoltDB/Stonebraker flap, continued”
Don’t forget that Facebook created Apache Hive.
I haven’t forgotten that, but I’m not too clear on its relevance. 😉
Derrick Harris’s comments are totally spot-on. And if you have ever heard Facebook’s standard talk about how this all works (I saw it at MIT), that’s exactly the point: how they use MySQL and memcache and sharding. And why they do it that way. If you get a chance to hear that talk, do not miss it.
A key point is (I am 90% sure I’ve got this right) that they are NOT using MySQL because of its ability to do arbitrary queries! They use MySQL as a persistent key-value store. Evidently InnoDB is very good at this and the front-end MySQL doesn’t get in the way. I was quite surprised when friends of mine told me that MySQL is great at this. You would think that a system optimized to just be a key-value store would beat a more general engine hands down. But the friend telling me this (1) has total technical cred, and (2) has written his own optimized key-value store and found that MySQL is basically just as good. So the whole premise about the virtues of SQL is beside the point.
Here’s what I really want to know: is Facebook actually suffering from a problem with its data infrastructure? Is there actually a problem to be solved AT ALL? If so, I’d like to know more about it. Stonebraker’s comments don’t go into this; they seem to assume tacitly that Facebook is in trouble. But if so, I have not heard about it.
http://www.dbms2.com/2010/08/22/workday-technology-stack/ is another great example of MySQL used as a key-value store.
That said — if those decisions were being made today, I imagine Couchbase would turn out to be a better choice. But it wasn’t available back then.
I guess I’m not sure what’s different about the joins VoltDB does and the joins that other scale-out SQL solutions offer. Which of the multi-node systems offer arbitrary cross-node joins with acceptable performance? I’m not aware of any, but maybe I’m not up on the latest.
VoltDB supports most reasonable joins that are common in OLTP, with more on the way. Due to its explicit partitioning and classification of replicated and partitioned data, VoltDB can often do those joins faster than other systems can work with de-normalized data.
As for the stored procedure interface, while some people like designing the DBMS layer as a data API of sorts, we understand it’s often viewed as a limitation. However, when we talk to people about scaling transactional apps to tens of thousands or hundreds of thousands of transactions per second, the stored procedure interface is usually not a primary concern. People either embrace it or use VoltDB as a transactional key-value store and skip the interface altogether.
Finally, I’d like to point out that of the scale-out SQL systems you mentioned, I think VoltDB is the only one that is open source, though Akiban seems to be gearing up to go open.
John,
Mainly, I’m resisting the apparent claim that VoltDB somehow solves problems other scale-out relational DBMS do not. Most particularly, I’m objecting to what Mike was (hopefully inaccurately) cited as saying, namely that Dwight Merriman’s quite accurate comments about joins were badly mistaken.
Curt,
I’m not sure what Dwight said exactly, so I can’t speak to that. There are certainly those who reject cross-node joins and/or cross node transactions entirely, and I don’t think that makes a lot of sense.
Sure, there are examples of both that are extremely difficult or impossible to scale, but it turns out there are plenty of useful things you can do with cross-node joins/txns at scale. Even with a few restrictions, apps that mix single and multi-node operations look a lot richer than those limited to single-node ops only.
John,
I think you guys made the right choice when you deviated from H-Store in the ways you just said.
As for doing without joins at all — there are certainly use cases when this is fine “just because”. There are further use cases in which, by the time you’ve jumped through the hoops necessary for performance/scale, doing without joins is little or no additional annoyance.
And there are plenty of use cases where doing without joins would be crazy.
(Disclosure, I’m a MySQL employee)
Interesting post as usual! You mention CouchBase a few times in this article, but as far as I can see that is currently the new name for ‘CouchDB’ and there is not yet an integrated CouchDB/Membase hybrid? So perhaps you mean ‘Membase’? Maybe worth clarifying.
One nice property of a SQL RDBMS is that a project can start with a flexible, rich, transactional, complex and non scalable schema. If/when the need to scale arises, the parts of the schema which need to scale can be refactored to use the SQL RDBMS as a sharded key value store, without needing to change the technology stack. There can be a progressive series of tradeoffs between performance and functionality, rather than a single step. It would be uneconomic to do this if the RDBMS were not an efficient key value store, or licence costs for this use case were too high.
Regarding licence costs, complexity etc, a critical factor often not quoted is efficiency – how many nodes, how much network IO, how much disk IO to achieve a given throughput level. I think Daniel Abadi mentioned this contrast between traditional RDBMS discussing scaling to tens of nodes vs NoSQL projects mentioning thousands – to what extent is this due to greater RDBMS efficiency rather than some fundamental ‘scale limit’?
Having a caching layer separate to the persistent data storage layer introduces extra moving parts etc, but that decoupling can be valuable, and it gives more control. You can combine ‘best of breed’ systems, you can isolate some cascading failures etc. Decoupling of layers and manual sharding require more effort when building a system, but can act as a valuable constraint on system design, forcing consideration of the scaling issues upfront, better modularity etc.
It’s easy to imagine one system with everything built in, stable, free, open source, with widespread availability of management competence etc. Bringing a system like this into existence while the state of the best-of-breed components continues to evolve takes a lot of work which needs to be supported somehow until the product/ecosystem is large enough to support itself.
On the subject of the merits of an RDBMS as a key value store, what opinions do you hold on MySQL Cluster?
Frazer,
Somebody who wants to buy Couchbase, the product, today, can surely get it.
And I think a lot more of Couchbase than I do of pre-Couchbase Membase, because of the increased power in data manipulation. It falls short of SQL, obviously, but at least it’s useful for a nice variety of tasks.
Frazer,
Nobody’s ever convinced me that MySQL Cluster has much to recommend it except in the telecom use case it was defined for.
As for your general theme that one shouldn’t rule out relational options out of hand, I agree. If there’s a specific reason to go NoSQL, such as an active desire to avoid schemas, fine. Otherwise, relational possibilities may be quite competitive.
Frazer – Couchbase is a product per http://www.couchbase.com/products-and-services/couchbase-single-server. Among other things they quote 2 to 5k ops/second/node.
Curt, Mark – OK. I was wondering whether there was some integration between Membase and CouchDB yet – it looks like the current Couchbase is a variant of CouchDB. It will be interesting to see what results if/when they combine these products in some way. Perhaps the ops/second/node can climb a little.
Well, I’m under NDA as to the timing. Dunno if Mark is as well.
[…] This post has a sequel. […]
Disclosure: I work on SciDB, another Stonebraker initiative.
Well, we’re about to get something of a real world experiment here. The reason we have schemas at all is to make logical/physical distinctions in the stack. SQL and Relational DBMSs are extremely popular in “business” data processing because they make it possible to build flexible data management platforms with (adequate) scalability. Business IT types generally avoid “no schema” data management platforms because without a schema you can’t write queries and without queries it’s very hard to do important things like functional evolution in your application, decision support reporting, second-applications, or schema integration.
Now, Google+ has shown up, with a “hot new competitive feature” that allows users to organize their friends lists. (Or so I’m told. I don’t have a Facebook page and I don’t have a Google+ account.) Facebook’s now operating in a competitive environment, and the flexibility of its data management platform is about to be tested as it tries to build something like this new “feature”.
If the MySQL + Memcache + lots-of-application-code platform is adequate, then Facebook should be able to roll out something competitive. If, on the other hand, schemas and the logical/physical division are important, it will take them more time.
Let’s wait and see.
Paul,
Perhaps you’re thinking of the telecom disaster in the pre-relational 1980s, when AT&T couldn’t match MCI’s Friends and Families marketing/pricing program for several quarters, and lost significant market share as a result. But I don’t think this is the same thing.
Getting the logical functionality to meet Google+ is the least of the issues, and scaling the logical functionality is only the second-biggest. UI/product packaging is a bigger issue in this case.
If you poke around the intertubes, you’ll find various ex-Googlers complaining that the Google infrastructure gets in the way of innovation. But that’s because it’s mandated the same infrastructure be used for all things. Facebook is more open to changing the infrastructure as an application group deems that to be necessary.
[…] DBMS2 on Facebook/Stonebraker flap […]
[…] Courtesy: DBMS2 […]