Terminology: Transparent sharding
When databases are too big to manage via a single server, responsibility for them is spread among multiple servers. There are numerous names for this strategy, or versions of it — all of them at least somewhat problematic. The most common terms include:
- (Shared-nothing) MPP (Massively Parallel Processing), often used to describe analytic DBMS. On the whole, these terms have worked pretty well, but they have issues even so. First, “MPP” means different things to different marketers. Second, most ostensibly “shared-nothing” systems aren’t really “shared-nothing.” They generally support at least storage arrays, if not storage-area networks (SANs); indeed, in a couple of cases (most notably EMC Greenplum), SAN support is prominent in their marketing message.
- (Horizontal) partitioning and/or data distribution. These have significant problems. “Partitioning” and “distribution” are easily confused with each other, not least because the term “partitioning” is used in different ways by different DBMS product vendors.
- Sharding, commonly used to describe scaled-out MySQL in Internet Request Processing use cases. This one has the advantage of being concise, but is beginning to mean two different things, in that it is used both when the data is REALLY in separate databases on different machines (i.e., the application has to explicitly reference the shard it wants to talk to) and also when the database is transparently distributed (e.g. via dbShards).
- Coherent caching and/or distributed shared memory, describing cases when data is in RAM. Besides being RAM-specific, these terms can be vague as to whether the same data is recopied onto different systems, or whether they are focused on letting (relatively) large in-memory data stores be spread across a cluster.
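The two senses of “sharding” distinguished above can be made concrete with a minimal sketch. All names here (`SHARDS`, `ShardRouter`, `get_user_explicit`) are illustrative assumptions, not any vendor’s actual API:

```python
# Hypothetical sketch: explicit vs. transparent sharding.
SHARDS = {0: "db-server-0", 1: "db-server-1", 2: "db-server-2"}

# Explicit sharding: the application itself picks the shard,
# so it must know the partitioning function.
def get_user_explicit(user_id):
    shard = SHARDS[user_id % len(SHARDS)]
    return f"SELECT on {shard} for user {user_id}"

# Transparent sharding: a routing layer hides the partitioning
# function, so the application sees one logical database.
class ShardRouter:
    def __init__(self, shards):
        self.shards = shards

    def query(self, user_id):
        # Same math, but it now lives below the application's line of sight.
        shard = self.shards[user_id % len(self.shards)]
        return f"SELECT on {shard} for user {user_id}"

router = ShardRouter(SHARDS)
assert get_user_explicit(7) == "SELECT on db-server-1 for user 7"
assert router.query(7) == get_user_explicit(7)  # same routing, hidden from the app
```

The routing arithmetic is identical in both cases; the difference is purely in which layer of the stack knows about it.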
I plan to start using the term transparent sharding to denote a data management strategy in which data is assigned to multiple servers (or CPUs, cores, etc.), yet looks to programmers and applications as if it were managed by just one. Thus,
- dbShards and ScaleBase feature transparent sharding (this is the case which inspired me to introduce the term).
- Anything which has ever reasonably been called a “shared-nothing” MPP DBMS features transparent sharding.
- Memcached features transparent sharding. So, I imagine, do other caching systems I am less familiar with.
- Shared-disk DBMS do not feature transparent sharding, even if their query work can be scaled out across multiple servers. (But Oracle Exadata does, because of its server tier.)
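The memcached case above can be illustrated with a toy client-side hashing sketch. The class, hashing scheme, and in-process dictionaries are assumptions for illustration, not memcached’s actual protocol:

```python
# Sketch of memcached-style client-side sharding: the client library
# hashes each key to a server, so application code just calls get/set
# against one logical cache.
import hashlib

class ShardedCache:
    def __init__(self, servers):
        self.servers = servers
        # Dicts stand in for connections to real cache servers.
        self.stores = {s: {} for s in servers}

    def _server_for(self, key):
        # Deterministic hash maps each key to exactly one server.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.servers[h % len(self.servers)]

    def set(self, key, value):
        self.stores[self._server_for(key)][key] = value

    def get(self, key):
        return self.stores[self._server_for(key)].get(key)

cache = ShardedCache(["cache-a", "cache-b", "cache-c"])
cache.set("user:42", {"name": "Ada"})
assert cache.get("user:42") == {"name": "Ada"}  # caller never names a server
```

The caller’s code is identical whether there is one server or a hundred, which is exactly the sense in which such caching is transparently sharded.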
At the moment, I don’t see much benefit to introducing the term “transparent sharding” into discussions of analytic DBMS; rather, it’s targeted at the Internet Request Processing case. But as the terms “MPP” and “shared-nothing” get ever less useful, I could at some point change my mind.*
*One reason not to switch terms: “MPP” is marvelously concise. 🙂
What do you think of this terminology? Comments will be extremely welcome. But please be so kind as to recall one thing — no technology category definition can ever be perfect.
Comments
27 Responses to “Terminology: Transparent sharding”
Transparent sharding is a good term; I am biased, having used it for a while. It is a very attractive idea conceptually.
However, I would argue that no one is currently delivering on that promise. Specifically, neither join operations nor spatial data models are sharding-oblivious in the cases you cite. Transparent sharding is not just about not explicitly selecting shard keys; it is also about the database behaving as though it were not hash- or range-partitioned. Operations whose performance becomes inexplicably pathological when the data is in fact partitioned are not very “transparent”. If your join operation is not linearly separable across a thousand compute nodes, then the sharding will not be transparent — you will pay a steep price for the fact that the operation is sharded.
I only know of two companies with technology to transparently shard joins, and only one that can transparently shard spatial data models. One released its first product a couple of weeks ago; the other will release its first product in a couple of months. It is an important concept, but delivering it requires transparency for basic computer science problems that are traditionally considered not to be distributable.
Transparent sharding is not about those operations that are easy to distribute, it is about those operations that are difficult to distribute.
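The commenter’s point about joins that are not linearly separable can be sketched with a toy cost model. The formula and row counts below are illustrative assumptions, not measurements of any real system:

```python
# Toy model: a cross-shard join is cheap only when both tables are
# partitioned on the join key (a co-located join). Otherwise rows
# must be shipped between nodes before any joining can happen.
def rows_shipped(left_rows, right_rows, nodes, co_located):
    if co_located:
        return 0  # every matching pair already lives on the same node
    # Naive repartition: each node must send roughly (n-1)/n of its rows
    # to other nodes so that join keys line up.
    return int((left_rows + right_rows) * (nodes - 1) / nodes)

# Co-located: no network cost for the join itself.
assert rows_shipped(1_000_000, 1_000_000, 1000, co_located=True) == 0
# Not co-located: nearly every row crosses the network first.
assert rows_shipped(1_000_000, 1_000_000, 1000, co_located=False) == 1_998_000
```

Under this (simplified) model, the “transparency” of a sharded join depends entirely on whether the partitioning happens to match the join key.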
Recently, I asked a bunch of my friends who are involved with MPP database systems whether they felt that “sharding” referred to (a) something transparent to the application, or (b) something the application is aware of. As you might imagine, there isn’t consensus. One said that “sharding” is (b), and (a) is “parallel database”. Two others said (a). I was particularly interested to see what Rick Cattell would say; he has written the best paper taxonomizing the new data stores. And (drum roll) he says it’s used both ways, and they aren’t even that different. I think you’re wise to use the term “transparent sharding” to clear this up; I’ll start doing that too.
The crux of the biscuit seems to be whether you have to worry about the joins that cross the shards, or whether that is done reasonably well without attention to this detail. If you have to think about how to put data into the database in order for the joins needed by the application to work, then that’s “transparent” in the sense that you have to see through to the implementation. I think you mean “transparent” as in “no worries”, though.
Yes, I agree with Mike. “Transparent” can indicate that the application “sees” the sharding, which I don’t think is what you mean. (Also, I might be one of the people Dan Weinreb is talking about, in that I claimed that the definition of sharding implies that the application is aware of the partitioning function.) I think “invisible sharding” might be a better way to describe dbShards, VoltDB, my lab’s research on deterministic databases, etc.
Look at Clustrix. While still small, they provide massively scalable transparent sharding for real OLTP applications.
Dan, Daniel, et al.,
First of all, I mean LOGICALLY “transparent” only. ALL system design has performance implications; the idea that one could distribute or shard data without performance consequences is a pipe dream, and not at all what I have in mind. I’ve been using the term “transparent” that way since two-phase-commit-based distribution was rolled out in the DBMS market in the 1980s.
Second, “transparent” and “invisible” are very close synonyms. I’d say the main difference is that “transparent” is a little less absolute; if you look through a transparent window, you still probably see that it’s there, but the same isn’t true of something invisible. Given that, I’d say “transparent” is the somewhat more appropriate word, to honor the fact that there will of course be some performance implications to logically transparent sharding.
I agree with Daniel that transparent sharding is confusing. Some in the industry call it autosharding…
Regarding MPP… well, MPP and sharding are conceptually very different technologies. I would even go so far as to say that they are not “…names for this strategy, or versions of it…”, because they are so different.
With MPP, every node communicates with every other node to implement queries. I would argue that MPP is conceptually similar to a single DBMS (with parallel-query support) running on a supercomputer, especially with high-performance, low-latency networks like InfiniBand.
With autosharding, nodes are just storage nodes, albeit smart ones with predicate pushdown support. The “switch node” is actually the DBMS executing the query; in many cases it just retrieves the whole dataset from the storage nodes, while in others “predicate pushdown” helps a lot. Half-jokingly, the proper term for autosharding is “distributed storage with predicate pushdown” 🙂
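This “distributed storage with predicate pushdown” view can be sketched in a few lines. All function names and the sample data are illustrative:

```python
# Sketch of autosharding as "distributed storage with predicate pushdown":
# the head node ships a filter to each storage node and merges only the
# surviving rows, instead of pulling whole datasets back.
def storage_node_scan(rows, predicate):
    # Runs on each storage node: filter locally, return only matches.
    return [r for r in rows if predicate(r)]

def head_node_query(nodes, predicate):
    # The "switch node" fans the predicate out and concatenates results.
    result = []
    for rows in nodes:
        result.extend(storage_node_scan(rows, predicate))
    return result

nodes = [
    [{"id": 1, "qty": 5}, {"id": 2, "qty": 50}],   # storage node 1
    [{"id": 3, "qty": 70}, {"id": 4, "qty": 2}],   # storage node 2
]
big = head_node_query(nodes, lambda r: r["qty"] > 10)
assert [r["id"] for r in big] == [2, 3]  # only matching rows cross the wire
```

The point of the pushdown is visible in the assertion: two rows travel to the head node instead of four.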
I wonder if the remote-table support of SQL Server, Oracle, and even PostgreSQL/GridSQL is a valid form of autosharding. Seriously, you can connect (as remote tables) a lot of DBMS into a tree (or even a DAG) and freely query them… completely transparently and with joins.
Was that sharding thing invented because MySQL didn’t have proper remote-table support?
Camuel,
You raise good points. Thanks.
That said, we were already calling analytic DBMS “MPP” when their inter-node communication was weak and the “fat head” nodes had way too much to do.
Also, I suspect the term “auto-sharding” means something different than you think it does. Specifically, I think it refers to a feature that’s very good to have when you also have transparent sharding, namely that data gets somehow resharded intelligently and/or in the background when resharding is needed.
Very well written.
I think the major news here is that we should stop looking at the term “sharding” as a technical data-access-layer development solution, and instead see it as a completely transparent MPP database solution.
If we look at it this way, then sharding is just the beginning. Other “secret sauces” are required, such as synchronization, joins, and aggregation of results (ORDER BY, GROUP BY, LIMIT, COUNT(*), etc.). For example, keeping some data replicated across several shards might eliminate 90% of cross-shard joins.
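The replication idea in the comment above can be sketched as follows. The table names, data, and helper function are hypothetical:

```python
# Sketch: a small "dimension" table copied to every shard lets each shard
# join locally, with no cross-shard traffic for that join.
countries = {"US": "United States", "FR": "France"}  # small, replicated on all shards

shards = [  # fact rows distributed across shards by order id
    [{"order": 1, "cc": "US"}, {"order": 3, "cc": "FR"}],
    [{"order": 2, "cc": "FR"}],
]

def local_join(shard_rows, dim):
    # Each shard joins against its own full copy of the dimension table.
    return [{**r, "country": dim[r["cc"]]} for r in shard_rows]

# The head node only has to concatenate already-joined results.
joined = [row for shard in shards for row in local_join(shard, countries)]
assert joined[0]["country"] == "United States"
assert len(joined) == 3
```

The trade-off is classic: replicating the small table costs storage and write amplification, but turns what would be a cross-shard join into purely local work.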
Curt,
I agree, categorization is not always clear-cut, and fat-head MPP systems are a borderline case, as may be the rare cases of sharding where pretty complex SQL is sent to the nodes instead of simpler lookups and bulk retrievals.
Regarding autosharding: I confess I don’t know… I also confess I don’t see much difference between sharding and partitioning, especially if an ORM such as Hibernate is used, particularly with coherent transactional caching. I don’t understand why that ORM setup doesn’t constitute sharding.
Camuel,
I’m proposing extending the meaning or at least usage of “sharding” over time, once we have the transparent/non-transparent distinction worked out.
I’m going with the word “sharding” rather than “partitioning” or “distribution” for three reasons:
1. Shorter. 🙂
2. Less overloaded (yes, the other choices are worse in that regard).
3. The term of choice in the area where clarification was most needed (what I’m now calling IRP).
yep, that makes sense.
My understanding is limited in scope to deep knowledge of only 4-5 vendors in this space, but my perception is that the terms “distribution” and “sharding” are used to describe different concepts.
I’ve always associated sharding with clustering – the scale-out approach to clustering.
Cluster – Users are associated with a cluster instance, or some form of workload balancing directs users or queries to an instance.
Sharding – The cluster becomes smarter – objects can be partitioned across instances in the cluster. Partitioning is range, list, or hash in standard DDL. Queries can be directed by partition qualification (restriction, or join if sophisticated enough). A single user query always executes in one and only one database instance at any one time.
When a sharded environment starts executing a single user query, in one or many pieces, across more than one instance concurrently, it becomes an “MPP” database environment.
I’ve always associated distribution with hash allocation of tables over many shared-nothing nodes as an automatic feature – this is not a function of the standard SQL DDL (which partitioning is). Maybe a better way of saying this is that a distributed table in an MPP system has the same DDL on every node, while a sharded table has different DDL on every cluster instance (only 1 of many partition ranges). In a distributed MPP database scheme, an engine must exist to rewrite queries to enforce distribution compatibility. This allows a user query to run on many instances in parallel.
A scan of one date partition in a distributed MPP database will execute in roughly 1/nodecount of the time it would take in a partition-sharded scan of the same data partition. This is because the distributed table partition is hashed across many nodes (distribution is applied before partitioning), while the sharded partition exists in whole on one node.
Eric,
You’re citing common uses for each term. But “cluster” — which can be both a noun and a verb — is seriously overloaded. I was surprised to learn that “sharding” isn’t just what the world’s biggest MySQL sites do. You made a small error in saying that the kind of distribution you are talking about is always based on a hash — it can be purely round-robin as well, and that’s in fact the only option in some systems (e.g. Kognitio). Beyond that, the term “distribution” is overloaded as well.
Terminology is never perfect. 😉
[…] right. That still leaves a lot of options for massive short-request databases, however, including transparently sharded RDBMS, scale-out in-memory DBMS (whether or not VoltDB*), and various NoSQL options. If nothing […]
[…] For retrofitting apps with transparent sharding, there are dbShards and ScaleBase. […]
[…] an obvious improvement on requiring application developers to write both to memcached and to non-transparently-sharded MySQL. The main technical points in adding persistence seem to have […]
It’s been years since MySQL first provided a product with built-in transparent sharding; it’s called MySQL Cluster:
http://www.mysql.com/products/cluster/
Transparent sharding is a good name, and we have been using it for 4 years to explain MySQL Cluster.
[…] Local scale-out (transparent sharding). […]
[…] released single-box only. Transparent sharding is under development for release in the fall. Basic replication is under development […]
[…] I’ve noticed that transparent sharding is being referred to as database virtualization, especially by ParElastic. Transparent sharding is […]
[…] Transparent sharding systems that can be used with, for example, MySQL. […]
[…] expect that to change. And even if it doesn’t, one could use TokuDB in conjunction with a transparent sharding tool such as […]
[…] Deep guys have plans and designs for scale-out — transparent sharding and so […]
[…] All MaxScale sharding is transparent. […]