Terminology: Transparent sharding
When databases are too big to manage via a single server, responsibility for them is spread among multiple servers. There are numerous names for this strategy, or versions of it — all of them at least somewhat problematic. The most common terms include:
- (Shared-nothing) MPP (Massively Parallel Processing), often used to describe analytic DBMS. On the whole, these terms have worked pretty well, but they have issues even so. First, “MPP” means different things to different marketers. Second, most ostensibly “shared-nothing” systems aren’t really “shared-nothing.” They generally support at least storage arrays, if not storage-area networks (SANs); indeed, in a couple of cases (most notably EMC Greenplum), SAN support is prominent in their marketing message.
- (Horizontal) partitioning and/or data distribution. These have significant problems. “Partitioning” and “distribution” are easily confused with each other, not least because the term “partitioning” is used in different ways by different DBMS product vendors.
- Sharding, commonly used to describe scaled-out MySQL in Internet Request Processing use cases. This one has the advantage of being concise, but is beginning to mean two different things, in that it is used both when the data is REALLY in separate databases on different machines (i.e., the application has to explicitly reference the shard it wants to talk to) and also when the database is transparently distributed (e.g. via dbShards).
- Coherent caching and/or distributed shared memory, describing cases when data is in RAM. Besides being RAM-specific, these terms can be vague as to whether the same data is recopied onto different systems, or whether they are focused on letting (relatively) large in-memory data stores be spread across a cluster.
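The two senses of “sharding” distinguished above can be made concrete with a minimal sketch. All names here (`SHARDS`, `ShardRouter`, `get_user_explicit`) are illustrative assumptions, not any vendor’s actual API:

```python
# Hypothetical sketch: explicit vs. transparent sharding.
SHARDS = {0: "db-server-0", 1: "db-server-1", 2: "db-server-2"}

# Explicit sharding: the application itself picks the shard,
# so it must know the partitioning function.
def get_user_explicit(user_id):
    shard = SHARDS[user_id % len(SHARDS)]
    return f"SELECT on {shard} for user {user_id}"

# Transparent sharding: a routing layer hides the partitioning
# function, so the application sees one logical database.
class ShardRouter:
    def __init__(self, shards):
        self.shards = shards

    def query(self, user_id):
        # Same math, but it now lives below the application's line of sight.
        shard = self.shards[user_id % len(self.shards)]
        return f"SELECT on {shard} for user {user_id}"

router = ShardRouter(SHARDS)
assert get_user_explicit(7) == "SELECT on db-server-1 for user 7"
assert router.query(7) == get_user_explicit(7)  # same routing, hidden from the app
```

The routing arithmetic is identical in both cases; the difference is purely in which layer of the stack knows about it.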
I plan to start using the term transparent sharding to denote a data management strategy in which data is assigned to multiple servers (or CPUs, cores, etc.), yet looks to programmers and applications as if it were managed by just one. Thus,
- dbShards and ScaleBase feature transparent sharding (this is the case which inspired me to introduce the term).
- Anything which has ever reasonably been called a “shared-nothing” MPP DBMS features transparent sharding.
- Memcached features transparent sharding. So, I imagine, do other caching systems I am less familiar with.
- Shared-disk DBMS do not feature transparent sharding, even if their query work can be scaled out across multiple servers. (But Oracle Exadata does, because of its server tier.)
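The memcached case above can be illustrated with a toy client-side hashing sketch. The class, hashing scheme, and in-process dictionaries are assumptions for illustration, not memcached’s actual protocol:

```python
# Sketch of memcached-style client-side sharding: the client library
# hashes each key to a server, so application code just calls get/set
# against one logical cache.
import hashlib

class ShardedCache:
    def __init__(self, servers):
        self.servers = servers
        # Dicts stand in for connections to real cache servers.
        self.stores = {s: {} for s in servers}

    def _server_for(self, key):
        # Deterministic hash maps each key to exactly one server.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.servers[h % len(self.servers)]

    def set(self, key, value):
        self.stores[self._server_for(key)][key] = value

    def get(self, key):
        return self.stores[self._server_for(key)].get(key)

cache = ShardedCache(["cache-a", "cache-b", "cache-c"])
cache.set("user:42", {"name": "Ada"})
assert cache.get("user:42") == {"name": "Ada"}  # caller never names a server
```

The caller’s code is identical whether there is one server or a hundred, which is exactly the sense in which such caching is transparently sharded.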
At the moment, I don’t see much benefit to introducing the term “transparent sharding” into discussions of analytic DBMS; rather, it’s targeted at the Internet Request Processing case. But as the terms “MPP” and “shared-nothing” get ever less useful, I could at some point change my mind.*
*One reason not to switch terms: “MPP” is marvelously concise. 🙂
What do you think of this terminology? Comments will be extremely welcome. But please be so kind as to recall one thing — no technology category definition can ever be perfect.
Comments
27 Responses to “Terminology: Transparent sharding”
Transparent sharding is a good term; I am biased, having used it for a while. It is a very attractive idea conceptually.
However, I would argue that no one is currently delivering on that promise. Specifically, neither join operations nor spatial data models are sharding-oblivious in the cases you cite. Transparent sharding is not just about not explicitly selecting shard keys; it is also about the database behaving as though it were not hash- or range-partitioned. Operations whose performance becomes inexplicably pathological when the data is in fact partitioned are not very “transparent”. If your join operation is not linearly separable across a thousand compute nodes, then the sharding will not be transparent — you will pay a steep price for the fact that the operation is sharded.
I only know of two companies with technology to transparently shard joins, and only one that can transparently shard spatial data models. One released its first product a couple of weeks ago; the other will release its first product in a couple of months. It is an important concept, but delivering it requires transparency for basic computer science problems that are traditionally considered not to be distributable.
Transparent sharding is not about those operations that are easy to distribute, it is about those operations that are difficult to distribute.
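The commenter’s point about joins that are not linearly separable can be sketched with a toy cost model. The formula and row counts below are illustrative assumptions, not measurements of any real system:

```python
# Toy model: a cross-shard join is cheap only when both tables are
# partitioned on the join key (a co-located join). Otherwise rows
# must be shipped between nodes before any joining can happen.
def rows_shipped(left_rows, right_rows, nodes, co_located):
    if co_located:
        return 0  # every matching pair already lives on the same node
    # Naive repartition: each node must send roughly (n-1)/n of its rows
    # to other nodes so that join keys line up.
    return int((left_rows + right_rows) * (nodes - 1) / nodes)

# Co-located: no network cost for the join itself.
assert rows_shipped(1_000_000, 1_000_000, 1000, co_located=True) == 0
# Not co-located: nearly every row crosses the network first.
assert rows_shipped(1_000_000, 1_000_000, 1000, co_located=False) == 1_998_000
```

Under this (simplified) model, the “transparency” of a sharded join depends entirely on whether the partitioning happens to match the join key.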
Recently, I asked a bunch of my friends who are involved with MPP database systems whether they felt that “sharding” referred to (a) something transparent to the application, or (b) something the application is aware of. As you might imagine, there isn’t consensus. One said that “sharding” is (b), and (a) is “parallel database”. Two others said (a). I was particularly interested to see what Rick Cattell would say; he has written the best paper taxonomizing the new data stores. And (drum roll) he says it’s used both ways, and they aren’t even that different. I think you’re wise to use the term “transparent sharding” to clear this up; I’ll start doing that too.
The crux of the biscuit seems to be whether you have to worry about the joins that cross the shards, or whether that is done reasonably well without attention to this detail. If you have to think about how to put data into the database in order for the joins needed by the application to work, then that’s “transparent” in the sense that you have to see through to the implementation. I think you mean “transparent” as in “no worries”, though.
Yes, I agree with Mike. “Transparent” can indicate that the application “sees” the sharding, which I don’t think is what you mean. (Also, I might be one of the people Dan Weinreb is talking about, in that I claimed that the definition of sharding implies that the application is aware of the partitioning function.) I think “invisible sharding” might be a better way to describe dbShards, VoltDB, my lab’s research on deterministic databases, etc.
Look at Clustrix. While still small, they provide massively scalable transparent sharding for real OLTP applications.
Dan, Daniel, et al.,
First of all, I mean LOGICALLY “transparent” only. ALL system design has performance implications; the idea that one could distribute or shard data without performance consequences is a pipe dream, and not at all what I have in mind. I’ve been using the term “transparent” that way since two-phase-commit-based distribution was rolled out in the DBMS market in the 1980s.
Second, “transparent” and “invisible” are very close synonyms. I’d say the main difference is that “transparent” is a little less absolute; if you look through a transparent window, you still probably see that it’s there, but the same isn’t true of something invisible. Given that, I’d say “transparent” is the somewhat more appropriate word, to honor the fact that there will of course be some performance implications to logically transparent sharding.
I agree with Daniel that transparent sharding is confusing. Some in the industry call it autosharding…
Regarding MPP… well, MPP and sharding are conceptually very different technologies. I would even go so far as to say that they are not “…names for this strategy, or versions of it…”, because they are so different.
With MPP, every node communicates with every other node to implement queries. I would argue that MPP is conceptually similar to a single DBMS (with parallel-query support) running on a supercomputer, especially with high-performance, low-latency networks like InfiniBand.
With autosharding, nodes are just storage nodes, albeit smart ones with predicate pushdown support. The “switch node” is actually the DBMS executing the query; in many cases it just retrieves the whole dataset from the storage nodes, while in others “predicate pushdown” helps a lot. Half-jokingly, the proper term for autosharding is “distributed storage with predicate pushdown” 🙂
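This “distributed storage with predicate pushdown” view can be sketched in a few lines. All function names and the sample data are illustrative:

```python
# Sketch of autosharding as "distributed storage with predicate pushdown":
# the head node ships a filter to each storage node and merges only the
# surviving rows, instead of pulling whole datasets back.
def storage_node_scan(rows, predicate):
    # Runs on each storage node: filter locally, return only matches.
    return [r for r in rows if predicate(r)]

def head_node_query(nodes, predicate):
    # The "switch node" fans the predicate out and concatenates results.
    result = []
    for rows in nodes:
        result.extend(storage_node_scan(rows, predicate))
    return result

nodes = [
    [{"id": 1, "qty": 5}, {"id": 2, "qty": 50}],   # storage node 1
    [{"id": 3, "qty": 70}, {"id": 4, "qty": 2}],   # storage node 2
]
big = head_node_query(nodes, lambda r: r["qty"] > 10)
assert [r["id"] for r in big] == [2, 3]  # only matching rows cross the wire
```

The point of the pushdown is visible in the assertion: two rows travel to the head node instead of four.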
I wonder if the remote-table support of SQL Server, Oracle, and even PostgreSQL/GridSQL is a valid form of autosharding. Seriously, you can connect (as remote tables) a lot of DBMS into a tree (or even a DAG) and freely query them… completely transparently and with joins.
Was that sharding thing invented because MySQL didn’t have proper remote-table support?
Camuel,
You raise good points. Thanks.
That said, we were already calling analytic DBMS “MPP” when their inter-node communication was weak and the “fat head” nodes had way too much to do.
Also, I suspect the term “auto-sharding” means something different than you think it does. Specifically, I think it refers to a feature that’s very good to have when you also have transparent sharding, namely that data gets somehow resharded intelligently and/or in the background when resharding is needed.
Very well written.
I think the major news here is that we should stop looking at the term “sharding” as a technical data-access-layer development solution, and instead see it as a completely transparent MPP database solution.
If we look at it this way, then sharding is just the beginning. Other “secret sauces” are required, such as synchronization, joins, and aggregation of results (ORDER BY, GROUP BY, LIMIT, COUNT(*), etc.). For example, keeping some data replicated across several shards might eliminate 90% of cross-shard joins.
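The replication idea in the comment above can be sketched as follows. The table names, data, and helper function are hypothetical:

```python
# Sketch: a small "dimension" table copied to every shard lets each shard
# join locally, with no cross-shard traffic for that join.
countries = {"US": "United States", "FR": "France"}  # small, replicated on all shards

shards = [  # fact rows distributed across shards by order id
    [{"order": 1, "cc": "US"}, {"order": 3, "cc": "FR"}],
    [{"order": 2, "cc": "FR"}],
]

def local_join(shard_rows, dim):
    # Each shard joins against its own full copy of the dimension table.
    return [{**r, "country": dim[r["cc"]]} for r in shard_rows]

# The head node only has to concatenate already-joined results.
joined = [row for shard in shards for row in local_join(shard, countries)]
assert joined[0]["country"] == "United States"
assert len(joined) == 3
```

The trade-off is classic: replicating the small table costs storage and write amplification, but turns what would be a cross-shard join into purely local work.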
Curt,
I agree, categorization is not always clear-cut, and fat-head MPP systems are a borderline case, as may be the rare cases of sharding where pretty complex SQL is sent to the nodes instead of simpler lookups and bulk retrievals.
Regarding autosharding: I confess I don’t know… I also confess I don’t see much difference between sharding and partitioning, especially if an ORM such as Hibernate is used, particularly with coherent transactional caching. I don’t understand why that ORM setup doesn’t constitute sharding.
Camuel,
I’m proposing extending the meaning or at least usage of “sharding” over time, once we have the transparent/non-transparent distinction worked out.
I’m going with the word “sharding” rather than “partitioning” or “distribution” for three reasons:
1. Shorter. 🙂
2. Less overloaded (yes, the other choices are worse in that regard).
3. The term of choice in the area where clarification was most needed (what I’m now calling IRP).
yep, that makes sense.
My understanding is limited in scope to deep knowledge of only 4-5 vendors in this space, but my perception is that the terms “distribution” and “sharding” are used to describe different concepts.
I’ve always associated sharding with clustering – the scale-out approach to clustering.
Cluster – Users are associated with a cluster instance, or some form of workload balancing directs users or queries to an instance.
Sharding – The cluster becomes smarter – objects can be partitioned across instances in the cluster. Partitioning is range, list, or hash in standard DDL. Queries can be directed by partition qualification (restriction, or join if sophisticated enough). A single user query always executes in one and only one database instance at any one time.
When a sharded environment starts executing a single user query, in one or many pieces, across more than one instance concurrently, it becomes an “MPP” database environment.
I’ve always associated distribution with hash allocation of tables over many shared-nothing nodes as an automatic feature – this is not a function of the standard SQL DDL (which partitioning is). Maybe a better way of saying this is that a distributed table in an MPP system has the same DDL on every node, while a sharded table has different DDL on every cluster instance (only 1 of many partition ranges). In a distributed MPP database scheme, an engine must exist to rewrite queries to enforce distribution compatibility. This allows a user query to run on many instances in parallel.
A scan of one date partition in a distributed MPP database will execute in roughly 1/nodecount of the time it would take in a partition-sharded scan of the same data partition. This is because the distributed table partition is hashed across many nodes (distribution is applied before partitioning), while the sharded partition exists in whole on one node.
Eric,
You’re citing common uses for each term. But “cluster” — which can be both a noun and a verb — is seriously overloaded. I was surprised to learn that “sharding” isn’t just what the world’s biggest MySQL sites do. You made a small error in saying that the kind of distribution you are talking about is always based on a hash — it can be purely round-robin as well, and that’s in fact the only option in some systems (e.g. Kognitio). Beyond that, the term “distribution” is overloaded as well.
Terminology is never perfect. 😉
[…] right. That still leaves a lot of options for massive short-request databases, however, including transparently sharded RDBMS, scale-out in-memory DBMS (whether or not VoltDB*), and various NoSQL options. If nothing […]
[…] For retrofitting apps with transparent sharding, there are dbShards and ScaleBase. […]
[…] an obvious improvement on requiring application developers to write both to memcached and to non-transparently-sharded MySQL. The main technical points in adding persistence seem to have […]
It’s been years since MySQL first provided a product with built-in transparent sharding; it’s called MySQL Cluster:
http://www.mysql.com/products/cluster/
Transparent sharding is a good name, and we have been using it for 4 years to explain MySQL Cluster.
[…] Local scale-out (transparent sharding). […]
[…] released single-box only. Transparent sharding is under development for release in the fall. Basic replication is under development […]
[…] I’ve noticed that transparent sharding is being referred to as database virtualization, especially by ParElastic. Transparent sharding is […]
[…] Transparent sharding systems that can be used with, for example, MySQL. […]
[…] expect that to change. And even if it doesn’t, one could use TokuDB in conjunction with a transparent sharding tool such as […]
[…] Deep guys have plans and designs for scale-out — transparent sharding and so […]
[…] All MaxScale sharding is transparent. […]