Notes on Microsoft SQL Server
I’ve been known to gripe that covering big companies such as Microsoft is hard. Still, Doug Leland of Microsoft’s SQL Server team checked in for phone calls in August and again today, and I think I got enough to be worth writing about, albeit at a survey level only.
Subjects I’ll mention include:
- Hadoop
- Parallel Data Warehouse
- PolyBase
- Columnar data management
- In-memory data management (Hekaton)
One topic I can’t yet comment about is MOLAP/ROLAP, which is a pity; if anybody can refute my claim that ROLAP trumps MOLAP, it’s either Microsoft or Oracle.
Microsoft’s slides mentioned Yahoo refining a 6 petabyte Hadoop cluster into a 24 terabyte SQL Server “cube”, which was surprising in light of Yahoo’s history as an Oracle reference.
But first we need some housekeeping. As best I understood Microsoft’s lingo:
- Microsoft talks about selling in three form factors, collectively “ABC”:
- A = Appliance, which is how PDW (Parallel Data Warehouse, née DATAllegro) is sold, in partnership with either Dell or HP.
- B = Box, which is a catchy word for “software”.
- C = Cloud*
- Names of major releases go with years — SQL Server 2005, 2008, 2012.
- Timing on the next major SQL Server release hasn’t been disclosed yet …
- … but hopefully will be clarified in the first half of 2013.
- In the meantime, it’s safe to say that it’s a small number of years away, not a small number of quarters.
- Point releases of SQL Server are called “Service Packs”, and Service Pack 1 for SQL Server 2012 is now generally available.
- Public betas for Azure are called “preview”, and that lingo has slipped into other form factors as well.
- Microsoft’s Hadoop efforts are called HDInsight, across at least the Box and Cloud form factors.
*I.e. Azure; pay no attention to dictionaries and poets, who say that skies are azure, while clouds are puffy white.
Microsoft’s Hadoop/HDInsight story starts with what you’d expect:
- You can get it in the cloud or on-premises.
- Hortonworks did a lot of the work.
- Microsoft does Tier 1 support; Hortonworks does Tiers 2 & 3.
The first level of HDInsight management tools will be based on Ambari and donated back to Apache open source, but you might want to integrate them with Microsoft’s long-standing proprietary management suites.
Notes on SQL Server Parallel Data Warehouse include:
- PDW apparently has real reference customers and so on.
- PDW now uses DAS (Direct-Attached Storage) and the like, versus a previous strategy of simulating shared-nothing on a SAN (Storage Area Network).
What sounds like it might be cool is PolyBase, a PDW extension comparable to Hadapt or Teradata Aster SQL-H. Notes on that start:
- Amusingly, PolyBase was developed in the lab of famed MapReduce skeptic Dave DeWitt.
- PolyBase development has been underway for around 18 months.
- PolyBase will ship with the next release of PDW, scheduled for the first half of 2013.
Technically, I gather:
- It has or is a “new query processor” for PDW.
- HDFS (Hadoop Distributed File System) will now look like an external table to SQL Server.
- SQL Server’s query planner/cost-based optimizer has the choice of either pulling data from HDFS into SQL Server, or kicking off MapReduce jobs straight in Hadoop.
I didn’t ask whether HDFS and SQL Server live on the same nodes, à la Hadapt, or different ones, à la Teradata Aster — but I’m guessing the latter, based on Microsoft’s PolyBase page.
And by the way — if SQL Server has significant analytic platform capabilities, nobody’s ever briefed me on them. To the extent it doesn’t, PolyBase/Hadoop might evolve into a partial substitute.
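For flavor, the external-table mechanics described above might look something like the DDL below. PolyBase hadn’t shipped when this was written, so every name, clause, and option here is my illustrative guess, not documented syntax:

```sql
-- Hypothetical sketch of PolyBase-style DDL; not shipped product syntax.
-- Register the Hadoop cluster, then expose an HDFS directory as a table.
CREATE EXTERNAL DATA SOURCE hdfs_cluster
WITH (LOCATION = 'hdfs://namenode:8020');

CREATE EXTERNAL TABLE clickstream (
    user_id   BIGINT,
    url       VARCHAR(2000),
    click_ts  DATETIME
)
WITH (DATA_SOURCE = hdfs_cluster, LOCATION = '/logs/clicks');

-- Per the description above, the cost-based optimizer could then either
-- pull the HDFS rows into SQL Server for this join, or push the work
-- down into Hadoop as MapReduce jobs:
SELECT TOP 10 c.url, COUNT(*) AS clicks
FROM clickstream AS c
JOIN dim_users AS u ON c.user_id = u.user_id
GROUP BY c.url
ORDER BY clicks DESC;
```

The point of the sketch is just that HDFS data and relational data meet in ordinary SQL, with the pull-versus-pushdown decision left to the optimizer.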
Microsoft SQL Server has for a while had a columnar capability, kludged from its indexing system. The big limitations were:
- The column store was read-only.
- You had to have a row-organized version of the data sitting around somewhere.
Both those restrictions are being lifted — initially just in PDW appliances, but later in the “box” products as well. Naturally, Microsoft reports that compression is great, calling it “10X” just like the other cool columnar kids now do. At one point there were hasty mentions of “vector processing” and something that sounds like Netezza zone maps, but I didn’t get details of either.
Actually, I suspect there’s a bit of kludge left in there somewhere, as the no-row-based-version feature is “optional”, and the column store is being described as a “clustered index”.
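The two lifted limitations can be seen in the SQL Server 2012 syntax that actually shipped, which is the read-only flavor (table and index names below are mine):

```sql
-- SQL Server 2012's shipping columnar capability: a NONCLUSTERED
-- columnstore index layered on top of an ordinary row-organized table.
CREATE TABLE fact_sales (
    sale_date  DATE,
    product_id INT,
    amount     MONEY
);

CREATE NONCLUSTERED COLUMNSTORE INDEX ix_sales_cs
ON fact_sales (sale_date, product_id, amount);

-- Once the index exists, the base table becomes read-only until the
-- index is dropped or disabled -- limitation #1 above -- and the
-- row-organized copy of the data still exists -- limitation #2.
-- A "clustered" columnstore, as described, would instead make the
-- column store the primary (and only) copy of the data.
```

That “clustered index” phrasing is also why I suspect the columnstore machinery is still routed through the indexing system rather than being a fully separate store.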
That takes us to Hekaton, which is already in “preview” with about 100 customers even though it won’t be generally available until the next major SQL Server release a few years out. As on other subjects, I lack detail, but I gather that Hekaton has some serious in-memory DBMS design features. Specifically mentioned were the absence of locking and latching.
A key point is that you only have to move some of your tables into Hekaton; you can manage the rest on disk as you always did. This may be regarded as somewhere in between storage tiering and full federation, in that SQL Server is one DBMS, but can invoke several very different storage engines.
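A rough sketch of that table-by-table story follows. Hekaton was only in preview when this was written, so the WITH option below is a guess at the eventual DDL rather than documented syntax:

```sql
-- Hypothetical Hekaton DDL sketch: one hot table moved into the
-- in-memory, lock/latch-free engine, one left on disk as always.
CREATE TABLE hot_sessions (
    session_id BIGINT NOT NULL PRIMARY KEY NONCLUSTERED,
    user_id    BIGINT,
    last_seen  DATETIME2
)
WITH (MEMORY_OPTIMIZED = ON);  -- guessed option name; Hekaton engine

CREATE TABLE order_history (   -- ordinary disk-based table
    order_id   BIGINT NOT NULL PRIMARY KEY,
    session_id BIGINT,
    total      MONEY
);

-- One DBMS, two very different storage engines, one query:
SELECT h.user_id, SUM(o.total) AS lifetime_total
FROM hot_sessions AS h
JOIN order_history AS o ON o.session_id = h.session_id
GROUP BY h.user_id;
```

The design choice worth noting is that the application still sees a single SQL Server; which engine holds which table is a physical-design decision, not a federation problem.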
And that’s all I have for now. Greater substance may or may not follow.
Related links
- A vehement, multi-party debate on SAN versus DAS (2008)
- DATAllegro’s one and only production customer (2009)
- What Netezza zone maps and range partitioning evolved into (2010)
- Andrew Brust’s post on PolyBase (last month)
- Microsoft’s blog post on Hekaton (also last month)
Comments
10 Responses to “Notes on Microsoft SQL Server”
Thanks for the update Curt. The SAN versus DAS link provides very helpful background information and industry discussion…
Regards,
Al D.
When you write “(Parallel Data Warehouse, nee’ DATAllegro)” does this mean that DATAllegro technology made it into the PDW product? I am skeptical that much would be useful from a product that re-wrote SQL before sending it to mostly unmodified Ingres nodes and then collected/summarized the results.
Mark,
The details on that are unknowable from the outside. Microsoft told me in the mid-1990s that they’d already celebrated “Sybase Liberation Day”, because the last lines of Sybase code were gone. A decade later that still wasn’t wholly accurate. More to the point, Sybase “technology” persists in the product long after the code does.
Anyhow, the row-based new analytic DBMSs (Netezza, Greenplum, DATAllegro, Aster) followed similar paths. The question is how far each got, when, and how many scratches they picked up from brambles along the way. So while you’re surely correct to imply heavy rewriting and replacement, that doesn’t mean the DATAllegro seed isn’t still in there somewhere.
Porting is the least of it, though; the bigger issue was building in sophistication that DATAllegro hadn’t gotten around to.
Regardless, it is nice to see Microsoft innovate to keep up in the world of data management. Thanks for the summary.
Mark,
1) There is no DATAllegro code left in PDW. None. As Naughton characterizes the system, it’s a classic example of the “stone soup” parable.
2) For details on how queries are executed in PDW, see our SIGMOD 2012 paper on query optimization. We still send SQL fragments to individual nodes, as was done in DA, but the code that does this is entirely new. We have retained this approach because it avoids forcing all nodes to execute exactly the same plan step for a given SQL fragment (a portion of the entire SQL query). This has proven very useful when the data on the different nodes is highly skewed. The approach has some inefficiencies, but frankly, in a data warehousing environment its overhead is noise.
David
I guess I need to read the paper now.
Using SQL for query fragments is very interesting. I encountered a DBMS that migrated from using SQL to something else to do the opposite of what PDW wants. It wanted to guarantee that the same plans were used on all nodes. But this was before the days of big data.
Is there a lot of research still to be done on parallel query optimization? Using different plans can be a problem when a plan is blocking on one node and not on others, or when a plan produces ordered results on one node and not on others.
Thanks for the update Curt – an interesting read
Is PDW a recommended platform for running OLTP workloads?
Or would Microsoft typically advocate SQL Server on a non-appliance platform while Hekaton is being developed?
Stuart,
I would think the latter. Since PDW seems to plan every database operation twice (once globally and once at every individual node), it’s not ideal for transactional updates.
[…] e.g. Greenplum (especially early on), Aster (ditto), DATAllegro, DATAllegro’s offspring Microsoft PDW (Parallel Data Warehouse), or […]
[…] Microsoft has now launched Hekaton, about which I long ago wrote: I lack detail, but I gather that Hekaton has some serious in-memory DBMS design features. […]