Snowflake Computing
I talked with the Snowflake Computing guys Friday. For starters:
- Snowflake is offering an analytic DBMS on a SaaS (Software as a Service) basis.
- The Snowflake DBMS is built from scratch (as opposed to, for example, being based on PostgreSQL or Hadoop).
- The Snowflake DBMS is columnar and append-only, as has become common for analytic RDBMS.
- Snowflake claims excellent SQL coverage for a 1.0 product.
- Snowflake, the company, has:
- 50 people.
- A similar number of current or past users.
- 5 referenceable customers.
- 2 techie founders out of Oracle, plus Marcin Zukowski.
- Bob Muglia as CEO.
Much of the Snowflake story can be summarized as cloud/elastic/simple/cheap.*
*Excuse me — inexpensive. Companies rarely like their products to be labeled as “cheap”.
In addition to its purely relational functionality, Snowflake accepts poly-structured data. Notes on that start:
- Ingest formats are JSON, XML, or Avro for now.
- I gather that the system automagically decides which fields/attributes are sufficiently repeated to be broken out as separate columns; also, there’s a column for the documents themselves. (A rough sketch follows below.)
I don’t know enough details to judge whether I’d call that an example of schema-on-need.
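To make that concrete, here's the kind of thing I have in mind; the table name, the VARIANT document type, and the colon-path syntax below are illustrative guesses on my part, not confirmed details of the 1.0 product.

```sql
-- Illustrative sketch (not confirmed Snowflake 1.0 syntax): keep the
-- raw document in a single column; sufficiently repeated attributes
-- would also get broken out into ordinary columns behind the scenes.
CREATE TABLE events (
    doc VARIANT   -- the ingested JSON/XML/Avro document
);

-- Pull fields out of the document column by path, casting as needed.
SELECT doc:user.id::STRING     AS user_id,
       doc:event_type::STRING  AS event_type
FROM events
WHERE doc:event_type::STRING = 'purchase';
```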
A key element of Snowflake’s poly-structured data story seems to be lateral views. I’m not too clear on that concept, but I gather:
- A lateral view is something like a join on a table function, inner or outer join as the case may be.
- “Lateral view” is an Oracle term, while “Cross apply” is the term for the same thing in Microsoft SQL Server.
- Lateral views are one of the ways of making SQL handle hierarchical data structures (others evidently are WITH and CONNECT BY).
Lateral views seem central to how Snowflake handles nested data structures. I presume Snowflake also uses or plans to use them in more traditional ways (subqueries, table functions, and/or complex FROM clauses).
If anybody has a good link explaining lateral views, please be so kind as to share! Elementary googling isn’t turning much up, and the Snowflake folks didn’t send over anything clearer than this and this.
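In the meantime, here's a generic sketch of what I understand a lateral view to be; the FLATTEN table function and the table/column names are illustrative assumptions on my part, not confirmed Snowflake syntax.

```sql
-- Generic sketch of a lateral view / lateral join. The defining
-- property: the table function on the right-hand side may reference
-- columns of the table on the left, producing zero or more output
-- rows per left-hand row.
SELECT o.order_id,
       li.value:sku::STRING AS sku,
       li.value:qty::NUMBER AS quantity
FROM orders o,
     LATERAL FLATTEN(input => o.doc:line_items) li;

-- An "outer" variant keeps left-hand rows that produce no output rows,
-- much as an outer join would (compare CROSS APPLY vs. OUTER APPLY in
-- Microsoft SQL Server).
```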
Highlights of Snowflake’s cloud/elastic/simple/inexpensive story include:
- Snowflake’s product is SaaS-only for the foreseeable future.
- Data is stored in compressed 16 megabyte files on Amazon S3, and pulled into Amazon EC2 servers for query execution on an as-needed basis. Allegedly …
- … this makes data storage significantly cheaper than it would be in, for example, an Amazon version of HDFS (Hadoop Distributed File System).
- When you fire up Snowflake, you get a “virtual data warehouse” across one or more nodes. You can have multiple “virtual data warehouses” accessing identical or overlapping sets of data. Each of these “virtual data warehouses” has a physical copy of the data; i.e., this is not related to the Oliver Ratzesberger concept of a virtual data mart defined by workload management. (A brief sketch of the idea appears after this list.)
- Snowflake has no indexes. It does have zone maps, aka data skipping. (Speaking of simple/inexpensive — both those aspects remind me of Netezza.)
- Snowflake doesn’t distribute data on any kind of key. I.e. it’s round-robin. (I think that’s accurate; they didn’t have time to get back to me and confirm.)
- This is not an in-memory story. Data pulled onto Snowflake’s EC2 nodes will commonly wind up in their local storage.
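To make the “virtual data warehouse” point concrete, here's a hypothetical sketch; the DDL below is illustrative, not confirmed Snowflake 1.0 syntax.

```sql
-- Hypothetical sketch: two independently sized compute clusters over
-- the same stored data.
CREATE WAREHOUSE reporting_wh WITH WAREHOUSE_SIZE = 'SMALL';
CREATE WAREHOUSE adhoc_wh     WITH WAREHOUSE_SIZE = 'LARGE';

-- Either warehouse can query the same tables; the data itself lives
-- once, as compressed files on S3, and gets pulled onto whichever EC2
-- nodes need it.
USE WAREHOUSE reporting_wh;
SELECT COUNT(*) FROM events;
```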
Snowflake pricing is based on the sum of:
- Per EC2 server-hour, for a couple classes of node.
- Per S3 terabyte-month of compressed storage.
Right now the cheaper class of EC2 node uses spinning disk, while the more expensive uses flash; soon they’ll both use flash.
DBMS 1.0 versions are notoriously immature, but Snowflake seems — or at least seems to think it is — further ahead than is typical.
- Snowflake’s optimizer is fully cost-based.
- Snowflake thinks it has strong SQL coverage, including a large fraction of SQL 2003 Analytics. Apparently Snowflake has run every TPC-H and TPC-DS query in-house, except that one TPC-DS query relied on a funky rewrite or something like that.
- Snowflake bravely thinks that it’s licked concurrency from Day 1; you just fire up multiple identical virtual DWs if needed to handle the query load. (Note: The set of Version 1 DBMS without concurrent-usage bottlenecks has cardinality very close to 0.)
- Similarly, Snowflake encourages you to fire up a separate load-only DW instance, and load mainly through trickle feeds (sketched after this list).
- Snowflake’s SaaS-only deployment obviates — or at least obscures 🙂 — a variety of management, administration, etc. features that often are lacking in early DBMS releases.
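As for the load-separately advice, here's a hypothetical sketch of a dedicated load warehouse trickling data in; the COPY syntax and the stage name are my assumptions, not confirmed product details.

```sql
-- Hypothetical sketch: a warehouse used only for loading, so that
-- ingest never competes with query traffic. Names and syntax are
-- illustrative.
CREATE WAREHOUSE load_wh WITH WAREHOUSE_SIZE = 'SMALL';
USE WAREHOUSE load_wh;

-- Trickle feed: periodically copy small batches of newly arrived
-- files from a staging area into the target table.
COPY INTO events
FROM @incoming_stage
FILE_FORMAT = (TYPE = 'JSON');
```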
Other DBMS technology notes include:
- Compression is columnar (various algorithms, including file-at-a-time dictionary/token).
- Joins and other database operations are performed on compressed data. (Yay!)
- Those 16-megabyte files are column-organized and immutable. This strongly suggests which kinds of writes can or can’t be done efficiently. 🙂 Note that adding a column — perhaps of derived data — is one of the things that could go well. (See the sketch after this list.)
- There’s some kind of conflict resolution if multiple virtual DWs try to write the same records — but as per the previous point, the kinds of writes for which that’s an issue should be rare anyway.
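Here's a short sketch of the write patterns the immutable-file design suggests; table and column names are illustrative.

```sql
-- Likely cheap: appending rows, or adding a whole new (perhaps
-- derived) column, since either one just writes new immutable files.
INSERT INTO sales SELECT * FROM staging_sales;
ALTER TABLE sales ADD COLUMN region_name STRING;

-- Likely expensive: updating scattered existing rows in place, since
-- each affected 16-megabyte file would have to be rewritten.
UPDATE sales SET customer_segment = 'SMB' WHERE customer_id = 12345;
```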
In the end, a lot boils down to how attractive Snowflake’s prices wind up being. What I can say now is:
- I don’t actually know Snowflake’s pricing …
- … nor the amount of work it can do per node.
- It’s hard to imagine that passing queries from EC2 to S3 is going to give great performance. So Snowflake is more likely to do well when whatever parts of the database wind up being “cached” in the flash of the EC2 servers suffice to answer most queries.
- In theory, Snowflake could offer aggressive loss-leader pricing for a while. But nobody should make a major strategic bet on Snowflake’s offerings unless it shows it has a sustainable business model.
Comments
How may I connect Informatica to the Snowflake database?
Here is a link about Lateral View. It’s specific to Hive, but it has an easy-to-read explanation/example. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
Doesn’t seem like you’ve done much follow-up coverage on Snowflake. A little surprising. I look to you for the smart take on so many things, so I’m disappointed :/ Any reason?
Snowflake now offers all customers access to both 100TB and 10TB versions of TPC-DS (and all 99 of its query patterns) for exploring SQL compatibility and performance in the cloud.
Details are here: https://www.snowflake.net/tpc-ds-now-available-snowflake-samples/
We believe this is the largest sample database available on any platform today, and proves the scalability and cost-effectiveness of Snowflake’s unique architecture for analytic workloads.