May 1, 2014
MemSQL update
I stopped by MemSQL last week, and got a range of new or clarified information. For starters:
- Even though MemSQL (the product) was originally designed for OLTP (OnLine Transaction Processing), MemSQL (the company) is now focused on analytic use cases …
- … which was the point of introducing MemSQL’s flash-based columnar option.
- One MemSQL customer has a 100 TB “data warehouse” installation on Amazon.
- Another has “dozens” of terabytes of data spread across 500 machines, which aggregate 36 TB of RAM.
- At customer Shutterstock, 1000s of non-MemSQL nodes are monitored by 4 MemSQL machines.
- A couple of MemSQL’s top references are also Vertica flagship customers; one of course is Zynga.
- MemSQL reports encountering Clustrix and VoltDB in a few competitive situations, but not NuoDB. MemSQL believes that VoltDB is still hampered by its traditional issues — Java, reliance on stored procedures, etc.
On the more technical side:
- Some MemSQL users are running 7- or 8-way joins and other long-ish SQL statements.
- But MemSQL doesn’t yet have fully peer-to-peer data redistribution.
- MemSQL “leaves” only talk to MemSQL “aggregator nodes,” not each other …
- … but note the plural on “aggregator nodes”, which should immunize MemSQL from the worst of “fat head” bottlenecks.
- Of course, you can sometimes get join locality by sharding multiple tables on the same key …
- … or by broadcast-replicating tables that are sufficiently small.
- Better SQL coverage — e.g. SQL Windowing — is coming soon.
- MemSQL believes it has an aggressive data skipping story.
- MemSQL doesn’t yet have a true workload management story; they’re still at the stage “Our queries run so fast not many of them have to be active at once, and if things nevertheless get too busy we have some throttling capabilities.” But MemSQL at least sounds aware of the difference between that and true workload management, which puts them ahead of some other vendors I talk with.
- MemSQL doesn’t have stored procedures. In particular, since MemSQL (the product) generates code on the fly, MemSQL (the company) doesn’t think the performance benefits of stored procedure pre-compilation are needed.
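To make the join-locality point above concrete, here is a minimal sketch of why sharding two tables on the same key lets each leaf join its own rows without talking to other leaves. This is not MemSQL's implementation: the modulo placement rule, the four-leaf cluster, the table layouts, and every function name are assumptions made purely for illustration.

```python
from collections import defaultdict

NUM_LEAVES = 4  # hypothetical cluster size, chosen for illustration


def leaf_for(key, num_leaves=NUM_LEAVES):
    """Assign a row to a leaf by its shard key (toy modulo rule for integer keys)."""
    return key % num_leaves


def shard(rows, key_index, num_leaves=NUM_LEAVES):
    """Partition rows across leaves on the column at key_index."""
    leaves = defaultdict(list)
    for row in rows:
        leaves[leaf_for(row[key_index], num_leaves)].append(row)
    return leaves


def local_join(left_rows, right_rows, left_key, right_key):
    """Hash join that runs entirely on one leaf."""
    index = defaultdict(list)
    for r in right_rows:
        index[r[right_key]].append(r)
    return [l + r for l in left_rows for r in index.get(l[left_key], [])]


def colocated_join(orders, customers, num_leaves=NUM_LEAVES):
    """Join orders to customers when both are sharded on customer_id.

    Matching rows always land on the same leaf, so each leaf joins only
    its own data and the aggregator just concatenates the partial results.
    """
    order_shards = shard(orders, 1, num_leaves)        # customer_id is column 1
    customer_shards = shard(customers, 0, num_leaves)  # customer_id is column 0
    result = []
    for leaf in range(num_leaves):
        result.extend(
            local_join(order_shards.get(leaf, []), customer_shards.get(leaf, []), 1, 0)
        )
    return result


if __name__ == "__main__":
    orders = [(1, 10, 5.0), (2, 11, 7.5), (3, 10, 1.0)]  # (order_id, customer_id, amount)
    customers = [(10, "Ann"), (11, "Bob")]               # (customer_id, name)
    for row in sorted(colocated_join(orders, customers)):
        print(row)
```

Broadcast replication would reach the same locality a different way: a sufficiently small table is copied in full to every leaf, so it can be joined locally regardless of how the large table is sharded.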
And finally, MemSQL’s column-store compression story — which I mangled in a previous post — goes like this:
- There are numerous compression algorithm choices, both columnar (e.g. dictionary/tokenization, run-length encoding) and block (Lempel-Ziv, I presume in multiple variations).
- Compression is block-by-block, something I hear more commonly these days than Vertica’s alternative of global compression choices.
- The choice of compression scheme is automagic for each block, unless you give explicit hints.
- Default block size for the columnar store is 10 million rows.
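As a rough illustration of block-by-block scheme selection, the sketch below encodes each block with several candidate schemes and keeps whichever comes out smallest, unless an explicit hint is given. It is only a sketch under stated assumptions: MemSQL's actual selection logic and storage formats were not described to me at this level, so the "smallest JSON-serialized size wins" rule, the use of zlib (DEFLATE, which is Lempel-Ziv based) as the block compressor, and all function names here are hypothetical.

```python
import json
import zlib


def rle_encode(values):
    """Run-length encode a list into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def dict_encode(values):
    """Replace each value with a small integer token plus a dictionary."""
    dictionary = sorted(set(values))
    token = {v: i for i, v in enumerate(dictionary)}
    return {"dict": dictionary, "tokens": [token[v] for v in values]}


def serialize(obj):
    """Crude size proxy: compact JSON bytes (an assumption, not a real storage format)."""
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


def compress_block(values, hint=None):
    """Return (scheme_name, payload) for one block of column values."""
    candidates = {
        "rle": serialize(rle_encode(values)),
        "dictionary": serialize(dict_encode(values)),
        "lz": zlib.compress(serialize(values)),
    }
    if hint is not None:
        # An explicit hint overrides the automatic choice.
        return hint, candidates[hint]
    scheme = min(candidates, key=lambda name: len(candidates[name]))
    return scheme, candidates[scheme]


def compress_column(values, block_size):
    """Compress a column block by block; each block picks its own scheme."""
    return [
        compress_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


if __name__ == "__main__":
    column = ["a"] * 1000 + [str(i) for i in range(1000)]
    for scheme, payload in compress_column(column, block_size=1000):
        print(scheme, len(payload))
```

The point of the design is that a column whose character changes partway through (long runs in one stretch, high cardinality in another) is not stuck with a single global choice.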
Categories: Clustering, Clustrix, Columnar database management, Data warehousing, Database compression, In-memory DBMS, MemSQL, NewSQL, NuoDB, Specific users, Vertica Systems, VoltDB and H-Store, Workload management, Zynga
Comments
18 Responses to “MemSQL update”
Don’t forget that MemSQL can spin up in AWS in 10 mins!
Companies love the simplicity and flexibility of multiple deployment options on commodity hardware.
To clarify the VoltDB misinformation, VoltDB can be run by end users without writing a single stored procedure or ever directly interacting with Java. If used this way VoltDB runs single-statement transactions and does not support external transaction control, exactly like MemSQL. There is literally no “there” there to the Java and stored procedure argument when comparing VoltDB to a system without multi-statement transactions (like MemSQL or most NoSQL).
If a user does use simple stored procedures written in Java or Groovy, we support complex procedural logic with many SQL statements in a single, ACID-compliant transaction. This is something users can’t do with MemSQL, so it’s weird they’re eager to call attention to it.
Thanks Curt. Very informative. So why does the world need another columnar MPP DBMS?
It is very interesting to see MemSQL pivoting towards analytics and adding a columnar option. Sounds me-too. Aren’t there already enough columnar MPP solutions in the market (ParAccel, Vertica etc.) that are way more mature than MemSQL?
WRT the aggregator nodes, this doesn’t seem to be an intuitive architecture. Let’s take an example. Assume 10 leaf nodes and 4 aggregator nodes. Assume that the first phase of aggregation happens on the leaf nodes, and that the results are then shipped to the aggregator nodes, where the final aggregation takes place. How does this architecture scale in the case of high-cardinality aggregates or count distincts?
Looks like MemSQL is trying to figure out the market and is confused about which way to go (OLTP or analytics or both).
Interesting where investors are placing bets:
MemSQL’s Series B was $35m
http://gigaom.com/2014/01/22/fast-growing-database-startup-memsql-raises-35m/
VoltDB’s Series B was $8m
http://www.prnewswire.com/news-releases/voltdb-closes-8-million-series-b-round-of-funding-to-accelerate-sales-and-marketing-and-lead-the-smart-revolution-250614101.html
Jason, unless you can also link to term sheets, I’m not sure there’s a judgement to make here.
In the meantime, I’m happy to make any comparisons based on the value of the products to their users. VoltDB 4.2 is our fastest, lowest-latency-est, most space efficient and most featureful release yet. Of course we do internal comparisons to systems like MemSQL and love what we see in all of those dimensions, but we heartily encourage users to do the same.
John,
I think a lot of the action in data management the past few years has been around doing some level of analytics on very fresh data. The associated marketing term is commonly some variant of “real-time”.
But then, even classical OLTP apps typically have a lot of analytics in them. As I often point out, SAP told me almost a decade ago that over half of their apps’ processing was actually analytic (reporting, planning, whatever). Indeed, there are plenty of kinds of OLTP workflow that use reports as their starting point.
Curt,
I’m not sure if the point your trying to make is that transactions/procedures are not valuable in analytic workloads, or if you think VoltDB is an OLTP-focused system. I’ll address both.
VoltDB has been messaging Real-Time Analytics for some time, and has stepped up this messaging for our 4.0 launch significantly. The majority of our users are using VoltDB to glean insight from their live data. We’ve put significant engineering time into SQL-accessible data structures to enable powerful decisioning, such as materialized views, ranking indexes and function-based indexes. For aggregations and leaderboards, VoltDB’s performance is unmatched. We’re rolling out an example in 4.3 next week showing instant responses to queries on pre-aggregated moving time-windows.
It’s also fair to say that a majority of our users are using VoltDB as an OLTP store, often as the system of record. This is enabled by VoltDB’s tremendous performance, mature high-availability and disk-durability features. While some users use VoltDB as a consistent key-value or document store, it seems natural that many of our users are leveraging both OLTP and Real-Time Analytical models at the same time. These use cases typically involve high update volume combined with live reporting / alerting on the OLTP store.
So why are transactions and procedures useful in Real-Time Analytics? Mostly because running a SQL insert statement (or bulk-loader) is just such a waste of an opportunity to make decisions at the point of ingestion.
We have multiple customers processing sensor networks directly with VoltDB. In one example, an app de-dups and filters RFID sensor triggers according to business logic, making a data firehose into a data stream. In another, an app updates a running calculation based on new sensor data using complex math software in the Java portion of the procedure.
Besides filtering, processing and enriching, stored procedures enable alerting and chained actions beyond what’s possible using SQL triggers. Statistics about a significant portion of all IP traffic in Japan is going through a single VoltDB node whose only job is to detect DDOS attempts faster and cheaper than other systems.
Ouch. A “your” vs “you’re” in the first sentence. You win Friday. You win.
Not to mention “statistics … is”. 😉
As for the rest — it’s been a while since I talked with you guys. Maybe we should fix that. Do you have a competent and civil marketing guy these days, or should we keep it strictly to engineering? 🙂
Yeah, rushing to get home on a Friday night.
I’ll talk to the new marketing people and get back to you. 😉
What does this mean?
“At customer Shutterstock, 1000s of non-MemSQL nodes are monitored by 4 MemSQL machines.”
Thanks Curt. With MemSQL being downloadable, the product can be tested easily :)). With their current version, I would love to see the actual query and the rows in each table where they claim they ran 7- to 8-way joins. At the same time, it is a bit strange that without any workload management, MemSQL claims to be able to run analytics and OLTP on the same cluster. I guess “analytics” is being used loosely.
Out of the 3 vendors mentioned in this post, Clustrix seems to have the most comprehensive coverage for SQL and ACID capabilities. Any thoughts on that product?
Curt, one more clarification: the post mentions “dozens” of terabytes spread across 500 machines. Can you please elaborate on this? What was the node configuration, and what was the concurrency for analytics? That sounds like a bit too much hardware for dozens of TB. The reason could be inherent MemSQL engine issues requiring this much hardware.
ddorian,
Increasingly, a use for short-request data stores, SQL or NoSQL as the case may be, is to monitor devices. That can get up into the millions of simple devices. In the Shutterstock use case, it seems to be thousands of more complex ones.
John,
We’re talking about software that was, until recently, in-memory only. I’m not seeing the same arithmetic problem you are.
I don’t know why the VoltDB guys make it a point that they also have traditional JDBC support in lieu of stored procedures. The fact is, as you even admit, that using JDBC will result in much worse performance than Java stored procedures. What makes MemSQL neat, if true, is that you can get the high performance of an in-memory database with ACID, without being limited to stored procedures in your application.
John Hugg,
Is VoltDB written in Java?
Apologies, I can’t find that info on VoltDB’s site.
Thanks