MemSQL 4.0
I talked with my clients at MemSQL about the release of MemSQL 4.0. Let’s start with the reminders:
- MemSQL started out as an in-memory OLTP (OnLine Transaction Processing) DBMS …
- … but quickly positioned with “We also do ‘real-time’ analytic processing” …
- … and backed that up by adding a flash-based column store option …
- … before Gartner ever got around to popularizing the term HTAP (Hybrid Transaction and Analytic Processing).
- There’s also a JSON option.
The main new aspects of MemSQL 4.0 are:
- Geospatial indexing. This is for me the most interesting part.
- A new optimizer and, I suppose, query planner …
- … which in particular allow for serious distributed joins.
- Some rather parallel-sounding connectors to Spark, Hadoop, and Amazon S3.
- Usual-suspect stuff including:
- More SQL coverage (I forgot to ask for details).
- Some added or enhanced administrative/tuning/whatever tools (again, I forgot to ask for details).
- Surely some general Bottleneck Whack-A-Mole.
There’s also a new free MemSQL “Community Edition”. MemSQL hopes you’ll experiment with this but not use it in production. And MemSQL pricing is now wholly based on RAM usage, so the column store is quasi-free from a licensing standpoint as well.
Before MemSQL 4.0, distributed joins were restricted to the easy cases:
- Two tables are distributed (i.e. sharded) on the same key.
- One table is small enough to be broadcast to each node.
Now arbitrary tables can be joined, with data reshuffling as needed; a sketch after the notes below illustrates the reshuffle-then-join idea. Notes on MemSQL 4.0 joins include:
- Join algorithms are currently nested-loop and hash, and in “narrow cases” also merge.
- MemSQL fondly believes that its in-memory indexes work very well for nested-loop joins.
- The new optimizer is fully cost-based (but I didn’t get much clarity as to the cost estimators for JSON).
- MemSQL’s indexing scheme, skip lists, had histograms anyway, with the cutesy name skiplistogram.
- MemSQL’s queries have always been compiled, and of course have to be planned before compilation. However, there’s a little bit of plan flexibility built in based on the specific values queried for, aka “parameter-sensitive plans” or “run-time plan choosing”.
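To make the reshuffle idea concrete, here is a minimal sketch in plain Python (my illustration, not MemSQL code): two tables sharded on different keys, so that neither easy case applies, are re-partitioned by hash of the join key, after which each partition can be joined locally with an ordinary hash join. Table names and contents are made up.

```python
# Illustrative reshuffle + local hash join; nothing here is MemSQL code.
from collections import defaultdict

def reshuffle(rows, key_index, num_partitions):
    """Re-partition rows by hash of the join key (the 'shuffle' step)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions

def local_hash_join(left_rows, right_rows):
    """Ordinary single-node hash join within one partition."""
    index = defaultdict(list)
    for l in left_rows:
        index[l[0]].append(l)
    return [(l, r) for r in right_rows for l in index[r[0]]]

# Two hypothetical "tables", originally sharded on different keys.
orders = [(1, "alice"), (2, "bob"), (3, "carol")]   # join key: column 0
payments = [(2, 19.99), (3, 5.00), (2, 7.50)]       # join key: column 0

P = 4  # pretend we have 4 nodes
left, right = reshuffle(orders, 0, P), reshuffle(payments, 0, P)
print([pair for p in range(P)
       for pair in local_hash_join(left[p], right[p])])
```

The expensive part in a real distributed system is of course the network transfer implied by reshuffle(), which is presumably why the co-sharded and broadcast cases were supported first.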
To understand the Spark/MemSQL connector, recall that MemSQL has “leaf” nodes, which store data, and “aggregator” nodes, which combine query results and ship them back to the requesting client. The Spark/MemSQL connector manages to skip the aggregation step, instead shipping data directly from the various MemSQL leaf nodes to a Spark cluster. In the other direction, a Spark RDD can be saved into MemSQL as a table. This is also somehow parallel, and can be configured either as a batch update or as an append; intermediate “conflict resolution” policies are possible as well.
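The parallelism is easy to picture even without knowing the connector’s actual API. Below is a hedged, PySpark-flavored sketch of the leaf-bypass idea, in which LEAF_HOSTS, read_leaf, and the clicks table are all hypothetical stand-ins; the one real fact it leans on is that MemSQL speaks the MySQL wire protocol, so any MySQL client library (here pymysql) can talk to a leaf.

```python
# Conceptual sketch only: give each Spark partition its own connection to
# one MemSQL leaf, so bulk data skips the aggregator. Not the real connector.
from pyspark import SparkContext
import pymysql  # works because MemSQL speaks the MySQL wire protocol

LEAF_HOSTS = ["leaf1:3306", "leaf2:3306", "leaf3:3306"]  # hypothetical

def read_leaf(host_port):
    host, port = host_port.split(":")
    conn = pymysql.connect(host=host, port=int(port), user="root", db="app")
    with conn.cursor() as cur:
        cur.execute("SELECT id, value FROM clicks")  # this leaf's shards only
        for row in cur:
            yield row
    conn.close()

sc = SparkContext()
# One Spark partition per leaf; each pulls its leaf's data directly.
rdd = sc.parallelize(LEAF_HOSTS, len(LEAF_HOSTS)).flatMap(read_leaf)
print(rdd.count())
```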
In other connectivity notes:
- MemSQL’s idea of a lambda architecture involves a Kafka stream, with data likely being stored twice (in Hadoop and MemSQL); a sketch of that dual write follows this list.
- MemSQL likes and supports the Spark DataFrame API, and says financial trading firms are already using it.
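As promised above, here is a minimal sketch of the dual-write flavor of a lambda architecture, assuming kafka-python and pymysql; the topic name, table, staging path, and host names are all hypothetical. Each event lands both in a batch file destined for Hadoop and in a MemSQL table for real-time queries.

```python
# Minimal lambda-style dual write: Kafka in, Hadoop staging file + MemSQL out.
# All names are hypothetical; error handling and batching are omitted.
from kafka import KafkaConsumer
import pymysql

consumer = KafkaConsumer("events", bootstrap_servers="kafka:9092")
memsql = pymysql.connect(host="memsql-agg", user="root", db="rt",
                         autocommit=True)

with open("/staging/events.log", "a") as batch_file:  # later bulk-loaded to HDFS
    for msg in consumer:
        event = msg.value.decode("utf-8")
        batch_file.write(event + "\n")            # batch layer (Hadoop side)
        with memsql.cursor() as cur:              # speed layer (MemSQL side)
            cur.execute("INSERT INTO events_rt (payload) VALUES (%s)", (event,))
```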
Other application areas cited for streaming/lambda kinds of architectures are — you guessed it! — ad-tech and “anomaly detection”.
And now to the geospatial stuff. I thought I heard:
- A “point” is actually a square region less than 1 mm per side.
- There are on the order of 2^30 such points on the surface of the Earth.
Given that the Earth’s surface area is a little over 500,000,000 square meters, I’d think 2^50 would be a better figure, but fortunately that discrepancy doesn’t matter to the rest of the discussion. (Edit: As per a comment below, that’s actually square kilometers, so unless I made further errors we’re up to the 2^70 range.)
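For the record, the arithmetic behind that, in a few lines of Python:

```python
import math

earth_surface_km2 = 5.1e8          # a little over 500,000,000 km^2
mm2_per_km2 = (10**6) ** 2         # 10^6 mm per km, squared
cells = earth_surface_km2 * mm2_per_km2   # 1 mm x 1 mm "points"
print(math.log2(cells))            # ~68.8, i.e. the 2^70 range
```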
Anyhow, if the two popular alternatives for geospatial indexing are R-trees or space-filling curves, MemSQL favors the latter. (One issue MemSQL sees with R-trees is concurrency.) Notes on space-filling curves start:
- In this context, a space-filling curve is a sequential numbering of points in a higher-dimensional space. (In MemSQL’s case, the dimension is two.)
- Hilbert curves seem to be in vogue, including at MemSQL.
- Nice properties of Hilbert space-filling curves include:
- Numbers near each other always correspond to points near each other.
- The converse is almost always true as well.*
- If you take a sequence of numbers that is simply the set of all possibilities with a particular prefix string, that will correspond to a square region. (The shorter the prefix, the larger the square.)
*You could say it’s true except in edge cases … but then you’d deserve to be punished.
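For the curious, here is one standard way to assign those numbers, adapted from the well-known grid-to-Hilbert-distance algorithm (this is textbook code, not MemSQL’s implementation):

```python
def xy2d(n, x, y):
    """Map point (x, y) on an n x n grid (n a power of 2) to its
    distance d along the Hilbert curve. Standard algorithm, as on
    the Hilbert curve's Wikipedia page."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so recursion sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# The "near numbers <-> near points" property in action: the four cells of
# the first 2x2 quadrant get the consecutive numbers 0, 1, 2, 3.
print([xy2d(8, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]])
```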
Given all that, my understanding of the way MemSQL indexes geospatial stuff — specifically points and polygons — is:
- Points have numbers assigned to them by the space-filling curve; those are indexed in MemSQL’s usual way. (Skip lists.)
- A polygon is represented by its vertices. Take the longest prefix their curve numbers share. That could be used to index the polygon (you’d retrieve a square region that includes it). But actually …
- … a polygon is covered by a union of such special square regions, and indexed accordingly, and I neglected to ask exactly how the covering set of squares was chosen.
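Here is a hedged sketch of the prefix-to-square idea (my reconstruction, not MemSQL code): on a 2^depth by 2^depth grid, Hilbert values have 2*depth bits, an even-length prefix denotes a square sub-region, and all cells in that square occupy one contiguous range of curve positions, i.e. one index range scan.

```python
def prefix_range(prefix_bits, prefix_len, depth):
    """Inclusive-exclusive (lo, hi) range of Hilbert values sharing a prefix.

    prefix_len must be even; each 2-bit step of the prefix descends one
    level of quadrants, so the prefix names a square.
    """
    assert prefix_len % 2 == 0
    shift = 2 * depth - prefix_len
    lo = prefix_bits << shift
    return lo, lo + (1 << shift)

# Depth 3 means an 8x8 grid with 64 cells; the 2-bit prefix 0b01 names one
# quadrant, i.e. curve positions 16 through 31.
print(prefix_range(0b01, 2, 3))  # (16, 32)
```

Indexing a polygon as a union of such squares thus turns a geospatial lookup into a union of ordinary range scans, which is exactly the kind of access a skip-list index handles well.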
As for company metrics — MemSQL cites >50 customers and >60 employees.
Related links
- I’ve posted about earlier versions of MemSQL technology, e.g. in May 2014, April 2013, and June 2012.
Comments
Curt,
small correction: “… the Earth’s surface area is a little over 500,000,000 square _kilometers_ …”
Regards,
Heiko
Thanks, Heiko! Corrected above.
I believe that “in memory” data can be efficient for both OLTP and OLAP, something disk cannot give us… Having said that, it is a challenge to build a reliable system which allows data mutation.
Vertica already has an in-memory row store and a disk-based column store, so this doesn’t seem like anything new. Most OLTP workloads are still not outgrowing current architectures as much as analytical workloads did. I suspect that MemSQL use cases are more analytical.
I am also a bit skeptical about running OLTP and analytics on the same box even if the data is in memory. Being in memory vs. on SSD isn’t a big difference in scan speeds for large data sets. MemSQL’s column store isn’t in memory anyway.
I agree with what Curt says about the geospatial stuff being the interesting part.
Running analytics and OLTP in the same box does require a very good level of resource management. At the same time, it is very valuable for customers, since it saves them one more system to support, data replication, etc.
John,
Zynga is a flagship Vertica customer. They then became a flagship MemSQL customer as well.
I’d also note that the architecture is quite different. MemSQL has some tables in the row store and others in the column store. It’s like Teradata in that regard, except that in Teradata one wouldn’t expect either part to be in-memory-only, and one wouldn’t be surprised at either part being on spinning disk, and one wouldn’t choose to update Teradata at OLTP speeds in any case.
Vertica, by way of contrast, has the same tables in both, with the row store being a kind of write cache for the column store.
I don’t understand how they can claim to support “sophisticated analytical SQL queries” if they don’t even support window functions (_very_ important for analytical queries), common table expressions or recursive queries.
Hans,
I agree that that claim should be counted against MemSQL’s marketing BS budget, for your reasons.
On the plus side, I have reasonable confidence in MemSQL when it comes to taking care of obvious roadmap items.