MemSQL 4.0
I talked with my clients at MemSQL about the release of MemSQL 4.0. Let’s start with the reminders:
- MemSQL started out as an in-memory OLTP (OnLine Transaction Processing) DBMS …
- … but quickly positioned with “We also do ‘real-time’ analytic processing” …
- … and backed that up by adding a flash-based column store option …
- … before Gartner ever got around to popularizing the term HTAP (Hybrid Transaction and Analytic Processing).
- There’s also a JSON option.
The main new aspects of MemSQL 4.0 are:
- Geospatial indexing. This is for me the most interesting part.
- A new optimizer and, I suppose, query planner …
- … which in particular allow for serious distributed joins.
- Some rather parallel-sounding connectors to Spark, Hadoop, and Amazon S3.
- Usual-suspect stuff including:
- More SQL coverage (I forgot to ask for details).
- Some added or enhanced administrative/tuning/whatever tools (again, I forgot to ask for details).
- Surely some general Bottleneck Whack-A-Mole.
There’s also a new free MemSQL “Community Edition”. MemSQL hopes you’ll experiment with this but not use it in production. And MemSQL pricing is now wholly based on RAM usage, so the column store is quasi-free from a licensing standpoint as well.
Before MemSQL 4.0, distributed joins were restricted to the easy cases:
- Two tables are distributed (i.e. sharded) on the same key.
- One table is small enough to be broadcast to each node.
Now arbitrary tables can be joined, with data reshuffling as needed; a sketch after the notes below illustrates the reshuffle-then-join idea. Notes on MemSQL 4.0 joins include:
- Join algorithms are currently nested-loop and hash, and in “narrow cases” also merge.
- MemSQL fondly believes that its in-memory indexes work very well for nested-loop joins.
- The new optimizer is fully cost-based (but I didn’t get much clarity as to the cost estimators for JSON).
- MemSQL’s indexing scheme, skip lists, had histograms anyway, with the cutesy name skiplistogram.
- MemSQL’s queries have always been compiled, and of course have to be planned before compilation. However, there’s a little bit of plan flexibility built in based on the specific values queried for, aka “parameter-sensitive plans” or “run-time plan choosing”.
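To make the reshuffle idea concrete, here is a minimal sketch in plain Python (my illustration, not MemSQL code): two tables sharded on different keys, so that neither easy case applies, are re-partitioned by hash of the join key, after which each partition can be joined locally with an ordinary hash join. Table names and contents are made up.

```python
# Illustrative reshuffle + local hash join; nothing here is MemSQL code.
from collections import defaultdict

def reshuffle(rows, key_index, num_partitions):
    """Re-partition rows by hash of the join key (the 'shuffle' step)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions

def local_hash_join(left_rows, right_rows):
    """Ordinary single-node hash join within one partition."""
    index = defaultdict(list)
    for l in left_rows:
        index[l[0]].append(l)
    return [(l, r) for r in right_rows for l in index[r[0]]]

# Two hypothetical "tables", originally sharded on different keys.
orders = [(1, "alice"), (2, "bob"), (3, "carol")]   # join key: column 0
payments = [(2, 19.99), (3, 5.00), (2, 7.50)]       # join key: column 0

P = 4  # pretend we have 4 nodes
left, right = reshuffle(orders, 0, P), reshuffle(payments, 0, P)
print([pair for p in range(P)
       for pair in local_hash_join(left[p], right[p])])
```

The expensive part in a real distributed system is of course the network transfer implied by reshuffle(), which is presumably why the co-sharded and broadcast cases were supported first.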
To understand the Spark/MemSQL connector, recall that MemSQL has “leaf” nodes, which store data, and “aggregator” nodes, which combine query results and ship them back to the requesting client. The Spark/MemSQL connector manages to skip the aggregation step, instead shipping data directly from the various MemSQL leaf nodes to a Spark cluster. In the other direction, a Spark RDD can be saved into MemSQL as a table. This is also somehow parallel, and can be configured either as a batch update or as an append; intermediate “conflict resolution” policies are possible as well.
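The parallelism is easy to picture even without knowing the connector’s actual API. Below is a hedged, PySpark-flavored sketch of the leaf-bypass idea, in which LEAF_HOSTS, read_leaf, and the clicks table are all hypothetical stand-ins; the one real fact it leans on is that MemSQL speaks the MySQL wire protocol, so any MySQL client library (here pymysql) can talk to a leaf.

```python
# Conceptual sketch only: give each Spark partition its own connection to
# one MemSQL leaf, so bulk data skips the aggregator. Not the real connector.
from pyspark import SparkContext
import pymysql  # works because MemSQL speaks the MySQL wire protocol

LEAF_HOSTS = ["leaf1:3306", "leaf2:3306", "leaf3:3306"]  # hypothetical

def read_leaf(host_port):
    host, port = host_port.split(":")
    conn = pymysql.connect(host=host, port=int(port), user="root", db="app")
    with conn.cursor() as cur:
        cur.execute("SELECT id, value FROM clicks")  # this leaf's shards only
        for row in cur:
            yield row
    conn.close()

sc = SparkContext()
# One Spark partition per leaf; each pulls its leaf's data directly.
rdd = sc.parallelize(LEAF_HOSTS, len(LEAF_HOSTS)).flatMap(read_leaf)
print(rdd.count())
```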
In other connectivity notes:
- MemSQL’s idea of a lambda architecture involves a Kafka stream, with data likely being stored twice (in Hadoop and MemSQL); a sketch of that dual write follows this list.
- MemSQL likes and supports the Spark DataFrame API, and says financial trading firms are already using it.
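As promised above, here is a minimal sketch of the dual-write flavor of a lambda architecture, assuming kafka-python and pymysql; the topic name, table, staging path, and host names are all hypothetical. Each event lands both in a batch file destined for Hadoop and in a MemSQL table for real-time queries.

```python
# Minimal lambda-style dual write: Kafka in, Hadoop staging file + MemSQL out.
# All names are hypothetical; error handling and batching are omitted.
from kafka import KafkaConsumer
import pymysql

consumer = KafkaConsumer("events", bootstrap_servers="kafka:9092")
memsql = pymysql.connect(host="memsql-agg", user="root", db="rt",
                         autocommit=True)

with open("/staging/events.log", "a") as batch_file:  # later bulk-loaded to HDFS
    for msg in consumer:
        event = msg.value.decode("utf-8")
        batch_file.write(event + "\n")            # batch layer (Hadoop side)
        with memsql.cursor() as cur:              # speed layer (MemSQL side)
            cur.execute("INSERT INTO events_rt (payload) VALUES (%s)", (event,))
```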
Other application areas cited for streaming/lambda kinds of architectures are — you guessed it! — ad-tech and “anomaly detection”.
And now to the geospatial stuff. I thought I heard:
- A “point” is actually a square region less than 1 mm per side.
- There are on the order of 2^30 such points on the surface of the Earth.
Given that the Earth’s surface area is a little over 500,000,000 square meters, I’d think 2^50 would be a better figure, but fortunately that discrepancy doesn’t matter to the rest of the discussion. (Edit: As per a comment below, that’s actually square kilometers, so unless I made further errors we’re up to the 2^70 range.)
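For the record, the arithmetic behind that, in a few lines of Python:

```python
import math

earth_surface_km2 = 5.1e8          # a little over 500,000,000 km^2
mm2_per_km2 = (10**6) ** 2         # 10^6 mm per km, squared
cells = earth_surface_km2 * mm2_per_km2   # 1 mm x 1 mm "points"
print(math.log2(cells))            # ~68.8, i.e. the 2^70 range
```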
Anyhow, if the two popular alternatives for geospatial indexing are R-trees or space-filling curves, MemSQL favors the latter. (One issue MemSQL sees with R-trees is concurrency.) Notes on space-filling curves start:
- In this context, a space-filling curve is a sequential numbering of points in a higher-dimensional space. (In MemSQL’s case, the dimension is two.)
- Hilbert curves seem to be in vogue, including at MemSQL.
- Nice properties of Hilbert space-filling curves include:
- Numbers near each other always correspond to points near each other.
- The converse is almost always true as well.*
- If you take a sequence of numbers that is simply the set of all possibilities with a particular prefix string, that will correspond to a square region. (The shorter the prefix, the larger the square.)
*You could say it’s true except in edge cases … but then you’d deserve to be punished.
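For the curious, here is one standard way to assign those numbers, adapted from the well-known grid-to-Hilbert-distance algorithm (this is textbook code, not MemSQL’s implementation):

```python
def xy2d(n, x, y):
    """Map point (x, y) on an n x n grid (n a power of 2) to its
    distance d along the Hilbert curve. Standard algorithm, as on
    the Hilbert curve's Wikipedia page."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so recursion sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# The "near numbers <-> near points" property in action: the four cells of
# the first 2x2 quadrant get the consecutive numbers 0, 1, 2, 3.
print([xy2d(8, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]])
```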
Given all that, my understanding of the way MemSQL indexes geospatial stuff — specifically points and polygons — is:
- Points have numbers assigned to them by the space-filling curve; those are indexed in MemSQL’s usual way. (Skip lists.)
- A polygon is represented by its vertices. Take the longest prefix their curve numbers share. That could be used to index the polygon (you’d retrieve a square region that includes it). But actually …
- … a polygon is covered by a union of such special square regions, and indexed accordingly, and I neglected to ask exactly how the covering set of squares was chosen.
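Here is a hedged sketch of the prefix-to-square idea (my reconstruction, not MemSQL code): on a 2^depth by 2^depth grid, Hilbert values have 2*depth bits, an even-length prefix denotes a square sub-region, and all cells in that square occupy one contiguous range of curve positions, i.e. one index range scan.

```python
def prefix_range(prefix_bits, prefix_len, depth):
    """Inclusive-exclusive (lo, hi) range of Hilbert values sharing a prefix.

    prefix_len must be even; each 2-bit step of the prefix descends one
    level of quadrants, so the prefix names a square.
    """
    assert prefix_len % 2 == 0
    shift = 2 * depth - prefix_len
    lo = prefix_bits << shift
    return lo, lo + (1 << shift)

# Depth 3 means an 8x8 grid with 64 cells; the 2-bit prefix 0b01 names one
# quadrant, i.e. curve positions 16 through 31.
print(prefix_range(0b01, 2, 3))  # (16, 32)
```

Indexing a polygon as a union of such squares thus turns a geospatial lookup into a union of ordinary range scans, which is exactly the kind of access a skip-list index handles well.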
As for company metrics — MemSQL cites >50 customers and >60 employees.
Related links
- I’ve posted about earlier versions of MemSQL technology, e.g. in May 2014, April 2013, and June 2012.
Comments
Curt,
small correction: “… the Earth’s surface area is a little over 500,000,000 square _kilometers_ …”
Regards,
Heiko
Thanks, Heiko! Corrected above.
I believe that “in memory” data can be efficient for both OLTP and OLAP, something disk cannot give us… Having said that, it is a challenge to build a reliable system which allows data mutation.
Vertica already has an in-memory row store and a disk-based column store, so this doesn’t seem like anything new. Most OLTP workloads are still not outgrowing current architectures as much as analytical workloads did. I suspect that MemSQL use cases are more analytical.
I am also a bit skeptical about running OLTP and analytics on the same box even if the data is in memory. Being in memory vs. on SSD isn’t a big difference in scan speeds for large data sets. MemSQL’s column store isn’t in memory anyway.
I agree with what Curt says about the geospatial stuff being the interesting part.
Running analytics and OLTP in the same box does require a very good level of resource management. At the same time, it is very valuable for customers, since it saves them one more system to support, data replication, etc.
John,
Zynga is a flagship Vertica customer. They then became a flagship MemSQL customer as well.
I’d also note that the architecture is quite different. MemSQL has some tables in the row store and others in the column store. It’s like Teradata in that regard, except that in Teradata one wouldn’t expect either part to be in-memory-only, and one wouldn’t be surprised at either part being on spinning disk, and one wouldn’t choose to update Teradata at OLTP speeds in any case.
Vertica, by way of contrast, has the same tables in both, with the row store being a kind of write cache for the column store.
I don’t understand how they can claim to support “sophisticated analytical SQL queries” if they don’t even support window functions (_very_ important for analytical queries), common table expressions or recursive queries.
Hans,
I agree that that claim should be counted against MemSQL’s marketing BS budget, for your reasons.
On the plus side, I have reasonable confidence in MemSQL when it comes to taking care of obvious roadmap items.